Core: Add KLL Datasketch and Hive ColumnStatisticsObj as standard blo… #8202

simhadri-g · 2023-08-01T10:21:18Z

…b types to puffin file

Hi Everyone,

Hive now supports writing column statistics to puffin files.

The statistics calculated by Hive include histograms, NDV (Number of Distinct Values), Min and Max values, the number of nulls, the number of true values, column name, and column type. You can find the full list of supported stats here: Link to GitHub.

We are updating the description of this PR to request incorporating the KLL datasketch for histograms.

As a result, we are looking to add the KLL datasketch as a standard blob type for Puffin files. Link to GitHub

Any feedback would be greatly appreciated.

Thanks!

core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java

simhadri-g · 2023-08-01T16:54:24Z

Thanks a lot for the review! :)

simhadri-g · 2023-08-03T13:50:29Z

@findepi I have addressed the review comments.

Can you please have a look at the PR when you are free?
Thanks in advance! :)

simhadri-g · 2023-08-16T07:54:13Z

Hello @findepi @nastra ,
I would greatly appreciate it if you could find some time to review the pull request.
Thanks!

core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java

simhadri-g · 2023-08-17T09:11:16Z

Thanks for the review :)
I will update the PR and get back.

simhadri-g · 2023-08-17T09:38:07Z

@nastra, I have updated the PR and moved the iceberg-docs (apache/iceberg-docs#269) changes to the main repo.

Please have a look when you are free.

Thanks again for the review!

format/puffin-spec.md

nastra

LGTM, @findepi could you review this as well please?

simhadri-g · 2023-08-18T19:10:03Z

thanks @nastra for the review! :)

@findepi Could you please have a look as well?
Thanks!

ZacBlanco · 2023-12-26T16:54:31Z

Any progress update here? It would be great to get the blessing from the Iceberg community for these as supported puffin blob types

bitsondatadev · 2023-12-29T14:16:02Z

@findepi @danielcweeks could one of you PTAL this? Don't want this effort to get lost.

bitsondatadev · 2024-02-01T11:48:13Z

@simhadri-g update: I'm following up on this today.

simhadri-g · 2024-02-01T11:49:27Z

thanks @bitsondatadev !

zhangbutao · 2024-03-12T14:12:12Z

Any update? I hope we can continue to push this review forward.
Thanks.

tdcmeehan · 2024-03-12T14:26:41Z

format/puffin-spec.md

@@ -126,6 +126,23 @@ The blob metadata for this blob may include following properties:

 - `ndv`: estimate of number of distinct values, derived from the sketch.

+#### `hive-column-statistics-obj` blob type
+
+A serialized form of Hive ColumnStatsObject.


Is what's referenced here the Thrift ColumnStatisticsObj in the Hive IDL? https://github.com/apache/hive/blob/ffb1165f59defa66b31b4fd9cb6367b71050071b/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L583

If so, I'd recommend correcting the name, and linking to the Thrift IDL, and explicitly calling out that this is Thrift-serialized.

I'm also wondering if we need to think about versioning. If this is based on the Thrift IDL, I am not sure those structs are intended to be persisted. At the very least, I am concerned that if Hive introduces a backwards-incompatible field to this struct, and some engines begin to serialize with that field while other engines attempt to deserialize it with an older IDL, it will break in the Iceberg library.

Please let me know if I'm misunderstanding anything.

Hi,
In Hive, we write the statistics to HMS in addition to Puffin files.
So the Thrift ColumnStatisticsObj is used to write the statistics to HMS.

The hive-column-statistics-obj referred to here is serialized using org.apache.commons.lang3.SerializationUtils and stored in the Puffin file.
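
To make the serialization format concrete, here is a minimal sketch of the round trip. It assumes standard Java serialization via `java.io.ObjectOutputStream`, which is what `SerializationUtils.serialize` uses under the hood. `ColumnStatsExample` and its fields are hypothetical stand-ins, not the real Hive class, and `JavaSerializationSketch` is an illustrative name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a column statistics object; field names are illustrative only.
class ColumnStatsExample implements Serializable {
  private static final long serialVersionUID = 1L;

  final String columnName;
  final String columnType;
  final long numNulls;
  final long ndv;

  ColumnStatsExample(String columnName, String columnType, long numNulls, long ndv) {
    this.columnName = columnName;
    this.columnType = columnType;
    this.numNulls = numNulls;
    this.ndv = ndv;
  }
}

class JavaSerializationSketch {
  // Serializes the object with standard Java serialization.
  static byte[] serialize(Serializable obj) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads the object back; this needs the same class on the reader's classpath.
  static Object deserialize(byte[] blob) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
      return in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    byte[] blob = serialize(new ColumnStatsExample("id", "bigint", 3L, 42L));
    // Java serialization streams begin with the magic bytes 0xACED.
    System.out.printf("header: %02X%02X%n", blob[0] & 0xFF, blob[1] & 0xFF);
    ColumnStatsExample copy = (ColumnStatsExample) deserialize(blob);
    System.out.println(copy.columnName + " ndv=" + copy.ndv);
  }
}
```

Note that a blob written this way can only be read by a JVM that has a compatible version of the serialized class available, which is relevant to the versioning question above.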

simhadri-g · 2024-06-28T11:58:57Z

Hi everyone,
I would be most grateful if we could get help with reviewing this.
Thanks!

github-actions · 2024-09-10T00:14:50Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-09-18T00:14:30Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

deniskuzZ · 2025-02-03T14:30:31Z

hi @simhadri-g, would you mind reopening this PR?

simhadri-g · 2025-02-04T07:30:09Z

Hi Denys!
I think I don't have the GitHub permissions to reopen a closed PR in the Iceberg repo.
Should I raise a new PR?

deniskuzZ · 2025-02-04T07:34:40Z

Hi Denys! I think I don't have the GitHub permissions to reopen a closed PR in the Iceberg repo. Should I raise a new PR?

hey @simhadri-g, if you'd like to pursue this I would recommend opening a new one, but split into 2 parts: Hive ColumnStatistics and KLL Datasketch

deniskuzZ · 2025-02-04T07:38:48Z

thanks @nastra! It would really help if we could get this in.

simhadri-g · 2025-02-04T07:41:16Z

thanks @nastra and @deniskuzZ !

I will resolve the merge conflicts and update this PR. :)


nastra · 2025-02-04T11:20:03Z

@findepi Could you please have a look as well?

nastra · 2025-02-04T11:21:51Z

format/puffin-spec.md

@@ -181,6 +181,23 @@ for Puffin v1.
 [roaring-bitmap-portable-serialization]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations
 [roaring-bitmap-general-layout]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout

+#### `hive-column-statistics-obj` blob type


just FYI that adding changes to the Spec now requires a VOTE on the Dev mailing list in order to increase visibility for spec changes

Should I start a [DISCUSS] thread first or go ahead with a [VOTE]?

you can also start with a DISCUSS thread first in order to have a short discussion on the introduced changes

rdblue · 2025-02-04T17:37:51Z

format/puffin-spec.md

+
+The ColumnStatsObject supports Histograms, NDV, Min and Max values, Number of nulls, Number of trues, column name, type.
+A full list of supported statistics is listed in the table here:
+[ColumnStatistics](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics)


I don't see much value in adding this.

NDV sketches are already supported using Theta sketches so this would duplicate the purpose of an existing sketch. Is there sufficient value to justify the difference? This doesn't provide enough context to tell.

In addition, the Iceberg manifest format already covers value count, lower bounds, upper bounds, number of nulls, and number of NaNs. The partition statistics files provide a way to aggregate those beyond the file level, and we use snapshot summaries for table-level stats. I'm not sure what value this would provide.
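
As a rough illustration of how file-level column metrics can be rolled up beyond the file level, here is a minimal sketch. `FileColumnMetrics` and `MetricsRollupSketch` are hypothetical names and do not correspond to Iceberg classes, and bounds are simplified to `long` values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical per-file metrics for one column; not an Iceberg class.
final class FileColumnMetrics {
  final long valueCount;
  final long nullCount;
  final long lowerBound;
  final long upperBound;

  FileColumnMetrics(long valueCount, long nullCount, long lowerBound, long upperBound) {
    this.valueCount = valueCount;
    this.nullCount = nullCount;
    this.lowerBound = lowerBound;
    this.upperBound = upperBound;
  }
}

class MetricsRollupSketch {
  // Counts are summed; bounds are combined with min and max.
  static FileColumnMetrics rollUp(List<FileColumnMetrics> files) {
    if (files.isEmpty()) {
      throw new IllegalArgumentException("at least one file is required");
    }
    long values = 0;
    long nulls = 0;
    long lower = Long.MAX_VALUE;
    long upper = Long.MIN_VALUE;
    for (FileColumnMetrics file : files) {
      values += file.valueCount;
      nulls += file.nullCount;
      lower = Math.min(lower, file.lowerBound);
      upper = Math.max(upper, file.upperBound);
    }
    return new FileColumnMetrics(values, nulls, lower, upper);
  }

  public static void main(String[] args) {
    FileColumnMetrics total = rollUp(Arrays.asList(
        new FileColumnMetrics(100, 5, 10, 50),
        new FileColumnMetrics(200, 0, 3, 40)));
    System.out.println(total.valueCount + " values, " + total.nullCount + " nulls, ["
        + total.lowerBound + ", " + total.upperBound + "]");
  }
}
```

Counts and bounds combine this way because they are additive or order-based. Distinct-value counts are not additive across files, which is why NDV is carried as a mergeable sketch instead.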

hi @rdblue,
thanks for checking this PR!

The partition statistics files provide a way to aggregate those beyond the file level
Does Iceberg provide built-in support for getting aggregated column stats? I mean, is there some library or service that generates partition files with aggregated column stats?
AFAIK we only do this for basic stats: #11216

If yes, could you please point me to the code where that is done? I had the impression that, of the column stats, only NDV is calculated and stored in partition files.

How about:

- bitvectors: used to improve stats estimations for the IN operator
- histogram: histogram statistics, which are particularly useful for skewed data and range predicates (KLL data sketches)
- numTrue/numFalse
- avgColLen

Wouldn't it make sense to store, in a single Puffin file, an aggregated partition column stats object per table snapshot with all the values?

hi @rdblue,
I'm sorry, it seems we had pretty limited knowledge in that area and now I think we finally get your point.
I've drafted a small doc with the proposal and our intent: https://docs.google.com/document/d/11Rp-irqb4L4Qpdxr6l83bA4IRsfw3AAyR8wokNe1r80/edit?usp=sharing
Could you please take a quick look and let us know whether that is a valid proposal?
Thank you!

rdblue · 2025-02-04T17:39:19Z

format/puffin-spec.md

+Apache-Datasketches-KLL-sketch is an implementation of a very compact quantiles
+sketch with lazy compaction scheme and nearly optimal accuracy per bit.
+
+Histograms are derived from this sketch.


This isn't much to go on. How is the sketch calculated? Does this support a single column or multiple columns?

per column: KllSketchUDF(columnName, columnStatsType)
more details: https://issues.apache.org/jira/browse/HIVE-26221
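
To illustrate how a histogram can be derived from quantiles of a single column, here is a minimal sketch. It substitutes exact quantiles over a sorted array for the KLL sketch, so it shows only the derivation step and not the sketch itself or Hive's actual implementation. `EquiHeightHistogramSketch` and its method names are hypothetical.

```java
import java.util.Arrays;

class EquiHeightHistogramSketch {
  // Returns buckets + 1 boundaries taken at evenly spaced ranks, so that each
  // bucket covers roughly the same number of values (an equi-height histogram).
  static double[] boundaries(double[] values, int buckets) {
    if (values.length == 0 || buckets < 1) {
      throw new IllegalArgumentException("need at least one value and one bucket");
    }
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    double[] result = new double[buckets + 1];
    for (int i = 0; i <= buckets; i++) {
      int index = (int) ((long) i * (sorted.length - 1) / buckets);
      result[i] = sorted[index];
    }
    return result;
  }

  // Estimates the fraction of values <= x, interpolating linearly within a bucket.
  static double estimateFractionAtMost(double[] bounds, double x) {
    int buckets = bounds.length - 1;
    if (x < bounds[0]) {
      return 0.0;
    }
    if (x >= bounds[buckets]) {
      return 1.0;
    }
    for (int i = 0; i < buckets; i++) {
      if (x < bounds[i + 1]) {
        double within = (x - bounds[i]) / (bounds[i + 1] - bounds[i]);
        return (i + within) / buckets;
      }
    }
    return 1.0; // not reached, because x < bounds[buckets] is handled by the loop
  }

  public static void main(String[] args) {
    double[] values = new double[100];
    for (int i = 0; i < values.length; i++) {
      values[i] = i + 1;
    }
    double[] bounds = boundaries(values, 4);
    System.out.println(Arrays.toString(bounds));
    System.out.println(estimateFractionAtMost(bounds, 50.0));
  }
}
```

A quantile sketch such as KLL would answer the same rank queries approximately from a compact summary, without keeping or sorting all values.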

danielcweeks · 2025-02-04T23:37:34Z

I would agree with @rdblue's comment that there's a lot of stats overlap with what exists elsewhere and I'm not convinced this serialized format is standardized well enough to be used easily outside of the Hive/Impala projects.

Could we identify the specific stats that we feel are valuable and focus on those independently of what is captured elsewhere?

deniskuzZ · 2025-02-05T13:01:57Z

I would agree with @rdblue's comment that there's a lot of stats overlap with what exists elsewhere and I'm not convinced this serialized format is standardized well enough to be used easily outside of the Hive/Impala projects.

Could we identify the specific stats that we feel are valuable and focus on those independently of what is captured elsewhere?

Hi @danielcweeks,
Thanks for your reply. I think we did not communicate our proposal very clearly, and we had limited knowledge of that area.
I've drafted a doc with the proposal: https://docs.google.com/document/d/11Rp-irqb4L4Qpdxr6l83bA4IRsfw3AAyR8wokNe1r80/edit?usp=sharing
Hopefully it will help explain our intent. Could you please take a look?
Thank you!

github-actions bot added the core label Aug 1, 2023

nastra requested a review from findepi August 1, 2023 11:31

findepi reviewed Aug 1, 2023

simhadri-g mentioned this pull request Aug 2, 2023

Add KLL Datasketch and Hive ColumnStatisticsObj as standard blob type… apache/iceberg-docs#269

Closed

simhadri-g requested a review from findepi August 3, 2023 13:51

nastra reviewed Aug 16, 2023

core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java

nastra reviewed Aug 16, 2023

core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java

nastra reviewed Aug 16, 2023

core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java

nastra reviewed Aug 17, 2023

format/puffin-spec.md

nastra approved these changes Aug 17, 2023

tdcmeehan reviewed Mar 12, 2024

findepi mentioned this pull request Jun 14, 2024

Spark Action to Analyze table #10288

Merged

github-actions bot added the stale label Sep 10, 2024

github-actions bot closed this Sep 18, 2024

nastra reopened this Feb 4, 2025

github-actions bot added the Specification label Feb 4, 2025

nastra added the not-stale label and removed the Specification and stale labels Feb 4, 2025

simhadri-g and others added 5 commits February 4, 2025 08:45

Core: Add KLL Datasketch and Hive ColumnStatisticsObj as standard blob types to puffin file

69df56f

Addressed review comments

abf4509

Addressed review comments and moved the doc update from iceberg-doc to the main iceberg/format

e8f311f

Update a typo in format/puffin-spec.md

7f0c727

Co-authored-by: Eduard Tudenhoefner <[email protected]>

Fix checkstyle error flagged by :iceberg-core:spotlessJavaCheck

a4fdf6c

deniskuzZ force-pushed the puffin-blob-type branch from bec46a7 to a4fdf6c Compare February 4, 2025 09:01

github-actions bot added the Specification label Feb 4, 2025

nastra reviewed Feb 4, 2025

rdblue reviewed Feb 4, 2025

removed Hive ColumnStatsObject

c0bba5d
