-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Add KLL Datasketch and Hive ColumnStatisticsObj as standard blo… #8202
base: main
Are you sure you want to change the base?
Changes from 5 commits
69df56f
abf4509
e8f311f
7f0c727
a4fdf6c
c0bba5d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -181,6 +181,23 @@ for Puffin v1. | |
[roaring-bitmap-portable-serialization]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations | ||
[roaring-bitmap-general-layout]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout | ||
|
||
#### `hive-column-statistics-obj` blob type | ||
|
||
A serialized form of Hive ColumnStatsObject. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is what's referenced here the Thrift If so, I'd recommend correcting the name, and linking to the Thrift IDL, and explicitly calling out that this is Thrift-serialized. I'm also wondering if we need to think about versioning. If this is based on the Thrift IDL, I am not sure if those are intended to be persisted. At the very least, I am concerned if Hive decides to introduce a backwards-incompatible field to this struct, some engines begin to serialize with this newly introduced backwards incompatible field, and other engines begin to attempt to deserialize it with an older IDL, then it will break in the Iceberg library. Please let me know if I'm misunderstanding anything. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi, hive-column-statistics-obj referred here is a serialized using the org.apache.commons.lang3.SerializationUtils and stored into puffin file. |
||
|
||
The ColumnStatsObject supports Histograms, NDV, Min and Max values, Number of nulls, Number of trues, column name, type. | ||
A full list of supported statistics is listed in the table here: | ||
[ColumnStatistics](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see much value in adding this. NDV sketches are already supported using Theta sketches so this would duplicate the purpose of an existing sketch. Is there sufficient value to justify the difference? This doesn't provide enough context to tell. In addition, the Iceberg manifest format already covers value count, lower bounds, upper bounds, number of nulls, and number of NaNs. The partition statistics files provide a way to aggregate those beyond the file level, and we use snapshot summaries for table-level stats. I'm not sure what value this would provide. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hi @rdblue,
If yes, could you please point me to the code where is that done? I had an impression that from colstats only NDV is calculated and stored in partition files. How about:
Wouldn't it make sense to store in a single puffin file an aggregated partition column stats object per table snapshot with all the values? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hi @rdblue, |
||
|
||
#### `apache-datasketches-KLL-sketch` blob type | ||
|
||
A serialized form of a "compact" KLL-sketch produced by the [Apache | ||
DataSketches](https://datasketches.apache.org/) library. | ||
Apache-Datasketches-KLL-sketch is an implementation of a very compact quantiles | ||
sketch with lazy compaction scheme and nearly optimal accuracy per bit. | ||
|
||
Histograms are derived from this sketch. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This isn't much to go on. How is the sketch calculated? Does this support a single column or multiple columns? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. per column: |
||
|
||
### Compression codecs | ||
|
||
The data can also be uncompressed. If it is compressed the codec should be one of | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just FYI that adding changes to the Spec now requires a VOTE on the Dev mailing list in order to increase visibility for spec changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I start [DISCUSS] thread first or go ahead with [VOTE]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also start with a DISCUSS thread first in order to have a short discussion on the introduced changes