Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Add KLL Datasketch and Hive ColumnStatisticsObj as standard blo… #8202

Closed
wants to merge 5 commits into from

Conversation

simhadri-g
Copy link
Member

…b types to puffin file

Issue:
#8198

Hi Everyone,

Hive now supports writing column statistics to puffin files.

The statistics calculated by Hive include histograms, NDV (Number of Distinct Values), Min and Max values, the number of nulls, the number of true values, column name, and column type. You can find the full list of supported stats here: Link to GitHub.

These statistics are stored as a Hive columnStatistics object, which is serialized and saved as a blob in puffin. You can refer to the code here for more information: Link to GitHub.

Currently, this object is supported by Hive and partially by Impala as well: Link to GitHub. We also plan to incorporate the KLL datasketch for histograms.

As a result, we are looking to add columnStatistics object and KLL datasketch as standard blob types for the puffin file. Link to GitHub

Any feedback would be greatly appreciated.

Thanks!

@github-actions github-actions bot added the core label Aug 1, 2023
@nastra nastra requested a review from findepi August 1, 2023 11:31
@simhadri-g
Copy link
Member Author

Thanks a lot for the review! :)

@simhadri-g
Copy link
Member Author

@findepi I have addressed the review comments.

Can you please have a look at the PR when you are free?
Thanks in advance! :)

@simhadri-g simhadri-g requested a review from findepi August 3, 2023 13:51
@simhadri-g
Copy link
Member Author

Hello @findepi @nastra ,
I would greatly appreciate it if you could find some time to review the pull request.
Thanks!

@simhadri-g
Copy link
Member Author

Thanks for the review :)
I will update the PR and get back.

@simhadri-g
Copy link
Member Author

@nastra , I have updated the PR and moved the iceberg-docs(apache/iceberg-docs#269) changes to the main repo.

Please have a look when you are free.

Thanks again for the review!

format/puffin-spec.md Outdated Show resolved Hide resolved
Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @findepi could you review this as well please?

@simhadri-g
Copy link
Member Author

thanks @nastra for the review! :)

@findepi Could you please have a look as well?
Thanks!

@ZacBlanco
Copy link

Any progress update here? It would be great to get the blessing from the Iceberg community for these as supported puffin blob types

@bitsondatadev
Copy link
Collaborator

@findepi @danielcweeks could one of you PTAL this? Don't want this effort to get lost.

@bitsondatadev
Copy link
Collaborator

@simhadri-g update I'm following up on this today.

@simhadri-g
Copy link
Member Author

thanks @bitsondatadev !

@zhangbutao
Copy link
Contributor

Any update? hope we can continue to push this review forward.
Thanks.

@@ -126,6 +126,23 @@ The blob metadata for this blob may include following properties:

- `ndv`: estimate of number of distinct values, derived from the sketch.

#### `hive-column-statistics-obj` blob type

A serialized form of Hive ColumnStatsObject.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is what's referenced here the Thrift ColumnStatisticsObj in the Hive IDL? https://github.com/apache/hive/blob/ffb1165f59defa66b31b4fd9cb6367b71050071b/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L583

If so, I'd recommend correcting the name, and linking to the Thrift IDL, and explicitly calling out that this is Thrift-serialized.

I'm also wondering if we need to think about versioning. If this is based on the Thrift IDL, I am not sure if those are intended to be persisted. At the very least, I am concerned if Hive decides to introduce a backwards-incompatible field to this struct, some engines begin to serialize with this newly introduced backwards incompatible field, and other engines begin to attempt to deserialize it with an older IDL, then it will break in the Iceberg library.

Please let me know if I'm misunderstanding anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
In hive , we writes the statistics to HMS in addition to puffin files.
So the Thrift ColumnStatisticsObj is used to write the statistics to HMS.

hive-column-statistics-obj referred here is a serialized using the org.apache.commons.lang3.SerializationUtils and stored into puffin file.

@simhadri-g
Copy link
Member Author

Hi everyone,
I would be most grateful if we could get help with reviewing this.
Thanks!

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 10, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants