Modify Parquet writer to produce column indexes #11171
Closed
Conversation
pass stats to gatherPages formatting
this should make additional ColumnIndex and OffsetIndex tests easier to write
implement readers for ColumnIndex, OffsetIndex, and PageLocation
truncate is broken for 2- and 4-byte values; UTF-8 now maxes at 4 bytes, so fix that as well
github-actions bot added the Java (Affects Java cuDF API), Python (Affects Python cuDF API), and libcudf (Affects libcudf (C++/CUDA) code) labels on Jun 29, 2022
This was referenced Jun 30, 2022
rapids-bot pushed a commit that referenced this pull request on Jul 4, 2022:

Adds some necessary structs to parquet.hpp, as well as methods to CompactProtocolReader/Writer, to address #9268. I can add tests if necessary once #11177 is merged, or testing can be deferred to a future PR (based on #11171).
Authors:
- https://github.com/etseidl
Approvers:
- Devavret Makkar (https://github.com/devavret)
- Yunsong Wang (https://github.com/PointKernel)
URL: #11178
Closing for now. Will resubmit a slimmer PR once #11179 is merged.
This PR addresses #9268 and closes #11038. It is still a work in progress, and I would like feedback on my approach.
The column indexes are actually two different structures: the column index proper, which is essentially per-page min/max statistics, and the offset index, which stores each page's location, compressed size, and first row index. Since the column index contains information already present in the EncColumnChunk structure, I calculate and encode the column index per chunk on the device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information that is available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file just before the footer.
As part of this work, I've also included truncation of the statistics values, as recommended by the Parquet format. I've added a parameter, column_index_truncate_length, to the writer options/builder. It currently defaults to 64, which is the default used by parquet-mr. The truncation code required the addition of some UTF-8 helper functions, some of which may no longer be needed after #11160 is merged.

Also, while writing the code to determine the sort order of the pages, I found that decimal128 statistics were not handled correctly (although the actual values were written properly). There are changes to statistics_type_identification.cuh, temp_storage_wrapper.cuh, and typed_statistics_chunk.cuh to address this. I'm not sure what impact these changes would have on the ORC writer.
I've added some Python and Java bindings, but haven't completed them yet. I've added the unit tests I could think of, but welcome suggestions for further tests.
Sorry for the huge PR, but it's due to circumstances beyond my control that I couldn't submit this as several PRs. :(