Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify Parquet writer to produce column indexes #11171

Closed
wants to merge 137 commits into from

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jun 29, 2022

This PR addresses #9268 and closes #11038. It's still a work in progress and I would like feedback about my approach.

The column indexes are actually two different structures. The column index itself which is essentially per-page min/max statistics, and the offset index which stores each page's location, compressed size, and first row index. Since the column index contains information already in the EncColumnChunk structure, I calculate and encode the column index per chunk on device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file before the footer.

As part of this work, I've also included truncation of the statistics values, as recommended by the Parquet format. I've added a parameter column_index_truncate_length to the writer options/builder. It currently defaults to 64, which is the default used by parquet-mr. The truncation code required the addition of some UTF8 helper functions, some of which may no longer be needed after #11160 is merged.

Also, while writing the code to determine the sort order of the pages, I found that decimal128 statistics were not handled correctly (although the actual values were written properly). There are changes to statistics_type_identification.cuh, temp_storage_wrapper.cuh, and typed_statistics_chunk.cuh to address this. I'm not sure what impact these changes would have on the ORC writer.

I've added some python and java bindings, but haven't completed them yet. I've added the unit tests I could think of, but welcome suggestions for further tests.

Sorry for the huge PR, but it's due to circumstances beyond my control that I couldn't submit this as several PRs. :(

etseidl and others added 30 commits May 27, 2022 13:53
  pass stats to gatherPages
  formatting
this should make additional ColumnIndex and OffsetIndex tests easier
to write
implement readers for ColumnIndex, OffsetIndex, and PageLocation
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@rapids-bot
Copy link

rapids-bot bot commented Jun 29, 2022

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added Java Affects Java cuDF API. Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jun 29, 2022
rapids-bot bot pushed a commit that referenced this pull request Jul 4, 2022
Adds some necessary structs to parquet.hpp as well as methods to CompactProtocolReader/Writer to address #9268

I can add tests if necessary once #11177 is merged, or testing can be deferred to be included in a future PR (based on #11171)

Authors:
  - https://github.com/etseidl

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11178
@etseidl etseidl closed this Jul 8, 2022
@etseidl
Copy link
Contributor Author

etseidl commented Jul 8, 2022

Closing for now. Will resubmit slimmer PR once #11179 is merged.

@etseidl etseidl deleted the feature/column_index branch August 3, 2022 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Improve parquet CheckPageRows test
3 participants