Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column indexes to Parquet writer #11302

Merged
merged 85 commits into from
Jul 26, 2022

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jul 19, 2022

Closes #9268.

The column indexes are actually two different structures. The column index itself which is essentially per-page min/max statistics, and the offset index which stores each page's location, compressed size, and first row index. Since the column index contains information already in the EncColumnChunk structure, I calculate and encode the column index per chunk on device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file before the footer.

The current implementation does not include truncation of the statistics as recommended. This will be addressed in a later PR.

etseidl and others added 30 commits June 29, 2022 16:46
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@etseidl
Copy link
Contributor Author

etseidl commented Jul 22, 2022

@nvdbaranec and @hyperbolic2346 , thanks for the comments! Keep 'em coming!

I think I've addressed most of them, but @nvdbaranec could you expand on the comment about calling functions with side effects from CUDF_EXPECTS. I'm happy to change, but just wondering what the downside is.

@hyperbolic2346
Copy link
Contributor

hyperbolic2346 commented Jul 22, 2022

could you expand on the comment about calling functions with side effects from CUDF_EXPECTS. I'm happy to change, but just wondering what the downside is.

If I can speak for him, since he is probably quit for the day, this is a sticky point in general related to macros. In past lives, we had macros like ASSERT(condition, reason), and people would put statements with side-effects into the condition. This worked until ASSERT was compiled to nothing in release builds and then suddenly things stopped working until you tried to debug it.

I don't think we actually have a problem here, but my eye twitches when I see it as well.

@etseidl
Copy link
Contributor Author

etseidl commented Jul 22, 2022

fair enough. don't want to cause any stress. I'll start reworking those tomorrow. It's beer-o-clock now :)

@etseidl
Copy link
Contributor Author

etseidl commented Jul 22, 2022

I think I'm done with this round of fixes. Question: are there data types that don't support statistics at all (statistics_chunk.has_minmax is 0)? I don't have any tests for that if there are.

@etseidl etseidl requested a review from nvdbaranec July 24, 2022 17:30
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
Co-authored-by: Yunsong Wang <wangyunsong89@gmail.com>
@hyperbolic2346
Copy link
Contributor

I think I'm done with this round of fixes. Question: are there data types that don't support statistics at all (statistics_chunk.has_minmax is 0)? I don't have any tests for that if there are.

I don't know of any, but I'm not an expert here. :)

@hyperbolic2346
Copy link
Contributor

These changes are a huge step in a great direction. I appreciate you taking the time to implement our suggestions.

@vuule
Copy link
Contributor

vuule commented Jul 26, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 96f747b into rapidsai:branch-22.08 Jul 26, 2022
@etseidl etseidl deleted the feature/colidx branch July 28, 2022 19:53
rapids-bot bot pushed a commit that referenced this pull request Sep 1, 2022
#11302 added `STATISTICS_COLUMN` to the `statistics_freq` enum in libcudf.  This adds the same to python.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11453
rapids-bot bot pushed a commit that referenced this pull request Nov 2, 2022
Adds `statistics_freq::STATISTICS_COLUMN` to list of parquet writer options to benchmark.  This should have been included in #11302.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #11955
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Parquet writer to include Column Index feature
9 participants