Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle parquet corner case: Columns with more rows than are in the row group. #11353

Merged

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Jul 26, 2022

There is a particularly odd corner case that can be constructed where a column in a parquet file has more rows in it than the associated row group specifies. Previously we were inadvertently handling this, however this optimization broke that support:

#11252

The solution is to cap the size of any non-list-child columns to the size of the selected row groups.

Leaving this as a draft while the changes percolate through the spark tests.

@nvdbaranec nvdbaranec added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Jul 26, 2022
@nvdbaranec nvdbaranec requested a review from a team as a code owner July 26, 2022 16:35
@nvdbaranec nvdbaranec marked this pull request as draft July 26, 2022 16:54
@codecov
Copy link

codecov bot commented Jul 26, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@8faf2b0). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 8f234a5 differs from pull request most recent head b990524. Consider uploading reports for the commit b990524 to get more accurate results

@@               Coverage Diff               @@
##             branch-22.08   #11353   +/-   ##
===============================================
  Coverage                ?   86.43%           
===============================================
  Files                   ?      143           
  Lines                   ?    22777           
  Branches                ?        0           
===============================================
  Hits                    ?    19687           
  Misses                  ?     3090           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8faf2b0...b990524. Read the comment docs.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

I'm not sure how we'd add a test for this case. It seems incorrect that a column have more rows than the row-group.

cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
@nvdbaranec nvdbaranec changed the title Handle parquet corner case: Columns with more rows than are in the page group. Handle parquet corner case: Columns with more rows than are in the row group. Jul 27, 2022
@ttnghia
Copy link
Contributor

ttnghia commented Jul 27, 2022

I'm not sure how we'd add a test for this case.

I have the same concern. We may need to have a unit test for the failed case if that is possible.

@nvdbaranec nvdbaranec requested a review from ttnghia July 27, 2022 20:51
@nvdbaranec
Copy link
Contributor Author

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants