Handle parquet corner case: Columns with more rows than are in the row group. #11353
…han are in their corresponding row group.
Codecov Report
@@ Coverage Diff @@
## branch-22.08 #11353 +/- ##
===============================================
Coverage ? 86.43%
===============================================
Files ? 143
Lines ? 22777
Branches ? 0
===============================================
Hits ? 19687
Misses ? 3090
Partials ? 0
👍
I'm not sure how we'd add a test for this case. It seems incorrect for a column to have more rows than its row group.
I have the same concern. We may need a unit test for the failing case, if that is possible.
@gpucibot merge
There is a particularly odd corner case that can be constructed in which a column in a parquet file has more rows than its associated row group specifies. Previously we handled this inadvertently; however, the following optimization broke that support:
#11252
The solution is to cap the size (in rows) of any non-list-child column to the size of the selected row groups.
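As a rough illustration of the capping rule described above, here is a minimal Python sketch. It is not the code from this PR: `ColumnChunk`, `cap_column_sizes`, and the example column names are hypothetical, and the sketch assumes each column chunk simply reports a row count plus a flag saying whether it is the child of a list column. List children are left uncapped because their element count can legitimately exceed the number of rows.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for per-column metadata; not a cuDF type.
    name: str
    num_rows: int        # rows the column chunk claims to contain
    is_list_child: bool  # True for the element column nested under a list


def cap_column_sizes(chunks, row_group_rows):
    """Return a per-column row count, capped at row_group_rows
    for every column that is not a list child."""
    capped = {}
    for chunk in chunks:
        if chunk.is_list_child:
            # Element counts of list children are independent of the row count.
            capped[chunk.name] = chunk.num_rows
        else:
            capped[chunk.name] = min(chunk.num_rows, row_group_rows)
    return capped


# Example: "id" claims 120 rows, but the selected row group only has 100.
chunks = [
    ColumnChunk("id", 120, False),
    ColumnChunk("tags", 100, False),
    ColumnChunk("tags.element", 340, True),
]
sizes = cap_column_sizes(chunks, row_group_rows=100)
```

In this sketch only the oversized `id` column is trimmed; `tags.element` keeps its full element count.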
Leaving this as a draft while the changes percolate through the spark tests.