Fix parsing of Parquet legacy list-of-struct format #9380

Merged: jlowe merged 1 commit into NVIDIA:branch-23.12 from the fix-parquet-repeated-no-annotation branch on Oct 13, 2023

Conversation

jlowe (Member) commented Oct 4, 2023

Fixes #8631. Depends on rapidsai/cudf#13715 and NVIDIA/spark-rapids-jni#1475.

Schema checking did not properly handle the legacy array-of-struct encoding, in which the repeated group of the list can contain more than one child. This PR updates the logic to handle that case. Together with the corresponding fixes in cudf and spark-rapids-jni, this allows the repeated_no_annotation.parquet file to be decoded properly.
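
For readers unfamiliar with the legacy encoding, the following is a minimal Python sketch of the backward-compatibility rules for lists as described in the Parquet format documentation. It is not the code changed in this PR. The `Field` class and the `repeated_field_is_element` function are hypothetical names introduced only for illustration, and the `phone` example schema is an assumption about what a multi-child legacy list looks like.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical, simplified Parquet schema node (illustration only)."""
    name: str
    repeated: bool = False
    # An empty child list is treated as a primitive field in this sketch.
    children: List["Field"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return bool(self.children)


def repeated_field_is_element(repeated: Field, parent_name: str) -> bool:
    """Return True when the repeated field is itself the list element
    (legacy encoding), and False when it is only the standard wrapper
    around a single element field."""
    # Rule 1: a repeated primitive is the element type.
    if not repeated.is_group:
        return True
    # Rule 2: a repeated group with more than one child is the element
    # type, i.e. the list holds structs (the case this PR is about).
    if len(repeated.children) > 1:
        return True
    # Rule 3: a single-child repeated group named "array" or
    # "<parent>_tuple" is also the element type.
    if repeated.name in ("array", parent_name + "_tuple"):
        return True
    # Otherwise the group is a wrapper and its only child is the element.
    return False


# Assumed example: a repeated group with two children, read as list<struct>.
phone = Field("phone", repeated=True,
              children=[Field("number"), Field("kind")])
print(repeated_field_is_element(phone, "phoneNumbers"))
```

Under these rules the two-child repeated group is treated as a struct element rather than as a wrapper, which is the situation a single-child assumption in schema checking would miss.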

@jlowe jlowe added the cudf_dependency label Oct 4, 2023
@jlowe jlowe self-assigned this Oct 4, 2023
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe force-pushed the fix-parquet-repeated-no-annotation branch from 3a0a595 to 0cacc00 Compare October 4, 2023 17:45
@jlowe jlowe marked this pull request as ready for review October 9, 2023 18:40
jlowe (Member, Author) commented Oct 9, 2023

build

@jlowe jlowe merged commit 5b84728 into NVIDIA:branch-23.12 Oct 13, 2023
29 checks passed
@jlowe jlowe deleted the fix-parquet-repeated-no-annotation branch October 13, 2023 19:04
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet load failure on repeated_no_annotation.parquet
2 participants