[DOC] Add libcudf policy that list columns must be sanitized #11748
Comments
The issue only reproduces with list columns that have non-empty null elements. Modifying the failing test so that its null lists are zero-length makes the test pass.
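To make "non-empty null element" concrete, here is a minimal sketch in plain Python. It models a list column as an offsets array plus a per-row validity list, which is an assumption made for illustration and not the libcudf API. The helper name `has_nonempty_nulls_model` is hypothetical.

```python
def has_nonempty_nulls_model(offsets, validity):
    """Return True if any null row still spans one or more child elements.

    offsets  -- len(validity) + 1 integers; row i covers child[offsets[i]:offsets[i + 1]]
    validity -- one bool per row; False marks a null row
    (Hypothetical model for illustration, not a libcudf function.)
    """
    if len(offsets) != len(validity) + 1:
        raise ValueError("offsets must have exactly one more entry than validity")
    return any(
        (not valid) and offsets[i + 1] > offsets[i]
        for i, valid in enumerate(validity)
    )


# Rows: [1, 2], null, [3]
validity = [True, False, True]

# Non-conforming: the null row 1 still covers child[2:4].
unsanitized_offsets = [0, 2, 4, 5]

# Conforming: the null row 1 is zero-length (offsets[1] == offsets[2]).
sanitized_offsets = [0, 2, 2, 3]

print(has_nonempty_nulls_model(unsanitized_offsets, validity))  # True
print(has_nonempty_nulls_model(sanitized_offsets, validity))    # False
```

In this model, the failing test corresponds to the first layout and the passing test to the second.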
So we may end up always calling
It depends on where the data is coming from. For example, cuIO readers do not create such columns. Maybe @mythrocks can comment on the cases in which we can create non-conforming columns.
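As a rough illustration of what sanitizing a non-conforming column involves, the sketch below rebuilds the offsets and child data so that every null row becomes zero-length. It uses plain Python lists as a stand-in for a list column. The function name `sanitize_list_column` and the data layout are assumptions for this example, not libcudf code.

```python
def sanitize_list_column(offsets, validity, child):
    """Drop child elements covered by null rows and rebuild the offsets.

    (Hypothetical helper on a plain-Python model of a list column.)
    Returns (new_offsets, validity, new_child); validity is unchanged.
    """
    if len(offsets) != len(validity) + 1:
        raise ValueError("offsets must have exactly one more entry than validity")
    new_offsets = [0]
    new_child = []
    for i, valid in enumerate(validity):
        if valid:
            # Keep the elements of valid rows only.
            new_child.extend(child[offsets[i]:offsets[i + 1]])
        # A null row appends nothing, so its start and end offsets coincide.
        new_offsets.append(len(new_child))
    return new_offsets, list(validity), new_child


# Rows: [1, 2], null (but still covering two stale elements), [3]
offsets = [0, 2, 4, 5]
validity = [True, False, True]
child = [1, 2, 9, 9, 3]

new_offsets, new_validity, new_child = sanitize_list_column(offsets, validity, child)
print(new_offsets)  # [0, 2, 2, 3]
print(new_child)    # [1, 2, 3]
```

The cost is a full pass over the column, which is why it matters whether a given producer can create non-conforming columns in the first place.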
…ization for nested preprocessing (#11752)
Fixes an issue where using user bounds with parquet files containing both nested and non-nested types could result in incorrect row counts for the non-nested columns. Originally reported by @etseidl.
The nature of the fix also implements a longstanding desired optimization: when running the preprocess step for nested types, ignore pages for non-nested hierarchies. This can result in significant speedups for files containing only a few nested columns.
<s>The tests added for this PR seem to tease a bug in the parquet writer into happening (#11748) so I will leave this as a draft until that issue is resolved.</s>
Authors:
- https://github.com/nvdbaranec
Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)
- Mike Wilson (https://github.com/hyperbolic2346)
URL: #11752
I will convert this into a documentation issue. The update will be needed in https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#list-columns
I think this is solved by #11853 unless we want to copy the advice into another location.
Agreed. Closing.
I found a case where the parquet writer seems to be incorrectly writing the validity data for the top level of a list column. It ends up writing out an all-valid mask, when there are actually nulls. Confirmed this is a writer issue by loading the resulting file in external tools as well. ORC writer does not exhibit this behavior.
Repro:
Expected (nested float column not shown)
Got