-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding binary read/write as options for parquet #11160
Adding binary read/write as options for parquet #11160
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.08 #11160 +/- ##
===============================================
Coverage ? 86.43%
===============================================
Files ? 144
Lines ? 22808
Branches ? 0
===============================================
Hits ? 19714
Misses ? 3094
Partials ? 0 Continue to review full report at Codecov.
|
One requirement for this PR is to ensure that the statistics values are what we expect for binary columns. e.g. |
|
cpp/src/io/parquet/writer_impl.cu
Outdated
col_schema.type = Type::BYTE_ARRAY; | ||
if (col_meta.is_enabled_output_as_binary()) { | ||
col_schema.converted_type = ConvertedType::UNKNOWN; | ||
col_schema.stats_dtype = statistics_dtype::dtype_int8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using dtype_int8 here will result in only 4 bytes being written for the min/max parquet statistics. Would it be possible to add a dtype_binary to statistics_dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're on that path now. I don't believe this is the correct statistics data type for the binary column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will turn into a second PR to correct this. For now, the statistics_type will be string. The next PR is scheduled into the same release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpp review. LGTM.
Co-authored-by: MithunR <mythrocks@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor nitpicks, and requests for clarification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks solid. Few minor suggestions, nothing that improves correctness.
Note: All Spark plugin tests and integration tests pass with this branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍, after addressing the comments from the other reviewers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing all nitpicks!
💯
…2346/cudf into mwilson/parquet_writer_binary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python approval. One minor observation that should probably be fixed at some point, but is not a blocker for this PR.
@gpucibot merge |
… parquet (#11328) This is the last major feature in the byte array changes for parquet. This PR enables support for lists of bytes to be written as byte arrays in parquet files. This is a more efficient storage mechanism than what was used before. Limitations: - Only top-level lists are currently considered for writing. Some changes are necessary to allow nesting of these including dremel changes which are not here. This isn't a must-have yet, but is desired. - No dictionary support for lists of bytes. Dictionaries are supported for string columns, so the workaround is currently to change the column type to string before saving and using the option to write as byte arrays. This will require some more work with murmur hash to support. This is based on top of #11160 and should not merge until it does. Once that merges, the delta here will reduce a good deal. Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - MithunR (https://github.com/mythrocks) - Vukasin Milovanovic (https://github.com/vuule) URL: #11328
There are a couple of issues(#11044 and #10778) revolving around adding support for binary writes and reads to parquet. The desire is to be able to write strings and lists of int8 values as binary. This PR adds support for strings to be written as binary and for binary data to be read as binary or strings. I have left the default for binary data to read as a string to prevent any surprises upon upgrade.
Single-depth list columns of int8 and uint8 values are not written as binary with this change. That will be another PR after discussions about the possible impact of the change.
Closes #11044
Issue #10778