
Adding binary read/write as options for parquet #11160

Merged

Conversation

hyperbolic2346
Contributor

@hyperbolic2346 hyperbolic2346 commented Jun 28, 2022

There are a couple of issues (#11044 and #10778) revolving around adding support for binary reads and writes in parquet. The desire is to be able to write strings and lists of int8 values as binary. This PR adds support for strings to be written as binary, and for binary data to be read back as either binary or strings. I have left the default for binary data to be read as strings, to prevent any surprises upon upgrade.

Single-depth list columns of int8 and uint8 values are not yet written as binary with this change. That will come in a follow-up PR, after discussion of the possible impact of the change.

Closes #11044
Issue #10778
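For background on why the read default matters: arbitrary binary data is not guaranteed to be valid UTF-8, so treating a binary parquet column as strings can misinterpret or reject real data. A minimal stdlib Python sketch of the mismatch (illustration only, not cuDF code):

```python
# Arbitrary binary data is not, in general, valid UTF-8, which is why
# reading binary parquet data back as strings can be surprising.
payload = b"\xff\xff\xff\xff"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"hello"))  # True: plain ASCII is valid UTF-8
print(is_valid_utf8(payload))   # False: the byte 0xFF never occurs in UTF-8
```

This is exactly why the reader needs an explicit option: round-tripping binary through a string column only works when the bytes happen to be valid UTF-8.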

@hyperbolic2346 hyperbolic2346 added the feature request, 3 - Ready for Review, libcudf, Python, cuIO, Java, improvement, and non-breaking labels Jun 28, 2022
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner June 28, 2022 00:55
@hyperbolic2346 hyperbolic2346 self-assigned this Jun 28, 2022
@hyperbolic2346 hyperbolic2346 requested review from a team as code owners June 28, 2022 00:55
@codecov

codecov bot commented Jun 28, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@0f860ea).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11160   +/-   ##
===============================================
  Coverage                ?   86.43%           
===============================================
  Files                   ?      144           
  Lines                   ?    22808           
  Branches                ?        0           
===============================================
  Hits                    ?    19714           
  Misses                  ?     3094           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f860ea...c45b454.

@hyperbolic2346 hyperbolic2346 removed the improvement label Jun 28, 2022
@devavret
Contributor

One requirement for this PR is to ensure that the statistics values are what we expect for binary columns. For example, 0xffffffff is the maximum 4-byte binary value, but it is not a valid UTF-8 string. @etseidl can elaborate.

@etseidl
Contributor

etseidl commented Jun 30, 2022

One requirement for this PR is to ensure that the statistics values are what we expect for binary columns. For example, 0xffffffff is the maximum 4-byte binary value, but it is not a valid UTF-8 string. @etseidl can elaborate.

typed_statistics_chunk::minimum_value is initialized to minimum_identity<E>(), which for E == string_view is string_view::max == 0xf7bfbfbf (the maximum possible UTF-8 encoded value). This works for Unicode columns, since the highest Unicode code point is well below the highest UTF-8 value. For binary data this gets trickier. Consider a data frame with a single row and a single binary column, where that row consists of a run of bytes beginning with 0xfffffffffefa. Currently, while treating this binary column as a string column, writing this data frame as parquet will produce statistics indicating a minimum value of 0xf7bfbfbf. While still a valid lower bound, this can be confusing, since 0xf7bfbfbf appears nowhere in the actual data. Using 0xffffffff as the initial min value would be better, but would still yield a minimum statistic that is not a valid data value.
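The accumulation described above can be mimicked with plain byte comparisons (a stdlib Python sketch, not the libcudf implementation): seeding the running minimum with the 0xf7bfbfbf sentinel and folding in data that begins with 0xff leaves the sentinel as the reported minimum, because byte-wise it compares below every actual value.

```python
from functools import reduce

# Sentinel used when the column is treated as strings:
# string_view::max, the largest possible UTF-8 encoded value.
UTF8_MAX_SENTINEL = b"\xf7\xbf\xbf\xbf"

# Single row of binary data beginning with 0xff ff ff ff fe fa.
data = [b"\xff\xff\xff\xff\xfe\xfa"]

# typed_statistics_chunk-style accumulation: start from the sentinel
# and take the byte-wise (lexicographic) minimum over all values.
reported_min = reduce(min, data, UTF8_MAX_SENTINEL)

# The sentinel wins: 0xf7 < 0xff, so the reported minimum is a value
# that never occurs in the data (though still a valid lower bound).
print(reported_min.hex())  # f7bfbfbf
```

With proper binary statistics, the minimum would instead be taken over the data values themselves, yielding an actual data value as the bound.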

col_schema.type = Type::BYTE_ARRAY;
if (col_meta.is_enabled_output_as_binary()) {
  col_schema.converted_type = ConvertedType::UNKNOWN;
  col_schema.stats_dtype    = statistics_dtype::dtype_int8;
Contributor

I think using dtype_int8 here will result in only 4 bytes being written for the min/max parquet statistics. Would it be possible to add a dtype_binary to statistics_dtype?

Contributor Author

I think we're on that path now. I don't believe this is the correct statistics data type for the binary column.

Contributor Author

This will turn into a second PR to correct this. For now, the statistics_type will be string. The next PR is scheduled into the same release.

Contributor

@karthikeyann karthikeyann left a comment

C++ review. LGTM.

@hyperbolic2346 hyperbolic2346 added the 5 - DO NOT MERGE (hold off on merging; see PR for details) label Jul 20, 2022
Co-authored-by: MithunR <mythrocks@gmail.com>
Contributor

@mythrocks mythrocks left a comment

A couple of minor nitpicks, and requests for clarification.

Contributor

@vuule vuule left a comment

Looks solid. A few minor suggestions, none of which affect correctness.

cpp/src/io/parquet/parquet_gpu.hpp
cpp/src/io/parquet/reader_impl.cu
cpp/include/cudf/io/parquet.hpp
cpp/tests/io/parquet_test.cpp
cpp/tests/io/parquet_test.cpp
@nvdbaranec
Contributor

Note: All Spark plugin tests and integration tests pass with this branch.

Contributor

@mythrocks mythrocks left a comment

👍, after addressing the comments from the other reviewers.

Contributor

@vuule vuule left a comment

Thank you for addressing all nitpicks!
💯

Contributor

@vyasr vyasr left a comment

Python approval. One minor observation that should probably be fixed at some point, but is not a blocker for this PR.

python/cudf/cudf/_lib/cpp/io/types.pxd
@vuule
Contributor

vuule commented Jul 29, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d94d761 into rapidsai:branch-22.08 Jul 29, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/parquet_writer_binary branch July 29, 2022 02:31
rapids-bot bot pushed a commit that referenced this pull request Jul 29, 2022
… parquet (#11328)

This is the last major feature in the byte array changes for parquet. This PR enables support for lists of bytes to be written as byte arrays in parquet files, which is a more efficient storage mechanism than what was used before.

Limitations:

- Only top-level lists are currently considered for writing. Supporting nested lists requires additional changes, including Dremel changes that are not included here. This isn't a must-have yet, but it is desired.
- No dictionary support for lists of bytes. Dictionaries are supported for string columns, so the current workaround is to change the column type to string before saving and use the option to write as byte arrays. Supporting dictionaries for lists of bytes will require some more work on the murmur hash code.

This is based on top of #11160 and should not merge until that does. Once it merges, the delta here will shrink considerably.

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11328
Successfully merging this pull request may close these issues.

[FEA] Parquet support for reading binary and repeated binary as binary not strings