
Adding binary read/write as options for parquet #11160

Merged

Conversation

hyperbolic2346
Contributor

@hyperbolic2346 hyperbolic2346 commented Jun 28, 2022

There are a couple of issues (#11044 and #10778) revolving around adding support for binary reads and writes in parquet. The desire is to be able to write strings and lists of int8 values as binary. This PR adds support for strings to be written as binary, and for binary data to be read back as either binary or strings. I have left the default for binary data to be read as strings, to prevent any surprises upon upgrade.

Single-depth list columns of int8 and uint8 values are not yet written as binary with this change. That will come in a follow-up PR, after discussion of the possible impact of the change.

Closes #11044
Issue #10778
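For background on why the read default matters: arbitrary binary data is not guaranteed to be valid UTF-8, so treating a binary parquet column as strings can misinterpret or reject real data. A minimal stdlib Python sketch of the mismatch (illustration only, not cuDF code):

```python
# Arbitrary binary data is not, in general, valid UTF-8, which is why
# reading binary parquet data back as strings can be surprising.
payload = b"\xff\xff\xff\xff"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"hello"))  # True: plain ASCII is valid UTF-8
print(is_valid_utf8(payload))   # False: the byte 0xFF never occurs in UTF-8
```

This is exactly why the reader needs an explicit option: round-tripping binary through a string column only works when the bytes happen to be valid UTF-8.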

@hyperbolic2346 hyperbolic2346 added the feature request, 3 - Ready for Review, libcudf, Python, cuIO, Java, improvement, and non-breaking labels Jun 28, 2022
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner June 28, 2022 00:55
@hyperbolic2346 hyperbolic2346 self-assigned this Jun 28, 2022
@hyperbolic2346 hyperbolic2346 requested review from a team as code owners June 28, 2022 00:55
@codecov

codecov bot commented Jun 28, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@0f860ea).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11160   +/-   ##
===============================================
  Coverage                ?   86.43%           
===============================================
  Files                   ?      144           
  Lines                   ?    22808           
  Branches                ?        0           
===============================================
  Hits                    ?    19714           
  Misses                  ?     3094           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f860ea...c45b454.

@hyperbolic2346 hyperbolic2346 removed the improvement label Jun 28, 2022
@devavret
Contributor

One requirement for this PR is to ensure that the statistics values are what we expect for binary columns. For example, 0xffffffff is the maximum 4-byte binary value, but it is not a valid UTF-8 string. @etseidl can elaborate.

@etseidl
Contributor

etseidl commented Jun 30, 2022

One requirement for this PR is to ensure that the statistics values are what we expect for binary columns. For example, 0xffffffff is the maximum 4-byte binary value, but it is not a valid UTF-8 string. @etseidl can elaborate.

typed_statistics_chunk::minimum_value is initialized to minimum_identity<E>(), which for E == string_view is string_view::max == 0xf7bfbfbf (the maximum possible UTF-8 encoded value). This works for Unicode columns, since the highest Unicode code point is well below the highest UTF-8 value. For binary data this gets trickier. Consider a data frame with a single row and a single binary column, where that row consists of a run of bytes beginning with 0xfffffffffefa. Currently, while treating this binary column as a string column, writing this data frame as parquet will produce statistics indicating a minimum value of 0xf7bfbfbf. While still a valid lower bound, this can be confusing, since 0xf7bfbfbf appears nowhere in the actual data. Using 0xffffffff as the initial min value would be better, but would still yield a minimum statistic that is not a valid data value.
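The accumulation described above can be mimicked with plain byte comparisons (a stdlib Python sketch, not the libcudf implementation): seeding the running minimum with the 0xf7bfbfbf sentinel and folding in data that begins with 0xff leaves the sentinel as the reported minimum, because byte-wise it compares below every actual value.

```python
from functools import reduce

# Sentinel used when the column is treated as strings:
# string_view::max, the largest possible UTF-8 encoded value.
UTF8_MAX_SENTINEL = b"\xf7\xbf\xbf\xbf"

# Single row of binary data beginning with 0xff ff ff ff fe fa.
data = [b"\xff\xff\xff\xff\xfe\xfa"]

# typed_statistics_chunk-style accumulation: start from the sentinel
# and take the byte-wise (lexicographic) minimum over all values.
reported_min = reduce(min, data, UTF8_MAX_SENTINEL)

# The sentinel wins: 0xf7 < 0xff, so the reported minimum is a value
# that never occurs in the data (though still a valid lower bound).
print(reported_min.hex())  # f7bfbfbf
```

With proper binary statistics, the minimum would instead be taken over the data values themselves, yielding an actual data value as the bound.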

col_schema.type = Type::BYTE_ARRAY;
if (col_meta.is_enabled_output_as_binary()) {
  col_schema.converted_type = ConvertedType::UNKNOWN;
  col_schema.stats_dtype    = statistics_dtype::dtype_int8;
Contributor

I think using dtype_int8 here will result in only 4 bytes being written for the min/max parquet statistics. Would it be possible to add a dtype_binary to statistics_dtype?

Contributor Author

I think we're on that path now. I don't believe this is the correct statistics data type for the binary column.

Contributor Author

This will turn into a second PR to correct this. For now, the statistics_type will be string. The next PR is scheduled into the same release.

Contributor

@karthikeyann karthikeyann left a comment

C++ review. LGTM.

@hyperbolic2346 hyperbolic2346 added the 5 - DO NOT MERGE (hold off on merging; see PR for details) label Jul 20, 2022
Co-authored-by: MithunR <mythrocks@gmail.com>
Contributor

@mythrocks mythrocks left a comment

A couple of minor nitpicks, and requests for clarification.

Contributor

@vuule vuule left a comment

Looks solid. A few minor suggestions, none of which affect correctness.

cpp/src/io/parquet/parquet_gpu.hpp
cpp/src/io/parquet/reader_impl.cu
cpp/include/cudf/io/parquet.hpp
cpp/tests/io/parquet_test.cpp
cpp/tests/io/parquet_test.cpp
@nvdbaranec
Contributor

Note: All Spark plugin tests and integration tests pass with this branch.

Contributor

@mythrocks mythrocks left a comment

👍, after addressing the comments from the other reviewers.

Contributor

@vuule vuule left a comment

Thank you for addressing all nitpicks!
💯

Contributor

@vyasr vyasr left a comment

Python approval. One minor observation that should probably be fixed at some point, but is not a blocker for this PR.

python/cudf/cudf/_lib/cpp/io/types.pxd
@vuule
Contributor

vuule commented Jul 29, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d94d761 into rapidsai:branch-22.08 Jul 29, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/parquet_writer_binary branch July 29, 2022 02:31
rapids-bot bot pushed a commit that referenced this pull request Jul 29, 2022
… parquet (#11328)

This is the last major feature in the byte array changes for parquet. This PR enables support for lists of bytes to be written as byte arrays in parquet files, which is a more efficient storage mechanism than what was used before.

Limitations:

- Only top-level lists are currently considered for writing. Supporting nested lists requires additional changes, including Dremel changes that are not included here. This isn't a must-have yet, but it is desired.
- No dictionary support for lists of bytes. Dictionaries are supported for string columns, so the current workaround is to change the column type to string before saving and use the option to write as byte arrays. Supporting dictionaries for lists of bytes will require some more work on the murmur hash code.

This is based on top of #11160 and should not merge until that does. Once it merges, the delta here will shrink considerably.

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11328
Successfully merging this pull request may close these issues.

[FEA] Parquet support for reading binary and repeated binary as binary not strings