Fixing crash when writing binary nested data in parquet #11526

hyperbolic2346 · 2022-08-13T01:19:41Z

Description

This fixes the crash described in the bug related to writing nested data in parquet with the binary flag set to write binary data as byte_arrays. We were incorrectly selecting the top-most node instead of the list, which resulted in a crash down in the kernels when the data pointer was null for those upper list columns.

closes #11506

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

cpp/src/io/utilities/column_utils.cuh

codecov · 2022-08-13T03:12:04Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@9c22da5). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11526   +/-   ##
===============================================
  Coverage                ?   86.36%           
===============================================
  Files                   ?      145           
  Lines                   ?    22949           
  Branches                ?        0           
===============================================
  Hits                    ?    19820           
  Misses                  ?     3129           
  Partials                ?        0

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

cpp/src/io/utilities/column_utils.cuh

bdice · 2022-08-15T15:46:58Z

cpp/src/io/utilities/column_utils.cuh

+                       : col.child(0);
+        // stop early if writing a byte array, it needs to be a list<int8> not the int8 column
+        if (col_desc[index].stats_dtype == dtype_byte_array &&
+            (child.type().id() == type_id::INT8 || child.type().id() == type_id::UINT8)) {


Does Spark differentiate binary types stored as int8 and uint8? Are both of these equivalent from Spark’s perspective? Can Spark normalize inputs to use only one or the other? It feels very suspicious to me that we’ve decided to accept either one across the binary code paths.

We represent these as list<int8> due to the way string columns are created, but use std::byte in byte_array_view, which is unsigned. I prefer to represent byte-like things as unsigned. I didn't know exactly where we would land and didn't want to have trouble and so I decided to support both signed and unsigned. I'm happy to discuss this further and come to an agreement, but I would suggest a new issue.

std::byte is neither signed nor unsigned. https://godbolt.org/z/bo1rorxGM

Interpreting a std::byte as a signed or unsigned requires an explicit cast. We can probably pick one and stick with it.

Cpp reference lists it as: enum class byte : unsigned char {} ;, which is why I said that. I see your point though and it should be agnostic as it is simply a byte. I would personally pick uint8 for the column type, but it seems that int8 has momentum due to its use in string columns. Happy to go either way. Should I make a new issue to address this?

Yes, a new issue sounds good. I'd suggest using int8 to align with strings columns. Moreover, tracking the signed-ness is not relevant to us, since there's no arithmetic operations needed on binary data and certain optimizations may be easier for signed types.

cpp/src/io/utilities/column_utils.cuh

hyperbolic2346 · 2022-08-15T18:42:23Z

@gpucibot merge

As suggested in #11526 and captured in issue #11536 the usage of both INT8 and UINT8 as supported types for byte_arrays is unnecessary and adds complexity to the code. This change removes INT8 as an option and only allows UINT8 columns to be written out as byte_arrays. ~~This matches with cudf string columns which contain an INT8 column for data.~~ closes #11536 Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Tobias Ribizel (https://github.com/upsj) - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) - MithunR (https://github.com/mythrocks) - Bradley Dice (https://github.com/bdice) URL: #11539

fixing nested write issue

9cb33c2

hyperbolic2346 added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Aug 13, 2022

hyperbolic2346 self-assigned this Aug 13, 2022

hyperbolic2346 requested a review from a team as a code owner August 13, 2022 01:19

hyperbolic2346 requested review from robertmaynard and ttnghia August 13, 2022 01:19

hyperbolic2346 mentioned this pull request Aug 13, 2022

Adding optional parquet reader schema #11524

Merged

3 tasks

ttnghia reviewed Aug 13, 2022

View reviewed changes

cpp/src/io/utilities/column_utils.cuh Show resolved Hide resolved

bdice reviewed Aug 15, 2022

View reviewed changes

review comments update

5bcfc59

hyperbolic2346 requested review from bdice and ttnghia August 15, 2022 16:30

hyperbolic2346 mentioned this pull request Aug 15, 2022

Need to remove int8 column support in parquet binary code path. #11536

Closed

bdice approved these changes Aug 15, 2022

View reviewed changes

rapids-bot bot merged commit dd0ff30 into rapidsai:branch-22.10 Aug 15, 2022

hyperbolic2346 deleted the mwilson/parquet_nested_binary_write branch August 15, 2022 23:10

hyperbolic2346 mentioned this pull request Aug 16, 2022

Removing int8 column option from parquet byte_array writing #11539

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing crash when writing binary nested data in parquet #11526

Fixing crash when writing binary nested data in parquet #11526

hyperbolic2346 commented Aug 13, 2022 •

edited

Loading

codecov bot commented Aug 13, 2022 •

edited

Loading

bdice Aug 15, 2022

hyperbolic2346 Aug 15, 2022

bdice Aug 15, 2022

hyperbolic2346 Aug 15, 2022

bdice Aug 15, 2022

hyperbolic2346 commented Aug 15, 2022

Fixing crash when writing binary nested data in parquet #11526

Fixing crash when writing binary nested data in parquet #11526

Conversation

hyperbolic2346 commented Aug 13, 2022 • edited Loading

Description

Checklist

codecov bot commented Aug 13, 2022 • edited Loading

Codecov Report

bdice Aug 15, 2022

Choose a reason for hiding this comment

hyperbolic2346 Aug 15, 2022

Choose a reason for hiding this comment

bdice Aug 15, 2022

Choose a reason for hiding this comment

hyperbolic2346 Aug 15, 2022

Choose a reason for hiding this comment

bdice Aug 15, 2022

Choose a reason for hiding this comment

hyperbolic2346 commented Aug 15, 2022

hyperbolic2346 commented Aug 13, 2022 •

edited

Loading

codecov bot commented Aug 13, 2022 •

edited

Loading