Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing crash when writing binary nested data in parquet #11526

Conversation

hyperbolic2346
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 commented Aug 13, 2022

Description

This fixes the crash described in the bug related to writing nested data in parquet with the binary flag set to write binary data as byte_arrays. We were incorrectly selecting the top-most node instead of the list, which resulted in a crash down in the kernels when the data pointer was null for those upper list columns.

closes #11506

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@hyperbolic2346 hyperbolic2346 added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Aug 13, 2022
@hyperbolic2346 hyperbolic2346 self-assigned this Aug 13, 2022
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner August 13, 2022 01:19
@codecov
Copy link

codecov bot commented Aug 13, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@9c22da5). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11526   +/-   ##
===============================================
  Coverage                ?   86.36%           
===============================================
  Files                   ?      145           
  Lines                   ?    22949           
  Branches                ?        0           
===============================================
  Hits                    ?    19820           
  Misses                  ?     3129           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

cpp/src/io/utilities/column_utils.cuh Outdated Show resolved Hide resolved
: col.child(0);
// stop early if writing a byte array, it needs to be a list<int8> not the int8 column
if (col_desc[index].stats_dtype == dtype_byte_array &&
(child.type().id() == type_id::INT8 || child.type().id() == type_id::UINT8)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Spark differentiate binary types stored as int8 and uint8? Are both of these equivalent from Spark’s perspective? Can Spark normalize inputs to use only one or the other? It feels very suspicious to me that we’ve decided to accept either one across the binary code paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We represent these as list<int8> due to the way string columns are created, but use std::byte in byte_array_view, which is unsigned. I prefer to represent byte-like things as unsigned. I didn't know exactly where we would land and didn't want to have trouble and so I decided to support both signed and unsigned. I'm happy to discuss this further and come to an agreement, but I would suggest a new issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::byte is neither signed nor unsigned. https://godbolt.org/z/bo1rorxGM

Interpreting a std::byte as a signed or unsigned requires an explicit cast. We can probably pick one and stick with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cpp reference lists it as: enum class byte : unsigned char {} ;, which is why I said that. I see your point though and it should be agnostic as it is simply a byte. I would personally pick uint8 for the column type, but it seems that int8 has momentum due to its use in string columns. Happy to go either way. Should I make a new issue to address this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a new issue sounds good. I'd suggest using int8 to align with strings columns. Moreover, tracking the signed-ness is not relevant to us, since there's no arithmetic operations needed on binary data and certain optimizations may be easier for signed types.

cpp/src/io/utilities/column_utils.cuh Outdated Show resolved Hide resolved
@hyperbolic2346
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit dd0ff30 into rapidsai:branch-22.10 Aug 15, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/parquet_nested_binary_write branch August 15, 2022 23:10
rapids-bot bot pushed a commit that referenced this pull request Oct 18, 2022
As suggested in #11526 and captured in issue #11536 the usage of both INT8 and UINT8 as supported types for byte_arrays is unnecessary and adds complexity to the code. This change removes INT8 as an option and only allows UINT8 columns to be written out as byte_arrays. ~~This matches with cudf string columns which contain an INT8 column for data.~~

closes #11536

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Tobias Ribizel (https://github.com/upsj)
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)
  - MithunR (https://github.com/mythrocks)
  - Bradley Dice (https://github.com/bdice)

URL: #11539
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] writing binary data in a struct causes out of index reads
3 participants