Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for struct type in ORC writer #9025

Merged
merged 82 commits into from
Sep 22, 2021

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Aug 12, 2021

Fixes #7830, #8443

Features:

  • Use the new table metadata type that matches the table hierarchy, table_input_metadata.
  • Support struct columns in the writer.

Changes:

  • Null masks are encoded as aligned rowgroups to avoid invalid bits when the number of encoded rows is not divisible by 8 (except for the last rowgroup in each stripe). This also affects list columns. The issue is equivalent to [BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763 (boolean columns only).
  • Added pushdown masks that are used to determine which child elements should not be encoded, including null mask bits.
  • Use pushdown masks for rowgroup alignment, null mask encoding and value encoding.
  • Separated the null mask encoding from value encoding - can be further moved to a separate kernel call.

Breaking because the table metadata type has changed.

@vuule vuule self-assigned this Aug 12, 2021
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Aug 12, 2021
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python/Cython approval, with a couple of small questions.

C++/CUDA looked OK, but probably best to get approval from someone with greater familiarity. I didn't review those files in depth.

Is there a reason that submodule updates are present in this PR? The current build failure may be related since it looked like it was failing to pull in jitify.

table_view_from_table(table, ignore_index=True)
)
if self.index is not False:
if isinstance(table._index, cudf.core.multiindex.MultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In write_orc the corresponding block checks for any non-RangeIndex type. Is that discrepancy intentional or meaningful? Or do you always get either a MultiIndex or a RangeIndex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to defer to @devavret or @galipremsagar on this one. The Python changes in this PR are based on the equivalent Parquet PR #7461.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block is a replacement of the logic of utils.get_column_names which assigns the names of the columns to metadata. We don't want to update the metadata for RangeIndex because metadata contains only information on libcudf column_views. We could get a single series materialized index too, which is captured by the second part of the branch.

Regarding writing RangeIndex, this is currently broken #7011 (and re-reported in #9216) because it might take some smarts to merge range indices across different calls to write_table. The RangeIndex only gets written in the pandas specific metadata and passes straight through libcudf untouched. If you have some time, I'd invite you to take on #7011

python/cudf/cudf/tests/test_orc.py Show resolved Hide resolved
@vuule
Copy link
Contributor Author

vuule commented Sep 17, 2021

Is there a reason that submodule updates are present in this PR? The current build failure may be related since it looked like it was failing to pull in jitify.

They got pulled in when I merged 21.10. Not sure what I did wrong and how to revert it. I have these two files marked as modified in git. When I try to update submodules, it fails with No url found for submodule path 'thirdparty/jitify' in .gitmodules.

@vuule
Copy link
Contributor Author

vuule commented Sep 17, 2021

rerun tests

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review but GitHub won't let me reply to @Vyas' comment without publishing my review. 🤷‍♂️

cpp/include/cudf/io/types.hpp Outdated Show resolved Hide resolved
cpp/src/io/orc/stripe_init.cu Show resolved Hide resolved
table_view_from_table(table, ignore_index=True)
)
if self.index is not False:
if isinstance(table._index, cudf.core.multiindex.MultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block is a replacement of the logic of utils.get_column_names which assigns the names of the columns to metadata. We don't want to update the metadata for RangeIndex because metadata contains only information on libcudf column_views. We could get a single series materialized index too, which is captured by the second part of the branch.

Regarding writing RangeIndex, this is currently broken #7011 (and re-reported in #9216) because it might take some smarts to merge range indices across different calls to write_table. The RangeIndex only gets written in the pandas specific metadata and passes straight through libcudf untouched. If you have some time, I'd invite you to take on #7011

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial pass, still going through PR.

cpp/src/io/orc/stripe_enc.cu Show resolved Hide resolved
cpp/src/io/orc/stripe_enc.cu Show resolved Hide resolved
@vuule vuule added 4 - Needs cuIO Reviewer and removed 2 - In Progress Currently a work in progress labels Sep 21, 2021
@vuule
Copy link
Contributor Author

vuule commented Sep 22, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2c6b39b into rapidsai:branch-21.10 Sep 22, 2021
@vuule vuule deleted the fea-orc-write-struct branch September 22, 2021 22:31
devavret added a commit to devavret/cudf that referenced this pull request Sep 23, 2021
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond breaking Breaking change CMake CMake build issue cuIO cuIO issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support struct types in ORC writer
8 participants