Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nested column selection to parquet reader #8933

Merged
merged 26 commits into from
Aug 19, 2021

Conversation

devavret
Copy link
Contributor

@devavret devavret commented Aug 3, 2021

Closes #8850

Adds ability to select specific children of a nested column. The python API mimics pyarrow and the format is

cudf.read_parquet("test.parquet", columns=["struct1.child1.grandchild2", "struct1.child2"])

The C++ API takes each path as a vector

cudf::io::parquet_reader_options read_args =
  cudf::io::parquet_reader_options::builder(cudf::io::source_info(filepath))
    .columns({{"struct1", "child1", "grandchild2"},
              {"struct1", "child2"}});

@devavret devavret requested review from a team as code owners August 3, 2021 09:31
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Aug 3, 2021
@devavret devavret requested review from nvdbaranec and vuule and removed request for robertmaynard and codereport August 3, 2021 09:31
@devavret
Copy link
Contributor Author

devavret commented Aug 3, 2021

cc @jlowe

@devavret devavret added the 3 - Ready for Review Ready for review by team label Aug 3, 2021
Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use BytesIO instead of writing to disk.

python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
Copy link
Contributor

@cwharris cwharris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review part 1/2

python/cudf/cudf/_lib/parquet.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/parquet.pyx Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Aug 11, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@29b5f9a). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 8ac1a8c differs from pull request most recent head 1c9ab5b. Consider uploading reports for the commit 1c9ab5b to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #8933   +/-   ##
===============================================
  Coverage                ?   10.71%           
===============================================
  Files                   ?      114           
  Lines                   ?    19103           
  Branches                ?        0           
===============================================
  Hits                    ?     2046           
  Misses                  ?    17057           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29b5f9a...1c9ab5b. Read the comment docs.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general, got some questions/nitpicks.
Not 100% sure I fully understood the logic in the recursive function, will go over it again.

cpp/src/io/parquet/reader_impl.cu Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/tests/io/parquet_test.cpp Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nitpicks, looks great overall!

cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
@devavret devavret requested a review from vuule August 18, 2021 18:41
@devavret
Copy link
Contributor Author

rerun tests

@devavret
Copy link
Contributor Author

rerun tests

@devavret
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 80bf8d9 into rapidsai:branch-21.10 Aug 19, 2021
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond breaking Breaking change cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support selectively reading parts of nested column in Parquet reader
6 participants