Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More granular column selection in ORC reader #9496

Merged
merged 31 commits into from
Oct 28, 2021

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Oct 22, 2021

Closes #8848

  • Allows caller to specify nested column paths, so that the fields not listed in the columns parameter are excluded.
  • The order of fields/columns in the output table is consistent with the order of paths/names in the columns parameter.
  • Moved aggregate_orc_metadata implementation to a separate file (can be .cpp!)
  • Add tests to cover different cases with a mix of nested and parent columns selection.
  • changed a few fields from uint32_t to int32_t to avoid unsigned arithmetic.

@vuule vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Oct 22, 2021
@vuule vuule self-assigned this Oct 22, 2021
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 22, 2021
@codecov
Copy link

codecov bot commented Oct 22, 2021

Codecov Report

Merging #9496 (3c0d862) into branch-21.12 (ab4bfaa) will decrease coverage by 0.12%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9496      +/-   ##
================================================
- Coverage         10.79%   10.66%   -0.13%     
================================================
  Files               116      117       +1     
  Lines             18869    19725     +856     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17621     +788     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 66 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0951ba...3c0d862. Read the comment docs.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 22, 2021
@vuule vuule requested review from a team as code owners October 25, 2021 23:46
@vuule vuule added the 3 - Ready for Review Ready for review by team label Oct 26, 2021
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes 👍

Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, looks pretty good, just had few questions.

cpp/src/io/orc/aggregate_orc_metadata.hpp Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_orc.py Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to see more code in cpp files, and I like the new feature (although I had a question about nesting depth that I've left inline). One overarching question: there's a lot of int32_t usage in this PR, is that because of the ORC specification or should the code be making use of cudf::size_type? Aside from that, I don't know this code too well so I left some questions in places that I didn't understand and comments on things that weren't really changes in this PR. Hopefully they're not too troublesome, feel free to ignore things that you need to.

cpp/src/io/orc/orc.cpp Show resolved Hide resolved
cpp/src/io/orc/orc.h Outdated Show resolved Hide resolved
cpp/src/io/orc/orc.h Show resolved Hide resolved
cpp/src/io/orc/orc.h Show resolved Hide resolved
cpp/src/io/orc/aggregate_orc_metadata.hpp Outdated Show resolved Hide resolved
cpp/src/io/orc/aggregate_orc_metadata.cpp Outdated Show resolved Hide resolved
cpp/src/io/orc/aggregate_orc_metadata.cpp Outdated Show resolved Hide resolved
cpp/src/io/orc/aggregate_orc_metadata.cpp Outdated Show resolved Hide resolved
cpp/src/io/orc/aggregate_orc_metadata.cpp Outdated Show resolved Hide resolved
cpp/src/io/orc/orc.cpp Show resolved Hide resolved
@vuule
Copy link
Contributor Author

vuule commented Oct 27, 2021

@vyasr Thank you for the detailed review :)
Regarding

there's a lot of int32_t usage in this PR, is that because of the ORC specification or should the code be making use of cudf::size_type?

I did considered this option. My understanding is that size_type is used to represent the number of rows/row index. The uses here are all column indices. I wasn't sure is size_type applied there too, so I left as int32_t. ORC specs suggest uint32_t, which is more error-prone etc. Open for suggestions, maybe there should even be a new alias in use here.
Edit: switched to size_type :)

@vuule vuule requested a review from vyasr October 27, 2021 20:14
@vyasr
Copy link
Contributor

vyasr commented Oct 28, 2021

@vyasr Thank you for the detailed review :) Regarding

there's a lot of int32_t usage in this PR, is that because of the ORC specification or should the code be making use of cudf::size_type?

I did considered this option. My understanding is that size_type is used to represent the number of rows/row index. The uses here are all column indices. I wasn't sure is size_type applied there too, so I left as int32_t. ORC specs suggest uint32_t, which is more error-prone etc. Open for suggestions, maybe there should even be a new alias in use here. Edit: switched to size_type :)

Just following up even though you did already make the change: I think size_type is also appropriate for column indices. My reasoning is that when we index into columns the expected input is a size_type (for example, via table_view::column), and size_type defines the bounds for the number of columns. I've also seen enough code switching from raw for loops to algorithms where the original for loops used size_type to make me think that was the expected use case for the type.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for answering my questions! This PR LGTM, but would probably benefit from an additional CPP review from a cuIO expert.

@vuule
Copy link
Contributor Author

vuule commented Oct 28, 2021

Thanks for answering my questions! This PR LGTM, but would probably benefit from an additional CPP review from a cuIO expert.

@rgsl888prabhu is the expert on this code, so this should be covered :)

Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, only have couple of queries on testing.

python/cudf/cudf/tests/test_orc.py Show resolved Hide resolved
@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 28, 2021
@vuule
Copy link
Contributor Author

vuule commented Oct 28, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8e0e70d into rapidsai:branch-21.12 Oct 28, 2021
@vuule vuule deleted the fea-select-nested-cols-orc branch October 28, 2021 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support nested column pruning in ORC reader when reading a struct column.
4 participants