[FEA] read_orc_metadata and read_parquet_metadata in libcudf #11675

Closed
2 tasks done
vuule opened this issue Sep 8, 2022 · 0 comments · Fixed by #13663
Labels: cuIO, feature request, libcudf


vuule commented Sep 8, 2022

libcudf has no functions that read a file's metadata without also reading (a portion of) the data.
Exposing an efficient way to get information such as column names and the number of row groups/stripes would make some uses of read_orc and read_parquet much simpler.

The exact information these functions should return is to be determined.

  • read_orc_metadata
  • read_parquet_metadata
vuule added the feature request and cuIO labels Sep 8, 2022
vuule self-assigned this Sep 27, 2022
rapids-bot bot pushed a commit that referenced this issue Nov 1, 2022
Issue #11675

Adds a C++ interface to get information about an ORC file. It is meant to be an efficient way to get information like column names and types, as well as file structure (e.g. number of stripes). The returned structure can be expanded to include more types of metadata; for now, it only returns the information that we found relevant internally.

The returned column hierarchy matches the one used in ORC (i.e. the root struct column is included), not the hierarchy of the cuDF dataframe that the file would be read as (where the children of the root column become top-level cuDF columns).
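
The difference between the two hierarchies can be illustrated with a minimal, self-contained sketch. The `column_node` type and the helper functions below are hypothetical stand-ins invented for this illustration. They are not the libcudf classes, and the actual API may expose the hierarchy differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one node of a file's column hierarchy.
// This is NOT a libcudf type; it only models "name + children".
struct column_node {
  std::string name;
  std::vector<column_node> children;
};

// ORC-style view: a single root struct column sits at the top of the tree.
// Dataframe-style view: the children of that root are the top-level columns.
// This helper returns the names a dataframe reader would show as top-level columns.
std::vector<std::string> top_level_column_names(column_node const& root)
{
  std::vector<std::string> names;
  names.reserve(root.children.size());
  for (auto const& child : root.children) {
    names.push_back(child.name);
  }
  return names;
}

// Returns the top-level (dataframe-style) column at the given position.
column_node const& top_level_column(column_node const& root, std::size_t index)
{
  assert(index < root.children.size());
  return root.children[index];
}

// Counts every node in the ORC-style hierarchy, including the root itself.
std::size_t count_nodes(column_node const& node)
{
  std::size_t total = 1;
  for (auto const& child : node.children) {
    total += count_nodes(child);
  }
  return total;
}
```

Under these assumptions, a file with an `id` column and a `point` struct column holding `x` and `y` has five nodes in the ORC-style tree (root, `id`, `point`, `x`, `y`) but only two top-level columns in the dataframe-style view.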

This PR also includes improvements to ORC reader benchmarks, enabled by the new metadata API.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - https://github.com/nvdbaranec

URL: #11815
GregoryKimball added the libcudf label Apr 2, 2023
rapids-bot bot pushed a commit that referenced this issue Jul 27, 2023
Closes #11675
Adds `read_parquet_metadata` to libcudf.
The metadata contains the following information:
- schema - (type, name, children)
- num_rows
- num_rowgroups
- key-value string metadata in file footer
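
As a rough sketch of how the four items listed above might fit together, the following self-contained example models them as a plain struct and formats a short summary. The `schema_node` and `file_metadata` types and the `describe` function are hypothetical names chosen for this illustration. They are not the classes added by the PR, whose exact interface is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical schema node carrying (type, name, children); not the libcudf class.
struct schema_node {
  std::string type;
  std::string name;
  std::vector<schema_node> children;
};

// Hypothetical container for the four pieces of information listed above.
struct file_metadata {
  schema_node schema;                            // root of the schema tree
  std::int64_t num_rows;                         // total rows in the file
  std::int64_t num_rowgroups;                    // number of row groups
  std::map<std::string, std::string> key_value;  // footer key-value strings
};

// Appends one line per schema node, indented two spaces per nesting level.
void append_schema(schema_node const& node, int depth, std::ostringstream& out)
{
  out << std::string(static_cast<std::size_t>(2 * depth), ' ') << node.name << ": " << node.type
      << '\n';
  for (auto const& child : node.children) {
    append_schema(child, depth + 1, out);
  }
}

// Builds a human-readable summary without touching any column data.
std::string describe(file_metadata const& meta)
{
  std::ostringstream out;
  out << "rows: " << meta.num_rows << '\n';
  out << "row groups: " << meta.num_rowgroups << '\n';
  for (auto const& kv : meta.key_value) {
    out << "kv " << kv.first << "=" << kv.second << '\n';
  }
  append_schema(meta.schema, 0, out);
  return out.str();
}
```

A caller could use a summary like this to decide how to read a file, for example how many row groups to assign to each reader, before any column data is decoded.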

To reviewers: please suggest any additional information the metadata should include. See #11214.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Divye Gala (https://github.com/divyegala)
  - Ray Douglass (https://github.com/raydouglass)

URL: #13663