
Add ability to request Parquet encodings on a per-column basis #15081

Merged: 30 commits merged into rapidsai:branch-24.04 on Mar 6, 2024

Conversation

@etseidl (Contributor) commented Feb 16, 2024

Description

Allows users to request specific page encodings on a column-by-column basis. This is accomplished by adding an encoding property to the column_in_metadata struct. This is a necessary prerequisite for adding DELTA_BYTE_ARRAY encoding.
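For illustration, a minimal sketch of the call pattern this enables (the column name and the choice of `DELTA_BINARY_PACKED` are only examples; the usage mirrors the snippet quoted from #15411 further down in this thread):

```c++
#include <cudf/io/types.hpp>
#include <cudf/table/table_view.hpp>

// Request DELTA_BINARY_PACKED pages for the first column of `table`. The
// encoding is a request, not a guarantee; the writer may fall back if the
// column type does not support it.
cudf::io::table_input_metadata make_metadata(cudf::table_view const& table)
{
  cudf::io::table_input_metadata metadata(table);
  metadata.column_metadata[0]
    .set_name("int_col")  // illustrative column name
    .set_encoding(cudf::io::column_encoding::DELTA_BINARY_PACKED);
  return metadata;
}
```

The resulting metadata would then be attached to the Parquet writer options in the usual way.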

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl requested a review from a team as a code owner February 16, 2024 23:38

copy-pr-bot bot commented Feb 16, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Feb 16, 2024
/**
 * @brief Valid parquet encodings for use with `column_in_metadata::set_encoding()`
 */
struct parquet_encoding {
etseidl (author):

Note on my thinking here (or lack thereof)... I was thinking that using strings to specify the desired encoding would be better than an enum, since column_in_metadata is shared between multiple encoders, and strings would be more natural to use from a CLI or through the Python interface. And if ORC has different encoding names, then we could add another set of constants for that.

But a user interface could translate string values to an enum, and the enum could just add as many enumerators as necessary, some not relevant to one implementation or the other, so maybe this is silly. This acts like a scoped enum already, so I'm not opposed to switching.

Reviewer (Contributor):

It does seem like this could be an enum :) I don't think a lot of code would change.

Reviewer (Contributor):

Actually I rushed a bit with this comment. Would this become just encoding and contain a superset of all encoding types?

etseidl (author):

Yeah, probably, or maybe page_encoding. But it would have to apply to all encoders that use the metadata (which I assume is just Parquet and ORC for now).

etseidl (author):

page_encoding isn't a great name for ORC... how about column_encoding?

Reviewer (Contributor):

Can't believe I forgot my boy ORC. Sounds good!
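For reference, a rough sketch of the shared scoped enum this naming discussion converged on; only the enumerators mentioned elsewhere in this PR are listed, and the merged header defines more:

```c++
// Sketch only; not the exact contents of the merged header.
enum class column_encoding {
  NOT_SET,              // default: let the writer choose an encoding
  DELTA_BINARY_PACKED,  // Parquet delta encoding for integer columns
  DELTA_BYTE_ARRAY,     // Parquet delta encoding for byte arrays (added in a follow-up PR)
  // ... additional encodings
};
```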

@harrism removed their request for review February 21, 2024 04:09
Resolved (outdated) review threads on cpp/src/io/parquet/writer_impl.cu and cpp/src/io/parquet/parquet_common.hpp
@vuule added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Feb 27, 2024
@vuule self-requested a review February 27, 2024 20:14
@vuule (Contributor) left a comment:

Suggested expanded logging to ensure we never silently ignore a requested encoding.

Resolved (outdated) review threads on cpp/src/io/parquet/writer_impl.cu
@etseidl (Contributor, author) commented Mar 5, 2024

Test checked in... but the last CI run failed the style check because of some Python issue?

@vuule (Contributor) commented Mar 5, 2024

> Test checked in... but the last CI run failed the style check because of some Python issue?

Yeah, that's a new one. I think there was a two-hour window when CI passed 💀

@vuule (Contributor) commented Mar 5, 2024

/ok to test

@@ -585,6 +605,7 @@ class column_in_metadata {
std::optional<uint8_t> _decimal_precision;
std::optional<int32_t> _parquet_field_id;
std::vector<column_in_metadata> children;
column_encoding _encoding = column_encoding::NOT_SET;
Reviewer (Contributor):

Suggested change:
- column_encoding _encoding = column_encoding::NOT_SET;
+ column_encoding _encoding{column_encoding::NOT_SET};

etseidl (author):

Nothing else in the struct (or AFAICT in this file) uses an initializer list...maybe clean this up separately?
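For context, a self-contained sketch of how the `_encoding` member and its fluent setter fit together. This is not libcudf's actual class; `get_encoding()` and the exact signatures are assumptions, and the include path assumes `column_encoding` lives alongside `column_in_metadata` in cudf/io/types.hpp. Returning `*this` is what enables the chained `set_name(...).set_encoding(...)` calls shown in the usage example quoted later in this thread:

```c++
#include <cudf/io/types.hpp>  // assumed location of cudf::io::column_encoding

// Self-contained sketch (not libcudf's real column_in_metadata) showing how
// the new _encoding member pairs with a fluent setter; returning *this is
// what allows chained calls such as set_name(...).set_encoding(...).
class column_in_metadata_sketch {
  cudf::io::column_encoding _encoding = cudf::io::column_encoding::NOT_SET;

 public:
  column_in_metadata_sketch& set_encoding(cudf::io::column_encoding encoding)
  {
    _encoding = encoding;
    return *this;
  }

  [[nodiscard]] cudf::io::column_encoding get_encoding() const { return _encoding; }
};
```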

@PointKernel (Member) left a comment:

Looks great. Feel free to merge once you have a second approval.

@PointKernel (Member):
/ok to test

@vuule requested a review from ttnghia March 5, 2024 22:08
@ttnghia (Contributor) left a comment:

Some nits but otherwise LGTM.

@GregoryKimball (Contributor) commented Mar 5, 2024

Perhaps not in this PR, but we will ultimately want to expose this in cuDF-python. In pandas, to_parquet with the pyarrow engine supports the same column_encoding parameter.

@etseidl (Contributor, author) commented Mar 5, 2024

> Perhaps not in this PR, but we will ultimately want to expose this in cuDF-python. pyarrow supports the same column_encoding parameter.

That was my original plan, but that's a heavier lift and I just wanted the bare minimum to at least be able to test new encoders. The pain point with how pyarrow does it is knowing in advance what the column names will be, especially for nested columns (see #14539, for instance).

etseidl and others added 2 commits March 5, 2024 14:23
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
@vuule (Contributor) commented Mar 5, 2024

/ok to test

@PointKernel (Member):
/merge

@rapids-bot rapids-bot bot merged commit dbf7236 into rapidsai:branch-24.04 Mar 6, 2024
74 checks passed
@etseidl deleted the select_encodings branch March 6, 2024 15:57
rapids-bot bot pushed a commit that referenced this pull request Mar 8, 2024
Re-submission of #14938. Final (delta) piece of #13501.

Adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY. Python testing will be added as a follow-up when per-column encoding selection is added to the Python API (ref this [comment](#15081 (comment))).

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #15239
rapids-bot bot pushed a commit that referenced this pull request Apr 18, 2024
…15411)

#15081 added the ability to select per-column encodings in the Parquet writer. Some Parquet encodings (e.g `DELTA_BINARY_PACKED`) do not mix well with compression (see [PARQUET-2414](https://issues.apache.org/jira/browse/PARQUET-2414) for example). This PR adds the ability to turn off compression for select columns. This uses the same mechanism as encoding selection, so an example use would be:
```c++
  cudf::io::table_input_metadata table_metadata(table);
  table_metadata.column_metadata[0]
    .set_name("int_delta_binary")
    .set_encoding(cudf::io::column_encoding::DELTA_BINARY_PACKED)
    .set_skip_compression(true);
```

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Bradley Dice (https://github.com/bdice)

URL: #15411
rapids-bot bot pushed a commit that referenced this pull request May 22, 2024
…PI (#15613)

Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses "Add cuDF-python API support for specifying encodings" task in #13501.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15613