Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable RLE boolean encoding for v2 Parquet files #13886

Merged
merged 8 commits into from
Aug 17, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Aug 15, 2023

Description

While working on #13707 it was noticed that RLE encoding of booleans had been implemented and then disabled (see this comment for details). This PR re-enables RLE encoding for booleans, but only when V2 headers are being used.

Part of #13501.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner August 15, 2023 22:07
@etseidl etseidl requested review from vyasr and bdice August 15, 2023 22:07
@rapids-bot
Copy link

rapids-bot bot commented Aug 15, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 15, 2023
@GregoryKimball GregoryKimball added the cuIO cuIO issue label Aug 16, 2023
@vuule vuule added breaking Breaking change feature request New feature or request labels Aug 16, 2023
@vuule
Copy link
Contributor

vuule commented Aug 16, 2023

/ok to test

cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Show resolved Hide resolved
@bdice
Copy link
Contributor

bdice commented Aug 16, 2023

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few comments.

cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Show resolved Hide resolved
if (dict_bits >= 0 && physical_type != BOOLEAN) {
dst[0] = dict_bits;
s->rle_out = dst + 1;
} else if (is_v2 && physical_type == BOOLEAN) {
// save space for RLE length. we don't know the total length yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is such adjustment only needed for bool RLE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So RLE is only used for 1) rep/def levels, 2) dictionary indexes, and 3) booleans. So booleans are the only data type that will be RLE encoded...everything else will be plain or dictionary (or eventually delta).

There's actually a table in the spec that lists when a length is prepended and when it is not.

+--------------+------------------------+-----------------+
| Page kind    | RLE-encoded data kind  | Prepend length? |
+--------------+------------------------+-----------------+
| Data page v1 | Definition levels      | Y               |
|              | Repetition levels      | Y               |
|              | Dictionary indices     | N               |
|              | Boolean values         | Y               |
+--------------+------------------------+-----------------+
| Data page v2 | Definition levels      | N               |
|              | Repetition levels      | N               |
|              | Dictionary indices     | N               |
|              | Boolean values         | Y               |
+--------------+------------------------+-----------------+

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I get it, thanks.
And this is something that does not make sense to include in fragment size?
I'm asking because this "max RLE page size" is implemented as a util function in the ORC writer, so this manual padding looked odd in comparison.

BTW, must be nice not having RLE in almost every data type (cries in ORC).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is something that does not make sense to include in fragment size?

yeah, because it's just once per page, not every fragment. although, yeah, maybe it should be included in the calculation at line 451. I'll ponder that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think that fixed it. nice catch! That would have been a lot of fun tracking down later :P

@vuule
Copy link
Contributor

vuule commented Aug 17, 2023

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@vuule
Copy link
Contributor

vuule commented Aug 17, 2023

/merge

@rapids-bot rapids-bot bot merged commit 41f0caf into rapidsai:branch-23.10 Aug 17, 2023
54 checks passed
@etseidl etseidl deleted the feature/enable_rle_bool branch August 17, 2023 04:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants