Add framework for batch compression of shuffle partitions #487

Merged: 13 commits merged into NVIDIA:branch-0.2 on Aug 11, 2020

@jlowe jlowe commented Jul 31, 2020

This adds TableCompressionCodec, an interface for compression codecs that can compress contiguous table buffers. A single trivial copy codec is provided just for testing. Other codecs can be added in the future.

Batch compression is supported in the partitioning code and is used when running with the RAPIDS UCX shuffle plugin, which produces contiguous tables. Batch decompression is supported in the batch coalescing step that is inserted into the query after each shuffle. The coalesce code automatically detects compressed shuffle buffers and handles a mix of uncompressed and compressed batches arriving to be coalesced.
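
To make the flow above concrete, here is a minimal sketch of a codec interface, a copy-only codec, and a coalesce step that accepts a mix of compressed and uncompressed batches. Everything in it is an assumption made for illustration: the names `ShuffleBatch`, `TableCodecSketch`, `CopyCodecSketch`, and `CoalesceSketch` are hypothetical, and plain `Array[Byte]` stands in for the contiguous GPU table buffers. It is not the actual `TableCompressionCodec` API from this PR.

```scala
// Hypothetical stand-ins: real shuffle buffers are contiguous GPU tables, not Array[Byte].
sealed trait ShuffleBatch
final case class Uncompressed(data: Array[Byte]) extends ShuffleBatch
final case class Compressed(codecId: Byte, uncompressedSize: Int, data: Array[Byte])
  extends ShuffleBatch

// Illustrative codec interface; the PR's TableCompressionCodec may look different.
trait TableCodecSketch {
  def codecId: Byte
  def compress(batch: Uncompressed): Compressed
  def decompress(batch: Compressed): Uncompressed
}

// Trivial codec that only copies bytes, in the spirit of a testing-only copy codec.
object CopyCodecSketch extends TableCodecSketch {
  val codecId: Byte = 0

  def compress(batch: Uncompressed): Compressed =
    Compressed(codecId, batch.data.length, batch.data.clone())

  def decompress(batch: Compressed): Uncompressed = {
    require(batch.codecId == codecId, s"unexpected codec id ${batch.codecId}")
    Uncompressed(batch.data.clone())
  }
}

object CoalesceSketch {
  // Registry of codecs keyed by the id recorded on each compressed batch.
  private val codecs: Map[Byte, TableCodecSketch] =
    Map(CopyCodecSketch.codecId -> CopyCodecSketch)

  // Decompress only the batches that arrive compressed, then concatenate in order.
  def coalesce(batches: Seq[ShuffleBatch]): Array[Byte] = {
    val pieces: Seq[Array[Byte]] = batches.map {
      case Uncompressed(data) => data
      case c: Compressed      => codecs(c.codecId).decompress(c).data
    }
    Array.concat(pieces: _*)
  }
}
```

In this sketch the compressed batch carries its own codec id, so the coalesce step needs no external flag to tell which batches require decompression. That is one plausible way to handle mixed input, not necessarily the one the PR uses.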

If desired, I can split this PR into smaller pieces (e.g., one PR that adds just TableCompressor and the copy codec, then a separate PR to plug it into partitioning and coalescing). I opted for a single PR initially so it is clear how the table compression APIs are used in practice.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe added SQL part of the SQL/Dataframe plugin shuffle things that impact the shuffle plugin labels Jul 31, 2020
@jlowe jlowe self-assigned this Jul 31, 2020

jlowe commented Jul 31, 2020

build


jlowe commented Aug 3, 2020

build

@abellina abellina left a comment

This is just a first pass, @jlowe. I still have more than half of the changes left to review.


abellina commented Aug 4, 2020

@jlowe I got through the rest of it. Let me know when to re-check.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
Signed-off-by: Jason Lowe <jlowe@nvidia.com>

jlowe commented Aug 4, 2020

@revans2 @abellina I believe I've addressed all the review comments, so this is ready for another look.


jlowe commented Aug 4, 2020

build

abellina
abellina previously approved these changes Aug 5, 2020
@jlowe jlowe marked this pull request as draft August 6, 2020 22:29

jlowe commented Aug 6, 2020

This needs to be upmerged to the latest on branch-0.2.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
Signed-off-by: Jason Lowe <jlowe@nvidia.com>

jlowe commented Aug 7, 2020

build


jlowe commented Aug 7, 2020

build

abellina
abellina previously approved these changes Aug 8, 2020
Signed-off-by: Jason Lowe <jlowe@nvidia.com>

jlowe commented Aug 10, 2020

build


jlowe commented Aug 10, 2020

@revans2 can you take another look? I needed to upmerge to the latest on branch-0.2.

revans2
revans2 previously approved these changes Aug 10, 2020
@revans2 revans2 left a comment

Looks good to me. I am fine with it as is; I was just curious whether a single config would be cleaner.

@@ -590,8 +590,27 @@ object RapidsConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefault(50 * 1024)

val SHUFFLE_COMPRESSION_ENABLED = conf("spark.rapids.shuffle.compression.enabled")
Collaborator

I am OK with two configs for this, but why not just set the codec to NONE, or something similar, to disable compression?

Member Author

I was emulating Spark's shuffle config, where the compression codec setting and the compression enable setting are separate.

Member Author

Updated this to use a single config for the shuffle codec, with "none" indicating that no shuffle compression codec should be used.
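
As a rough illustration of the single-config approach, the sketch below resolves one codec setting, with "none" meaning compression is disabled. The key name, the default value, and the set of known codec names are assumptions made for the example. `ShuffleCodecConfSketch` is a hypothetical helper and is not part of `RapidsConf`.

```scala
import java.util.Locale

object ShuffleCodecConfSketch {
  // Hypothetical key name, chosen for illustration only.
  val CodecKey = "spark.rapids.shuffle.compression.codec"

  // Codec names accepted by this sketch; "copy" stands in for a testing-only copy codec.
  private val knownCodecs = Set("copy")

  // Returns None when compression is disabled, otherwise the normalized codec name.
  // Defaulting to "none" when the key is unset is an assumption of this sketch.
  def resolve(settings: Map[String, String]): Option[String] = {
    val name = settings.getOrElse(CodecKey, "none").trim.toLowerCase(Locale.ROOT)
    name match {
      case "none"                       => None
      case n if knownCodecs.contains(n) => Some(n)
      case other =>
        throw new IllegalArgumentException(s"Unknown shuffle compression codec: $other")
    }
  }
}
```

With a single setting there is no way to end up with compression enabled but no codec selected, which is the kind of inconsistent state two separate configs would allow.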


jlowe commented Aug 11, 2020

Holding off on the CI build for now. I am waiting on the HashJoin fix for 3.1 and on an updated cudf that includes the fix for rapidsai/cudf#5915, which was found by the unit tests added in this PR.


jlowe commented Aug 11, 2020

build

@jlowe jlowe merged commit 3985f67 into NVIDIA:branch-0.2 Aug 11, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Add framework for batch compression of shuffle partitions

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Extract common row accessor methods to GpuColumnVectorBase

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Address review comments

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Fix handling of degenerate batches

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Fix buffer leak

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Add comment about degenerate batches potentially appearing compressed

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Use a shuffle compression codec of "none" to indicate no compression

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe deleted the table-compressor branch September 10, 2021 15:31
pxLi pushed a commit to pxLi/spark-rapids that referenced this pull request May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>