
Chunking input before writing a ParquetCachedBatch #1265

Merged: 13 commits into NVIDIA:branch-0.3 on Dec 7, 2020

Conversation

@razajafri (Collaborator)

This PR breaks up the incoming batches before writing them to a ParquetCachedBatch, so that a CachedBuffer never exceeds Int.MaxValue bytes.

I am still working on tests for the CPU write path. So far this PR only has tests for compressColumnarBatchWithParquet.
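
To make the intent concrete, here is a minimal sketch of the chunking idea in Scala. The names (ChunkingSketch, BytesAllowedPerBatch, chunkBySize) and the greedy packing policy are illustrative assumptions, not the PR's actual code:

```scala
// A minimal sketch of size-capped chunking; all names here are illustrative.
object ChunkingSketch {
  // Assumed cap: stay a little below Int.MaxValue so a single cached
  // buffer can always be addressed with an Int-sized length.
  val BytesAllowedPerBatch: Long = Int.MaxValue - 1024L

  /** Greedily packs per-row byte sizes into chunks that each stay under `cap`. */
  def chunkBySize(rowSizes: Seq[Long], cap: Long = BytesAllowedPerBatch): Seq[Seq[Long]] = {
    val chunks = scala.collection.mutable.ListBuffer.empty[Seq[Long]]
    val current = scala.collection.mutable.ListBuffer.empty[Long]
    var currentBytes = 0L
    for (size <- rowSizes) {
      // Close out the current chunk before it would exceed the cap.
      if (current.nonEmpty && currentBytes + size > cap) {
        chunks += current.toList
        current.clear()
        currentBytes = 0L
      }
      current += size
      currentBytes += size
    }
    if (current.nonEmpty) chunks += current.toList
    chunks.toList
  }
}
```

Each resulting chunk would then be written as its own ParquetCachedBatch, so no single cached buffer crosses the limit.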

@@ -253,6 +253,16 @@ private case class CloseableColumnBatchIterator(iter: Iterator[ColumnarBatch]) e
* This class assumes, the data is Columnar and the plugin is on
*/
class ParquetCachedBatchSerializer extends CachedBatchSerializer with Arm {

@razajafri (Collaborator Author)

This change was made to gain access to the inner class; open to suggestions on how to do this better.
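
For context, a common way to make an inner helper reachable from tests is to widen its visibility from private to package-private; the sketch below uses hypothetical names (SerializerSketch, ChunkedWriter) and only illustrates that pattern, not the PR's actual change:

```scala
package com.nvidia.spark.rapids.sketch

// Hypothetical outer class standing in for the serializer.
class SerializerSketch {
  // Before: `private class ChunkedWriter` was invisible to test code.
  // After: package-private visibility lets same-package tests construct it
  // directly instead of going through reflection.
  private[sketch] class ChunkedWriter(val chunkSizeBytes: Long) {
    def chunksNeeded(totalBytes: Long): Int =
      math.ceil(totalBytes.toDouble / chunkSizeBytes).toInt
  }
}
```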

@razajafri (Collaborator Author)

This is pretty much done. It's still a WIP because the remaining issue is that the tests won't compile against Spark 3.0.0. I used reflection for some of the simpler tests, but for others, especially the CPU ones, the tests became very hard to read once reflection was involved.

@tgravescs, following up on our conversation: I think we should consider alternatives like you said; maybe these tests can live in shim310?
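
For reference, the reflection approach mentioned above looks roughly like the sketch below; the helper and its names are placeholders, not the PR's actual test code. The test resolves a version-specific class by name at runtime so the source still compiles against Spark 3.0.0:

```scala
// Illustrative only: load and drive a version-specific class without a
// compile-time dependency on it.
object ReflectiveTestHelper {
  // Assumes the target class has a public no-arg constructor.
  def newInstanceByName(className: String): Any = {
    val clazz = Class.forName(className)
    clazz.getDeclaredConstructor().newInstance()
  }

  // Invokes a public no-arg method on the given instance.
  def invokeNoArg(target: Any, methodName: String): Any =
    target.getClass.getMethod(methodName).invoke(target)
}
```

This keeps compilation version-agnostic, but as noted, multi-step test logic written this way gets hard to read quickly.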

@razajafri (Collaborator Author)

As decided in sprint planning, I am removing the WIP tag and marking this ready for review. We will add tests for the CPU path once we have a framework to compile tests selectively.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
This commit should be reverted once we have a framework for selectively
building tests based on Spark version
@razajafri changed the title from "[WIP] Chunking input before writing a ParquetCachedBatch" to "Chunking input before writing a ParquetCachedBatch" on Dec 4, 2020
…ark310/ParquetCachedBatchSerializer.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@razajafri (Collaborator Author) commented Dec 5, 2020

I have created follow-ons to address your comments, @jlowe @sameerz.

@jlowe previously approved these changes Dec 5, 2020
@razajafri (Collaborator Author)

build

@razajafri (Collaborator Author)

build

@razajafri (Collaborator Author)

build

@GaryShen2008 (Collaborator)

build

@razajafri (Collaborator Author)

build

@razajafri (Collaborator Author)

build

@razajafri (Collaborator Author)

Blossom is timing out. Kicking it off again.

@razajafri (Collaborator Author)

Build

@razajafri (Collaborator Author)

Build

@revans2 previously approved these changes Dec 7, 2020
@razajafri (Collaborator Author)

build

@razajafri merged commit 6a53452 into NVIDIA:branch-0.3 on Dec 7, 2020
@razajafri deleted the chunking_input branch on December 7, 2020 at 19:03
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Chunking input

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* improved tests

* updated

* rearranged imports

* renamed the tests

* REVERT THIS COMMIT

This commit should be reverted once we have a framework for selectively
building tests based on Spark version

* Update shims/spark310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* build failure

* empty commit to kick off ci

* fix for building against Spark 3.0.0

* addressed review comments

* addressed review comments

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1265)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels: Spark 3.1+ (Bugs only related to Spark 3.1 or higher)