Chunking input before writing a ParquetCachedBatch #1265
Conversation
@@ -253,6 +253,16 @@ private case class CloseableColumnBatchIterator(iter: Iterator[ColumnarBatch]) e
 * This class assumes, the data is Columnar and the plugin is on
 */
class ParquetCachedBatchSerializer extends CachedBatchSerializer with Arm {
This change was made to gain access to the Inner class, open to suggestions to make this better
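The diff for that change is collapsed here, so the following is only a guessed shape of it: widening the visibility of a nested helper class so test code can reach it. `SerializerShape` and `InnerWriter` are stand-in names, not the plugin's actual classes.

```scala
package com.nvidia.spark.rapids

// Illustrative shape only: InnerWriter is a stand-in name, not the
// plugin's actual inner class.
class SerializerShape {
  // Was effectively private to the outer class; widened (illustratively)
  // to package scope so tests under com.nvidia.spark.rapids can reach it.
  private[rapids] class InnerWriter {
    def write(): Unit = ()
  }
}
```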
This is pretty much done. It's still marked WIP because the remaining issue is that the tests won't compile against Spark 3.0.0. I used reflection for some tests that were simple, but for others, especially the ones on the CPU, the tests became really hard to read once I tried reflection. @tgravescs, following up on our conversation, I think we should consider alternatives like you said; maybe these tests can live with the shim310?
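For reference, a rough sketch of the reflection approach mentioned above. The exact test wiring is not shown in this thread, so the invocation is left as a comment; only the class lookup by name and the by-name method search are assumed.

```scala
// Sketch: resolve the shim serializer reflectively so this source compiles
// even when the Spark 3.1.0 shim is absent from the compile-time classpath.
object ReflectionSketch {
  def main(args: Array[String]): Unit = {
    val clazz = Class.forName(
      "com.nvidia.spark.rapids.shims.spark310.ParquetCachedBatchSerializer")
    val serializer = clazz.getDeclaredConstructor().newInstance()
    // Find the method by name rather than by signature, since the parameter
    // types belong to classes we deliberately avoid referencing directly.
    val compress = clazz.getMethods
      .find(_.getName == "compressColumnarBatchWithParquet")
      .getOrElse(sys.error("method not found in this build"))
    // compress.invoke(serializer, batch) -- arguments depend on the real
    // signature, which this sketch does not assume.
    println(s"resolved ${compress.getName} via reflection")
  }
}
```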
As decided in sprint planning, I am removing the WIP tag and marking this for review. We will include tests for the CPU part once we have a framework to compile tests selectively.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
This commit should be reverted once we have a framework for selectively building tests based on Spark version
Force-pushed from a4835aa to 0f817e2
…ark310/ParquetCachedBatchSerializer.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
build
Blossom is timing out. Kicking it off again.
Build
build
* Chunking input

  Signed-off-by: Raza Jafri <rjafri@nvidia.com>
* improved tests
* updated
* rearranged imports
* renamed the tests
* REVERT THIS COMMIT

  This commit should be reverted once we have a framework for selectively building tests based on Spark version
* Update shims/spark310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala

  Co-authored-by: Jason Lowe <jlowe@nvidia.com>
* build failure
* empty commit to kick off ci
* fix for building against Spark 3.0.0
* addressed review comments
* addressed review comments

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…IDIA#1265)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
This PR deals with breaking up the incoming batches before writing them to a ParquetCachedBatch, to make sure the cached buffer never exceeds Int.MaxValue. I am still working on tests for the CPU write path; so far this PR only has tests for compressColumnarBatchWithParquet.
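To make the chunking idea concrete, here is a minimal sketch under stated assumptions, not the plugin's actual implementation: `MaxChunkBytes`, `approxBytesPerRow`, and `chunkRowRanges` are hypothetical names, and the real code operates on ColumnarBatch data rather than bare row counts.

```scala
// Sketch of the chunking idea: split a batch into [start, end) row ranges
// so that no single serialized buffer approaches Int.MaxValue.
object ChunkingSketch {
  // Per-chunk ceiling; the real code guards against Int.MaxValue.
  val MaxChunkBytes: Long = Int.MaxValue.toLong

  /** Split numRows rows into row ranges whose estimated size, at
   *  approxBytesPerRow bytes per row, stays under MaxChunkBytes. */
  def chunkRowRanges(numRows: Long, approxBytesPerRow: Long): Seq[(Long, Long)] = {
    require(approxBytesPerRow > 0, "need a positive per-row size estimate")
    val rowsPerChunk = math.max(1L, MaxChunkBytes / approxBytesPerRow)
    (0L until numRows by rowsPerChunk)
      .map(start => (start, math.min(start + rowsPerChunk, numRows)))
  }
}
```

For instance, `chunkRowRanges(10L, Int.MaxValue / 4L)` yields the ranges (0,4), (4,8), (8,10), each small enough that its estimated serialized size stays under the ceiling.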