
Chunking input before writing a ParquetCachedBatch #1265

Merged: 13 commits into NVIDIA:branch-0.3 on Dec 7, 2020

Conversation

@razajafri (Collaborator)

This PR breaks up the incoming batches before writing them to a ParquetCachedBatch, so that a CachedBuffer never exceeds Int.MaxValue bytes.

I am still working on tests for the CPU write path. So far this PR only has tests for compressColumnarBatchWithParquet.
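
To make the intent concrete, here is a minimal sketch of the chunking idea in Scala. The names (ChunkingSketch, BytesAllowedPerBatch, chunkBySize) and the greedy packing policy are illustrative assumptions, not the PR's actual code:

```scala
// A minimal sketch of size-capped chunking; all names here are illustrative.
object ChunkingSketch {
  // Assumed cap: stay a little below Int.MaxValue so a single cached
  // buffer can always be addressed with an Int-sized length.
  val BytesAllowedPerBatch: Long = Int.MaxValue - 1024L

  /** Greedily packs per-row byte sizes into chunks that each stay under `cap`. */
  def chunkBySize(rowSizes: Seq[Long], cap: Long = BytesAllowedPerBatch): Seq[Seq[Long]] = {
    val chunks = scala.collection.mutable.ListBuffer.empty[Seq[Long]]
    val current = scala.collection.mutable.ListBuffer.empty[Long]
    var currentBytes = 0L
    for (size <- rowSizes) {
      // Close out the current chunk before it would exceed the cap.
      if (current.nonEmpty && currentBytes + size > cap) {
        chunks += current.toList
        current.clear()
        currentBytes = 0L
      }
      current += size
      currentBytes += size
    }
    if (current.nonEmpty) chunks += current.toList
    chunks.toList
  }
}
```

Each resulting chunk would then be written as its own ParquetCachedBatch, so no single cached buffer crosses the limit.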

@@ -253,6 +253,16 @@ private case class CloseableColumnBatchIterator(iter: Iterator[ColumnarBatch]) e
* This class assumes, the data is Columnar and the plugin is on
*/
class ParquetCachedBatchSerializer extends CachedBatchSerializer with Arm {

@razajafri (Collaborator Author)

This change was made to gain access to the inner class; open to suggestions on how to do this better.
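
For context, a common way to make an inner helper reachable from tests is to widen its visibility from private to package-private; the sketch below uses hypothetical names (SerializerSketch, ChunkedWriter) and only illustrates that pattern, not the PR's actual change:

```scala
package com.nvidia.spark.rapids.sketch

// Hypothetical outer class standing in for the serializer.
class SerializerSketch {
  // Before: `private class ChunkedWriter` was invisible to test code.
  // After: package-private visibility lets same-package tests construct it
  // directly instead of going through reflection.
  private[sketch] class ChunkedWriter(val chunkSizeBytes: Long) {
    def chunksNeeded(totalBytes: Long): Int =
      math.ceil(totalBytes.toDouble / chunkSizeBytes).toInt
  }
}
```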

@razajafri (Collaborator Author)

This is pretty much done. It's still a WIP because the remaining issue is that the tests won't compile against Spark 3.0.0. I used reflection for some of the simpler tests, but for others, especially the CPU ones, the tests became very hard to read once reflection was involved.

@tgravescs, following up on our conversation: I think we should consider alternatives like you said; maybe these tests can live in shim310?
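
For reference, the reflection approach mentioned above looks roughly like the sketch below; the helper and its names are placeholders, not the PR's actual test code. The test resolves a version-specific class by name at runtime so the source still compiles against Spark 3.0.0:

```scala
// Illustrative only: load and drive a version-specific class without a
// compile-time dependency on it.
object ReflectiveTestHelper {
  // Assumes the target class has a public no-arg constructor.
  def newInstanceByName(className: String): Any = {
    val clazz = Class.forName(className)
    clazz.getDeclaredConstructor().newInstance()
  }

  // Invokes a public no-arg method on the given instance.
  def invokeNoArg(target: Any, methodName: String): Any =
    target.getClass.getMethod(methodName).invoke(target)
}
```

This keeps compilation version-agnostic, but as noted, multi-step test logic written this way gets hard to read quickly.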

@razajafri (Collaborator Author)

As decided in sprint planning, I am removing the WIP tag and marking this ready for review. We will add tests for the CPU path once we have a framework to compile tests selectively.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
This commit should be reverted once we have a framework for selectively
building tests based on Spark version
@razajafri changed the title from "[WIP] Chunking input before writing a ParquetCachedBatch" to "Chunking input before writing a ParquetCachedBatch" on Dec 4, 2020
…ark310/ParquetCachedBatchSerializer.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@razajafri (Collaborator Author) commented Dec 5, 2020

I have created follow-ons to address your comments, @jlowe @sameerz.

@jlowe previously approved these changes Dec 5, 2020
@razajafri (Collaborator Author)

build

@razajafri (Collaborator Author)

build

@razajafri (Collaborator Author)

build

@GaryShen2008 (Collaborator)

build

@razajafri (Collaborator Author)

build

@razajafri (Collaborator Author)

build

@razajafri (Collaborator Author)

Blossom is timing out. Kicking it off again.

@razajafri (Collaborator Author)

Build

@razajafri (Collaborator Author)

Build

@revans2 previously approved these changes Dec 7, 2020
@razajafri (Collaborator Author)

build

@razajafri merged commit 6a53452 into NVIDIA:branch-0.3 on Dec 7, 2020
@razajafri deleted the chunking_input branch on December 7, 2020 at 19:03
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Chunking input

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* improved tests

* updated

* rearranged imports

* renamed the tests

* REVERT THIS COMMIT

This commit should be reverted once we have a framework for selectively
building tests based on Spark version

* Update shims/spark310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* build failure

* empty commit to kick off ci

* fix for building against Spark 3.0.0

* addressed review comments

* addressed review comments

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1265)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels: Spark 3.1+ (Bugs only related to Spark 3.1 or higher)