spillable cache for GpuCartesianRDD #1784

Closed
sperlingxx wants to merge 39 commits into branch-0.5 from spill_cart_rdd

Conversation

sperlingxx (Collaborator):

Use SpillableColumnarBatch to cache stream-side data, so it does not need to be re-computed by the nested loop.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
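
For illustration, a minimal sketch of the idea (not this PR's actual diff): every stream-side batch is wrapped in a SpillableColumnarBatch so the device memory it holds can be spilled between passes of the nested loop, and it is re-materialized on demand for each build-side batch. The names cartesianWithSpillableCache, buildIter, streamIter, and joinOne are hypothetical stand-ins; the real change also handles closing on failure via the plugin's Arm helpers (closeOnExcept/withResource).

```scala
import scala.collection.mutable

import com.nvidia.spark.rapids.{SpillPriorities, SpillableColumnarBatch}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch: cache the stream side in spillable form, then replay it once per
// build-side batch instead of recomputing it for every pass.
def cartesianWithSpillableCache(
    buildIter: Iterator[ColumnarBatch],
    streamIter: Iterator[ColumnarBatch],
    joinOne: (ColumnarBatch, ColumnarBatch) => Unit): Unit = {
  val spillBatchBuffer = mutable.ArrayBuffer[SpillableColumnarBatch]()
  try {
    // Wrap each stream-side batch so its device memory can be spilled later.
    streamIter.foreach { cb =>
      spillBatchBuffer +=
        SpillableColumnarBatch(cb, SpillPriorities.ACTIVE_ON_DECK_PRIORITY)
    }
    buildIter.foreach { buildBatch =>
      spillBatchBuffer.foreach { spillable =>
        // Re-materialize the cached batch on demand for this pass of the loop.
        val streamBatch = spillable.getColumnarBatch()
        try {
          joinOne(buildBatch, streamBatch)
        } finally {
          streamBatch.close()
        }
      }
    }
  } finally {
    spillBatchBuffer.foreach(_.close())
  }
}
```
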
sperlingxx (Collaborator, Author):

build

// create a buffer to cache stream-side data in a spillable manner
val spillBatchBuffer = mutable.ArrayBuffer[SpillableColumnarBatch]()
closeOnExcept(spillBatchBuffer) { buffer =>
  rdd2.iterator(currSplit.s2, context).foreach { cb =>
Collaborator:

@jlowe what do you think about the following proposal?

It would be nice if we could combine this loop with the one below. My thinking is that the stream side is probably large enough that not all of it can fit in memory. That means we will likely spill as we populate spillBatchBuffer, and then spill again each time through the loop as we do the join. If we can combine the two loops so spillBatchBuffer is lazily populated the first time through the inner loop, then hopefully we will spill less because we touch the data one less time.
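
(For illustration, a rough sketch of what that lazy population might look like, assuming it sits where streamIter, spillBatchBuffer, and the cartesian loop of GpuCartesianRDD are in scope; those names are stand-ins, and the real change also needs to handle closing on failure and partially consumed iterators.)

```scala
// Sketch: the first pass over the build side pulls stream-side batches from
// the upstream iterator and caches them as it goes; later passes replay the
// spillable cache, so the stream-side data is read from upstream only once.
var streamSideCached = false

def streamBatches(): Iterator[ColumnarBatch] = {
  if (!streamSideCached) {
    streamSideCached = true
    streamIter.map { cb =>
      val spillable =
        SpillableColumnarBatch(cb, SpillPriorities.ACTIVE_ON_DECK_PRIORITY)
      spillBatchBuffer += spillable
      // Hand a live batch to this pass while keeping the spillable copy cached.
      spillable.getColumnarBatch()
    }
  } else {
    spillBatchBuffer.iterator.map(_.getColumnarBatch())
  }
}
```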

jlowe (Member):

Yes, agreed. I'd rather not add yet another loop through the data.

sperlingxx (Collaborator, Author):

I've replaced this with an implementation that populates the buffer lazily.

val spillBatchBuffer = mutable.ArrayBuffer[SpillableColumnarBatch]()
closeOnExcept(spillBatchBuffer) { buffer =>
  rdd2.iterator(currSplit.s2, context).foreach { cb =>
    // TODO: is it necessary to create a specific spill priority for spillBatchBuffer?
Collaborator:

The priority you set is fine. The issue is with avoiding spilling, and we may want to actually move to more dynamic spill priorities at some point.

      cb.getBatch, SpillPriorities.ACTIVE_ON_DECK_PRIORITY)
  }
}

rdd1.iterator(currSplit.s1, context).flatMap { lhs =>
  val table = withResource(lhs) { lhs =>
Collaborator:

If we are doing this right, we need to make lhs spillable too, because we are going to do the join and return multiple values while holding on to it. This means we will likely have to modify BroadcastNestedLoopJoinExecBase as well to deal with this. We might need to make it a functor or something like that, so we can keep the old behavior for broadcast nested loop join until we can update the broadcast tables to also be spillable, which is coming.
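
(For illustration, a sketch of what making lhs spillable could look like, assuming lhs is a ColumnarBatch and spillBatchBuffer already holds the cached stream side; the names and the foreach-based structure are stand-ins for the actual flatMap in GpuCartesianRDD.)

```scala
// Sketch: wrap the build-side batch too, so it can spill while joined output
// batches that depend on it are still being produced.
val spillableLhs =
  SpillableColumnarBatch(lhs, SpillPriorities.ACTIVE_ON_DECK_PRIORITY)
try {
  spillBatchBuffer.foreach { spillableRhs =>
    // Re-materialize both sides only for as long as this join step needs them.
    val lhsBatch = spillableLhs.getColumnarBatch()
    try {
      val rhsBatch = spillableRhs.getColumnarBatch()
      try {
        // join lhsBatch with rhsBatch and emit one output batch here
      } finally {
        rhsBatch.close()
      }
    } finally {
      lhsBatch.close()
    }
  }
} finally {
  spillableLhs.close()
}
```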

revans2 (Collaborator) commented Feb 22, 2021:

Also I am a little nervous about putting something like this into 0.4. It is for operators that are off by default, but I would feel better if we could retarget this for 0.5 so we have more time to test it.

jlowe (Member) left a comment:

> I would feel better if we could retarget this for 0.5 so we have more time to test it.

+1. This should target the 0.5 release.

sperlingxx (Collaborator, Author):

build

jlowe (Member) commented Feb 23, 2021:

This is still targeting branch-0.4. It needs to be retargeted to branch-0.5.

sperlingxx changed the base branch from branch-0.4 to branch-0.5 on March 1, 2021 at 10:30
NvTimLiu and others added 19 commits March 2, 2021 00:20
* Depend on the cuDF v0.18

Change rapids branch-0.4 to depend on cuDF v0.18 release jars

Prepare for the rapids v0.4.0 release

Signed-off-by: Tim Liu <timl@nvidia.com>

* cudf 0.17-SNAPSHOT to 0.17
… (NVIDIA#1808)

* mortgage support multiple dataset formats

change mortgage sample class to support dataset formats csv/orc/parquet

Signed-off-by: Tim Liu <timl@nvidia.com>

* Update

1, copyright 2021
2, throw an error if there are more than 5 arguments
3, match-case optimize

Signed-off-by: Tim Liu <timl@nvidia.com>

* Update

1, print some helpful info for the input arguments
2, exit instead of throwing an exception when arguments are wrongly set

* fix typo

* Fix Nothing value in 'case _ =>'

* update
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
Fix merge conflict with branch-0.4
* Spark 3.0.2 shim no longer a snapshot shim

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Remove 3.0.2-SNAPSHOT support
[auto-merge] branch-0.4 to branch-0.5 [skip ci] [bot]
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Gera Shegalov <gera@apache.org>

Add a shim provider for Spark 3.2.0 development branch. Closes NVIDIA#1490
- fix overflows in aggregate buffer for GpuSum by wiring the explicit output column type
- unit tests for the new shim
- consolidate version profiles in the parent pom
* Cleanup unused Jenkins files and scripts

NVIDIA#1568

Move Databricks scripts to GitLab so we can use the common scripts for the nightly build job and integration tests job

Remove unused Dockerfiles

Signed-off-by: Tim Liu <timl@nvidia.com>

* rm Dockerfile.integration.ubuntu16

* Restore Databricks nightly scripts

Signed-off-by: Tim Liu <timl@nvidia.com>
* Spark 3.1.1 shim no longer a snapshot shim

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Remove 3.1.0, 3.1.0-SNAPSHOT, and 3.1.1-SNAPSHOT support

* Remove obsolete comment
* Update to note support for 3.0.2

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Update FAQ to reflect 3.0.2 and 3.1.1 support

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
In the dataset-format 'Map', the 'Run.csv()/Run.orc()/Run.parquet' functions were all executed while the map was being built, which caused a dataset-format error because the dataset format in the current test is 'parquet'.

Change 'Run.csv()/Run.orc()/Run.parquet' into lambda expressions, to avoid running the 'Run.xxx()' functions while building the dataFrameFormatMap (a sketch of the idea follows this group of commits).

Signed-off-by: Tim Liu <timl@nvidia.com>
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Port apache/spark#31320 to close NVIDIA#1735

Signed-off-by: Gera Shegalov <gera@apache.org>
Fix merge conflict with branch-0.4
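
(Regarding the dataset-format commit a few entries above: the fix is the usual eager-versus-lazy map-values issue. A minimal sketch of the idea follows; the Run object, its reader signatures, and readByFormat are hypothetical stand-ins, not the mortgage sample's actual code.)

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical reader object standing in for the mortgage sample's Run class.
object Run {
  def csv(spark: SparkSession, path: String): DataFrame = spark.read.csv(path)
  def orc(spark: SparkSession, path: String): DataFrame = spark.read.orc(path)
  def parquet(spark: SparkSession, path: String): DataFrame = spark.read.parquet(path)
}

def readByFormat(spark: SparkSession, path: String, format: String): DataFrame = {
  // Storing thunks instead of results: nothing is read until the chosen format
  // is looked up, so a parquet test never tries to read the data as csv or orc.
  val dataFrameFormatMap: Map[String, () => DataFrame] = Map(
    "csv"     -> (() => Run.csv(spark, path)),
    "orc"     -> (() => Run.orc(spark, path)),
    "parquet" -> (() => Run.parquet(spark, path)))
  dataFrameFormatMap(format)()
}
```
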
sameerz and others added 6 commits March 2, 2021 13:56
* Update changelog for 0.4

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Update generate-changelog script

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
[auto-merge] branch-0.4 to branch-0.5 [skip ci] [bot]
* Refactor join code to reduce duplicated code

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Move nodeName override to base class
* Add shim for Spark 3.0.3

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Add premerge testing for Spark 3.0.2 and Spark 3.0.3
* Fix Part Suite Tests

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

* Addressed review comments
rdd2.iterator(currSplit.s2, context).map { serializableBatch =>
  closeOnExcept(spillBatchBuffer) { buffer =>
    val batch = SpillableColumnarBatch(
      serializableBatch.getBatch, SpillPriorities.ACTIVE_ON_DECK_PRIORITY)
jlowe (Member):

This is now missing the spillable callback argument that was added in #1719 which should be used to tie any spilling to spill metrics added to this exec node.
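
(For context, the fix would be roughly to thread that callback through when wrapping the batch. A hedged sketch follows; the spillCallback parameter name is an assumption based on this comment, not the verified spark-rapids signature.)

```scala
// Sketch only: `spillCallback` stands in for the callback added in #1719 that
// ties any spill of this cached batch to the exec node's spill metrics; the
// exact parameter name and type may differ in the actual API.
val batch = SpillableColumnarBatch(
  serializableBatch.getBatch,
  SpillPriorities.ACTIVE_ON_DECK_PRIORITY,
  spillCallback)
```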

sperlingxx (Collaborator, Author):

Hi @jlowe, I created another PR, #1878, for this improvement, because the current one is based on branch-0.4 (not branch-0.5).

jlowe (Member):

In the future a new PR isn't necessary when rebasing. You just need to retarget the PR base branch, as was already done for this PR, and then merge in the new base branch.

* Add shim for Spark 3.1.2

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Add Spark 3.1.2 to premerge testing
sameerz added the 'task' label (Work required that improves the product but is not user facing) on Mar 4, 2021
rongou and others added 10 commits March 4, 2021 08:30
* fix shuffle manager doc on ucx library path

Signed-off-by: Rong Ou <rong.ou@gmail.com>

* remove ld library path line

Signed-off-by: Rong Ou <rong.ou@gmail.com>
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Fix merge conflict with branch-0.4
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
jlowe (Member) commented Mar 5, 2021:

Superseded by #1878.

jlowe closed this on Mar 5, 2021.
sperlingxx deleted the spill_cart_rdd branch on April 8, 2021.