
Update partitioning logic in ShuffledBatchRDD #319

Merged: 3 commits into NVIDIA:branch-0.2 on Aug 3, 2020

Conversation

andygrove
Contributor

@andygrove andygrove commented Jul 2, 2020

This PR updates the logic in ShuffledBatchRDD to reflect recent changes in Spark's ShuffledRowRDD related to AQE support.

Note that there are outstanding questions about how these changes affect UCX, so UCX is disabled for now when AQE is enabled.
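For context, Spark 3.0's AQE reworked ShuffledRowRDD to describe each output partition with a ShufflePartitionSpec rather than a plain reducer index, and ShuffledBatchRDD has to handle the same shapes. A rough sketch of the three spec variants, assuming Spark 3.0's org.apache.spark.sql.execution API (simplified, not the code merged in this PR):

```scala
import org.apache.spark.sql.execution.{CoalescedPartitionSpec,
  PartialMapperPartitionSpec, PartialReducerPartitionSpec, ShufflePartitionSpec}

// Each output partition of the shuffled RDD carries one spec, and
// compute() picks the matching shuffle reader.
def describe(spec: ShufflePartitionSpec): String = spec match {
  // Post-shuffle coalescing: read a contiguous range of reducer partitions.
  case CoalescedPartitionSpec(startReducer, endReducer) =>
    s"read reducer partitions [$startReducer, $endReducer)"
  // Skew handling: read one reducer's data from a subset of map outputs.
  case PartialReducerPartitionSpec(reducer, startMap, endMap) =>
    s"read maps [$startMap, $endMap) for reducer $reducer"
  // Local shuffle read: read a range of reducer outputs from one map task.
  case PartialMapperPartitionSpec(mapIndex, startReducer, endReducer) =>
    s"read reducers [$startReducer, $endReducer) from map $mapIndex"
}
```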

@andygrove andygrove requested a review from abellina July 2, 2020 22:30
@andygrove
Contributor Author

@abellina Could you take a look when you get a chance? This is one of the changes I had to make when working on the AQE POC.

@andygrove andygrove self-assigned this Jul 2, 2020
@andygrove andygrove added the feature request New feature or request label Jul 2, 2020
@andygrove andygrove added this to the Jul 6 - Jul 17 milestone Jul 2, 2020
@andygrove
Contributor Author

build

@kuhushukla
Collaborator

I'll try and chime in this weekend on this change.

@abellina
Collaborator

abellina commented Jul 2, 2020

To me, at a high level, this looks fine. It echoes the changes made in the row-based ShuffledRowRDD. The question I have is how it works with the shuffle plugin. Given AQE with the shuffle plugin, the expectation is that this will likely fail, so we may want to turn off the shuffle plugin in that case (it could be a separate bug). I'll also test the shuffle plugin with your patch (AQE).

@abellina
Collaborator

abellina commented Jul 6, 2020

@andygrove I haven't had time to test this yet, but I am fairly sure it will fail if the shuffle plugin is enabled. I'll try this today, likely with some help from you (not sure what I can run with it enabled).

Here's what I am thinking: the writes are going to be cached and stamped as RAPIDS blocks, and then the reads are going to ignore this altogether (getReaderForRange falls back as if it were a CPU shuffle). We do need to implement getReaderForRange for the plugin, but one way to get around this is to fall back to the legacy shuffle for now if AQE is on (https://github.com/NVIDIA/spark-rapids/blob/branch-0.2/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/RapidsShuffleInternalManager.scala#L194).
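The fallback described above could look roughly like the existing external-shuffle check. A hedged sketch, assuming a helper inside the shuffle manager (the names here are illustrative, not the actual plugin code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.internal.SQLConf

// Fall back to the built-in sort shuffle when the plugin cannot serve the
// workload: external shuffle today, and AQE until getReaderForRange (#362)
// is implemented, since AQE reads would otherwise take the CPU path while
// the writes were stamped as RAPIDS blocks.
def shouldFallThrough(conf: SparkConf): Boolean = {
  val externalShuffle = conf.getBoolean("spark.shuffle.service.enabled", false)
  val adaptiveEnabled =
    conf.get(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false").toBoolean
  externalShuffle || adaptiveEnabled
}
```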

@abellina
Collaborator

@andygrove here's the issue I mentioned on the read side: #362. I think it's a separate issue that should also come with tests (somehow).

@andygrove andygrove changed the title [WIP] Update partitioning logic in ShuffledBatchRDD Update partitioning logic in ShuffledBatchRDD Jul 28, 2020
@andygrove
Contributor Author

build

@@ -203,7 +204,12 @@ abstract class RapidsShuffleInternalManagerBase(conf: SparkConf, isDriver: Boole
logWarning("Rapids Shuffle Plugin is falling back to SortShuffleManager because " +
"external shuffle is enabled")
}
fallThroughDueToExternalShuffle
val isAdaptiveEnabled = conf.get(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false").toBoolean
Collaborator

There is a stringDefault value in SQLConf.ADAPTIVE_EXECUTION_ENABLED which should let us stay up to date with the default if it changes. The main problem with that is that we are compiling this against one specific version of Spark, so if the default changes from one version to another, this will not stay up to date.

Contributor Author

This is something that we could potentially do in the shim layer though so that it does compile against each version we support. I'll look into doing that.

Collaborator

@andygrove we should wait on this PR then? Or you are thinking a different PR?

Contributor Author

I'll add it to this PR.

Contributor Author

Actually, this is a boolean config, so it always has a value and it doesn't make sense to check for a default. I'll remove the default part instead.

Contributor Author

This is updated now.
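The two options discussed in this thread can be sketched side by side, assuming Spark's SQLConf/ConfigEntry API (illustrative only, not the merged code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.internal.SQLConf

// Option 1: hard-code the default, which silently drifts if a future Spark
// release flips spark.sql.adaptive.enabled to true by default.
def isAdaptiveEnabledHardcoded(conf: SparkConf): Boolean =
  conf.get(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false").toBoolean

// Option 2: ask the ConfigEntry for its own default, which tracks whatever
// Spark version this module is compiled against.
def isAdaptiveEnabled(conf: SparkConf): Boolean =
  conf.get(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key,
    SQLConf.ADAPTIVE_EXECUTION_ENABLED.defaultValueString).toBoolean
```

As noted above, option 2 only tracks the one Spark version the module is compiled against, which is why the shim-layer approach was raised.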

abellina
abellina previously approved these changes Jul 30, 2020
@andygrove
Contributor Author

build

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Contributor Author

build

@andygrove
Contributor Author

There was a build failure with the 3.1.0 shim:

13:44:04  [ERROR] case class GpuBroadcastHashJoinExec(
13:44:04  [ERROR]            ^
13:44:04  [ERROR] /ansible-managed/jenkins-slave/slave4/workspace/spark/rapids_premerge-github/shims/spark310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/GpuShuffledHashJoinExec.scala:77: class GpuShuffledHashJoinExec needs to be abstract, since:
13:44:04  it has 2 unimplemented members.
13:44:04  /** As seen from class GpuShuffledHashJoinExec, the missing signatures are as follows.
13:44:04   *  For convenience, these are usable as stub implementations.
13:44:04   */
13:44:04    // Members declared in org.apache.spark.sql.execution.CodegenSupport
13:44:04    def inputRDDs(): Seq[org.apache.spark.rdd.RDD[org.apache.spark.sql.catalyst.InternalRow]] = ???
13:44:04    
13:44:04    // Members declared in org.apache.spark.sql.execution.joins.HashJoin
13:44:04    protected def prepareRelation(ctx: org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext): (String, Boolean) = ???

Seems to be related to this recent Spark commit:

apache/spark@ae82768

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Contributor Author

build

@andygrove andygrove merged commit 9d88311 into NVIDIA:branch-0.2 Aug 3, 2020
@andygrove andygrove deleted the update-shuffle-logic branch August 3, 2020 13:22
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Andy Grove <andygrove@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Andy Grove <andygrove@nvidia.com>