-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable UCX + AQE #613
Enable UCX + AQE #613
Conversation
...src/main/scala/org/apache/spark/sql/rapids/shims/spark310/RapidsShuffleInternalManager.scala
Show resolved
Hide resolved
Signed-off-by: Andy Grove <andygrove@nvidia.com>
build |
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
build |
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
build |
Signed-off-by: Andy Grove <andygrove@nvidia.com>
…supported Signed-off-by: Andy Grove <andygrove@nvidia.com>
build |
Signed-off-by: Andy Grove <andygrove@nvidia.com>
build |
shims/spark301/src/main/scala/com/nvidia/spark/rapids/shims/spark301/Spark301Shims.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsMeta.scala
Outdated
Show resolved
Hide resolved
exchanges.foreach(_.willNotWorkOnGpu(consistentExchangeMessage)) | ||
} | ||
} else if (exchanges.isEmpty) { | ||
// we only have query stages, and we cannot do anything to modify them at this point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the side effect of not being able to tag them if they already started executing? I like the comments here, but they almost point at some indeterminate state where work already started.
// since they already started to execute | ||
} else { | ||
// there is a mix of ShuffleExchangeExec and ShuffleQueryStageExec so we need to tag any | ||
// ShuffleExchangeExec nodes based on the other nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if any exchange or query exchange is not on the GPU, all exchanges and query exchanges must fallback to the CPU, that feed to the join? Just making sure that's the end result.
Should that happen also for queryStages
? E.g. if exchanges
are on the cpu, but queryStages
are on the gpu, the following code will just tag the exchanges
again for the cpu (which seems unnecessary, but I could be misreading it). Wonder if we need to also tell the queryStages
that they can't work because of the consistent hashing issue.
Oh @andygrove so there are changes to the joins here too. Should we change the title of the PR to reflect? |
Yes, the scope definitely increased here. I should probably split this up into two separate PRs. |
I am ok with it going in one, just think it could be renamed. Especially if you think it’s good to go. Maybe “shuffle fixes for AQE”? |
I went ahead and split it out anyway, so this PR now builds on #676 I will try and address your comments over there. |
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@abellina This PR now only has the necessary changes for UCX support. |
build |
1 similar comment
build |
* Support UCX when AQE is enabled Signed-off-by: Andy Grove <andygrove@nvidia.com> * Bug fix: Shuffle exchange needs to look at tagged plan Signed-off-by: Andy Grove <andygrove@nvidia.com> * re-implement makeShuffleConsistent to support AQE Signed-off-by: Andy Grove <andygrove@nvidia.com> * update comments Signed-off-by: Andy Grove <andygrove@nvidia.com> * revert tpcds bench changes Signed-off-by: Andy Grove <andygrove@nvidia.com> * make comments more consistent Signed-off-by: Andy Grove <andygrove@nvidia.com> * unit test Signed-off-by: Andy Grove <andygrove@nvidia.com> * code refinement Signed-off-by: Andy Grove <andygrove@nvidia.com> * remove redundant repartitioning from test Signed-off-by: Andy Grove <andygrove@nvidia.com> * fix incorrect logic in test for determining which spark versions are supported Signed-off-by: Andy Grove <andygrove@nvidia.com> * fix scalastyle Signed-off-by: Andy Grove <andygrove@nvidia.com> * Remove changes unrelated to supporting UCX Signed-off-by: Andy Grove <andygrove@nvidia.com>
* Support UCX when AQE is enabled Signed-off-by: Andy Grove <andygrove@nvidia.com> * Bug fix: Shuffle exchange needs to look at tagged plan Signed-off-by: Andy Grove <andygrove@nvidia.com> * re-implement makeShuffleConsistent to support AQE Signed-off-by: Andy Grove <andygrove@nvidia.com> * update comments Signed-off-by: Andy Grove <andygrove@nvidia.com> * revert tpcds bench changes Signed-off-by: Andy Grove <andygrove@nvidia.com> * make comments more consistent Signed-off-by: Andy Grove <andygrove@nvidia.com> * unit test Signed-off-by: Andy Grove <andygrove@nvidia.com> * code refinement Signed-off-by: Andy Grove <andygrove@nvidia.com> * remove redundant repartitioning from test Signed-off-by: Andy Grove <andygrove@nvidia.com> * fix incorrect logic in test for determining which spark versions are supported Signed-off-by: Andy Grove <andygrove@nvidia.com> * fix scalastyle Signed-off-by: Andy Grove <andygrove@nvidia.com> * Remove changes unrelated to supporting UCX Signed-off-by: Andy Grove <andygrove@nvidia.com>
[auto-merge] bot-auto-merge-branch-22.10 to branch-22.12 [skip ci] [bot]
Enables UCX + AQE.
This closes #455