Fix inconsistencies in AQE support for broadcast joins #1042

Merged
10 commits merged into NVIDIA:branch-0.3 from broadcast-join-aqe
Oct 29, 2020

Conversation

andygrove
Contributor

@andygrove andygrove commented Oct 29, 2020

This PR fixes inconsistencies in the logic that decides whether broadcast joins can run on the GPU when AQE is enabled.

The existing logic for BroadcastNestedLoopJoin expected the build side to be a GpuBroadcastExchangeExecBase and failed if the build side was a BroadcastQueryStageExec, which is the case when AQE is enabled and the build side has already been materialized.
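
The failure mode described above can be illustrated with a minimal sketch. The types and function names below are simplified stand-ins invented for illustration, not the real Spark or plugin classes, and the recursion through the wrapper is an assumption about the general shape of the fix rather than the code in this PR.

```scala
// Hypothetical, simplified stand-ins for plan nodes (not the real classes).
sealed trait Plan
case class CpuBroadcastExchange(name: String) extends Plan
case class GpuBroadcastExchange(name: String) extends Plan
// With AQE, an already materialized broadcast is wrapped in a query stage node.
case class BroadcastQueryStage(plan: Plan) extends Plan

object BuildSideCheck {
  // Strict check: only recognizes a bare GPU broadcast exchange,
  // so it rejects a GPU exchange that AQE has wrapped in a query stage.
  def strictIsGpuBroadcast(buildSide: Plan): Boolean = buildSide match {
    case _: GpuBroadcastExchange => true
    case _ => false
  }

  // AQE-aware check: look through the query stage wrapper first.
  def aqeAwareIsGpuBroadcast(buildSide: Plan): Boolean = buildSide match {
    case BroadcastQueryStage(inner) => aqeAwareIsGpuBroadcast(inner)
    case _: GpuBroadcastExchange => true
    case _ => false
  }
}
```

With these stand-ins, `strictIsGpuBroadcast(BroadcastQueryStage(GpuBroadcastExchange("b")))` is false while the AQE-aware version is true for the same input, which mirrors the mismatch the paragraph describes.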

Although the logic for BroadcastHashJoin took AQE into account, the code was inconsistent and incomplete, so this PR addresses that too.

Changes in this PR:

  • New GpuBroadcastJoinMeta base class, so that the logic for determining when broadcast nested-loop and hash joins can run on the GPU is shared between the two operators.
  • New unit tests for GpuBroadcastNestedLoopJoin with AQE on and off to confirm that the fixes work.
  • I took the liberty of renaming TestUtils.operatorCount to findOperators since the name was misleading (it returns a list of operators, not a count) and this change only impacted a few lines of existing code.
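
As a rough illustration of the shared base class idea in the first bullet, the sketch below puts one build-side rule in a trait that two join metas reuse. All names here (`BroadcastJoinMetaSketch`, `HashJoinMetaSketch`, `NestedLoopJoinMetaSketch`, `canBuildSideRunOnGpu`) are hypothetical and the plan types are simplified stand-ins; this is not the actual GpuBroadcastJoinMeta code.

```scala
// Hypothetical, simplified stand-ins for plan nodes (not the real classes).
sealed trait Plan
case object GpuBroadcastExchange extends Plan
case object CpuBroadcastExchange extends Plan
final case class BroadcastQueryStage(plan: Plan) extends Plan

// The build-side rule is written once and inherited by both join metas,
// so the two operators cannot drift apart in how they treat AQE wrappers.
trait BroadcastJoinMetaSketch {
  def buildSide: Plan

  def canBuildSideRunOnGpu: Boolean =
    unwrap(buildSide) == GpuBroadcastExchange

  // Strip any number of AQE query stage wrappers.
  private def unwrap(p: Plan): Plan = p match {
    case BroadcastQueryStage(inner) => unwrap(inner)
    case other => other
  }
}

final case class HashJoinMetaSketch(buildSide: Plan) extends BroadcastJoinMetaSketch
final case class NestedLoopJoinMetaSketch(buildSide: Plan) extends BroadcastJoinMetaSketch
```

Both sketch metas give the same answer for the same build side, whether or not it is wrapped in a query stage.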

I manually tested these changes with TPC-DS q90 and confirmed that the query now runs without error.

This closes #1035

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove self-assigned this Oct 29, 2020
@andygrove andygrove changed the title from "[WIP] Fix inconsistencies in AQE support for broadcast joins" to "Fix inconsistencies in AQE support for broadcast joins" Oct 29, 2020
@tgravescs
Collaborator

Also, can we update the description or the issue with your findings on what the problem is?

revans2
revans2 previously approved these changes Oct 29, 2020
Collaborator

@revans2 revans2 left a comment

Looks good to me

@revans2
Collaborator

revans2 commented Oct 29, 2020

build

@andygrove
Contributor Author

build

@andygrove andygrove merged commit ee0fff2 into NVIDIA:branch-0.3 Oct 29, 2020
@andygrove andygrove deleted the broadcast-join-aqe branch October 29, 2020 21:23
@@ -119,6 +119,13 @@ class Spark300Shims extends SparkShims {
    }
  }

  override def isGpuBroadcastNestedLoopJoin(plan: SparkPlan): Boolean = {
    plan match {
      case _: GpuBroadcastNestedLoopJoinExecBase => true
Collaborator

Sorry for the late comment. I was looking at something else and realized this doesn't need to be in the shim: you are using the Base class here, so there is nothing shim-specific about it.
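
The point of this comment can be sketched as follows. The class and object names are hypothetical stand-ins, not the plugin's real types; the assumption is only that each Spark version supplies its own subclass of a shared base type, so a check against that base type can live in common code.

```scala
// Hypothetical stand-ins: one shared base type with version-specific subclasses.
trait Plan
trait NestedLoopJoinBase extends Plan
final class NestedLoopJoin300 extends NestedLoopJoinBase // e.g. built for one Spark version
final class NestedLoopJoin310 extends NestedLoopJoinBase // e.g. built for another
final class OtherOperator extends Plan

// Because the match only mentions the shared base type, this helper does not
// depend on any version-specific class and needs no per-version override.
object CommonChecks {
  def isGpuBroadcastNestedLoopJoin(plan: Plan): Boolean = plan match {
    case _: NestedLoopJoinBase => true
    case _ => false
  }
}
```

In this sketch, `CommonChecks.isGpuBroadcastNestedLoopJoin` accepts either versioned subclass and rejects unrelated operators without any shim involvement.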

@sameerz sameerz added the bug Something isn't working label Oct 31, 2020
@sameerz sameerz added this to the Oct 26 - Nov 6 milestone Oct 31, 2020
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
* Fix inconsistencies with AQE support for broadcast joins

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* code cleanup and change test behavior for Spark 3.0.0

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix inconsistency

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix test failure with Spark 3.1.0

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix inconsistency

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix imports

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix regression

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* tighten up rules

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Move GpuBroadcastJoinMeta to com.nvidia.spark.rapids package

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Move GpuBroadcastJoinMeta to com.nvidia.spark.rapids package

Signed-off-by: Andy Grove <andygrove@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1042)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[BUG] TPC-DS query 90 with AQE enabled fails with doExecuteBroadcast exception
4 participants