Fix the incomplete capture of SubqueryBroadcast #4630

Merged

Conversation

sperlingxx (Collaborator)

Fixes #4625

This PR fixes the logic for searching/matching potential SubqueryBroadcastExec nodes among the partition filters of scans. Previously, we missed one possible condition: as a subquery, a SubqueryBroadcastExec may be reused. That missing case leads to a crash when AQE is on; a sketch of the broadened match follows below.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
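
For context, here is a minimal sketch of the broadened lookup, assuming Spark 3.1+ class names; the helper name `subqueryBroadcasts` is hypothetical and the real change in the plugin's scan handling may differ in detail:

```scala
import org.apache.spark.sql.catalyst.expressions.{DynamicPruningExpression, Expression}
import org.apache.spark.sql.execution.{InSubqueryExec, ReusedSubqueryExec, SubqueryBroadcastExec}

// Hypothetical helper illustrating the fix: collect the SubqueryBroadcastExec
// behind each dynamic partition-pruning filter. The pre-fix logic matched only
// a bare SubqueryBroadcastExec; under AQE the subquery may instead arrive
// wrapped in a ReusedSubqueryExec, which is the case this PR adds.
def subqueryBroadcasts(partitionFilters: Seq[Expression]): Seq[SubqueryBroadcastExec] = {
  partitionFilters.flatMap { filter =>
    filter.collect {
      case DynamicPruningExpression(inSub: InSubqueryExec) =>
        inSub.plan match {
          case sb: SubqueryBroadcastExec => Some(sb)
          case ReusedSubqueryExec(sb: SubqueryBroadcastExec) => Some(sb) // previously missed
          case _ => None
        }
    }.flatten
  }
}
```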
@sperlingxx (Collaborator, Author)

build

@sperlingxx changed the title from "Fixed the incomplete capture of SubqueryBroadcast" to "Fix the incomplete capture of SubqueryBroadcast" on Jan 26, 2022
@jlowe added this to the Jan 10 - Jan 28 milestone on Jan 26, 2022
jlowe (Member) commented Jan 26, 2022

I verified this fixes the query 5 crash, so this is definitely better than where we are now. However I do wonder about the case where Spark tries to reuse a GpuBroadcastExec when the subsequent AdaptiveSparkPlan will not be on the GPU and it's not because of DPP. I suspect Spark will inject a ColumnarToRow to try to smooth that transition, but will the plugin realize it needs to replace that with a GpuColumnarToRow transition to make that actually work? And in that case, it's a bit unfortunate we would transfer the shuffle to the host, send it to the device, then send it right back to the host for further processing. It would make more sense to convert the GPU-formatted columnar data on the host to row-based data on the host directly, but we could optimize that in the future.

My primary concern is whether the scenario will work at all or whether we'll end up with a Spark ColumnarToRowExec trying to parse data from a GPU exchange.
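
To make that concern concrete, here is a minimal diagnostic sketch (not plugin code): it walks a physical plan and flags any Spark ColumnarToRowExec sitting directly on a GPU operator, which would try to read GPU-formatted batches as if they were Spark's own columnar format. Recognizing GPU nodes by a "Gpu" nodeName prefix is an assumption made for illustration.

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Collect every Spark ColumnarToRowExec whose child is a GPU operator: the
// hazard described above, where a CPU transition is asked to parse data
// produced in the GPU's columnar format. The "Gpu" prefix check is an
// illustrative assumption, not how the plugin actually identifies its nodes.
def findRowParseHazards(plan: SparkPlan): Seq[SparkPlan] = plan.collect {
  case c2r @ ColumnarToRowExec(child) if child.nodeName.startsWith("Gpu") => c2r
}
```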

@sameerz added the "bug" label (Something isn't working) on Jan 26, 2022
sperlingxx (Collaborator, Author) commented Jan 27, 2022

> I verified this fixes the query 5 crash, so this is definitely better than where we are now. However I do wonder about the case where Spark tries to reuse a GpuBroadcastExec when the subsequent AdaptiveSparkPlan will not be on the GPU and it's not because of DPP. I suspect Spark will inject a ColumnarToRow to try to smooth that transition, but will the plugin realize it needs to replace that with a GpuColumnarToRow transition to make that actually work? And in that case, it's a bit unfortunate we would transfer the shuffle to the host, send it to the device, then send it right back to the host for further processing. It would make more sense to convert the GPU-formatted columnar data on the host to row-based data on the host directly, but we could optimize that in the future.
>
> My primary concern is whether the scenario will work at all or whether we'll end up with a Spark ColumnarToRowExec trying to parse data from a GPU exchange.

Hi @jlowe, I checked the Spark code on the main branch. So far, a BroadcastExchange is introduced under two conditions:

  1. BroadcastJoins (including BroadcastHashJoin and BroadcastNestedLoopJoin), which require BroadcastDistributions.
  2. DynamicPruningFilters, which can reuse an existing BroadcastExchange.

I don't think we need to worry about the first condition, because BroadcastJoinMetas will fall back the build side once the join cannot run on the GPU.
As for the second condition, the AdaptiveSparkPlan that holds the BroadcastExchange is guaranteed to be the child of a SubqueryBroadcastExec. Therefore, we can prevent the C2R injection by making the AdaptiveSparkPlan a columnar plan if necessary. The risk is whether we can capture all occurrences of the pattern (SubqueryBroadcast(AdaptiveSparkPlan(BroadcastExchange))) before the adaptive subquery executes; a sketch of such a check follows below.
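
A minimal sketch of that capture, assuming Spark 3.1+ class names (BroadcastExchangeLike, AdaptiveSparkPlanExec.inputPlan); the predicate name is hypothetical:

```scala
import org.apache.spark.sql.execution.{SparkPlan, SubqueryBroadcastExec}
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec
import org.apache.spark.sql.execution.exchange.BroadcastExchangeLike

// Hypothetical predicate for the pattern discussed above:
// SubqueryBroadcast(AdaptiveSparkPlan(BroadcastExchange)). Plans matching it
// must be found before the adaptive subquery executes, so the enclosed
// AdaptiveSparkPlanExec can be treated as a columnar (GPU) plan and Spark
// does not inject a ColumnarToRow transition over a GPU exchange.
def isAdaptiveBroadcastSubquery(plan: SparkPlan): Boolean = plan match {
  case s: SubqueryBroadcastExec => s.child match {
    case a: AdaptiveSparkPlanExec => a.inputPlan.isInstanceOf[BroadcastExchangeLike]
    case _ => false
  }
  case _ => false
}
```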

@sperlingxx merged commit 30096e7 into NVIDIA:branch-22.02 on Jan 27, 2022
@sperlingxx deleted the hotfix_reused_subquerybroadcast branch on January 27, 2022 02:49
jlowe (Member) commented Jan 27, 2022

> I don't think we need to worry about the first condition, because BroadcastJoinMetas will fall back the build side once the join cannot run on the GPU.

I'm worried about reuse in light of AQE. Are you saying it's impossible for a GpuBroadcastExchange to be reused as a BroadcastExchange in the plan? If so, doesn't that imply the plan to compute the broadcast ends up getting re-executed in that case (and thus somewhat relates to #3709)?

Successfully merging this pull request may close these issues:

[BUG] NDS query 5 fails with AdaptiveSparkPlanExec assertion (#4625)