Fix the incomplete capture of SubqueryBroadcast #4630

Merged

Conversation

sperlingxx (Collaborator)

Fixes #4625

This PR fixes the logic for searching/matching potential SubqueryBroadcastExec nodes among the partition filters of scans. Previously, we missed one possible condition: as a subquery, a SubqueryBroadcastExec may be reused. That missing case leads to a crash when AQE is on; a sketch of the broadened match follows below.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
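
For context, here is a minimal sketch of the broadened lookup, assuming Spark 3.1+ class names; the helper name `subqueryBroadcasts` is hypothetical and the real change in the plugin's scan handling may differ in detail:

```scala
import org.apache.spark.sql.catalyst.expressions.{DynamicPruningExpression, Expression}
import org.apache.spark.sql.execution.{InSubqueryExec, ReusedSubqueryExec, SubqueryBroadcastExec}

// Hypothetical helper illustrating the fix: collect the SubqueryBroadcastExec
// behind each dynamic partition-pruning filter. The pre-fix logic matched only
// a bare SubqueryBroadcastExec; under AQE the subquery may instead arrive
// wrapped in a ReusedSubqueryExec, which is the case this PR adds.
def subqueryBroadcasts(partitionFilters: Seq[Expression]): Seq[SubqueryBroadcastExec] = {
  partitionFilters.flatMap { filter =>
    filter.collect {
      case DynamicPruningExpression(inSub: InSubqueryExec) =>
        inSub.plan match {
          case sb: SubqueryBroadcastExec => Some(sb)
          case ReusedSubqueryExec(sb: SubqueryBroadcastExec) => Some(sb) // previously missed
          case _ => None
        }
    }.flatten
  }
}
```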
@sperlingxx (Collaborator, Author)

build

@sperlingxx changed the title from "Fixed the incomplete capture of SubqueryBroadcast" to "Fix the incomplete capture of SubqueryBroadcast" on Jan 26, 2022
@jlowe added this to the Jan 10 - Jan 28 milestone on Jan 26, 2022
jlowe (Member) commented Jan 26, 2022

I verified this fixes the query 5 crash, so this is definitely better than where we are now. However I do wonder about the case where Spark tries to reuse a GpuBroadcastExec when the subsequent AdaptiveSparkPlan will not be on the GPU and it's not because of DPP. I suspect Spark will inject a ColumnarToRow to try to smooth that transition, but will the plugin realize it needs to replace that with a GpuColumnarToRow transition to make that actually work? And in that case, it's a bit unfortunate we would transfer the shuffle to the host, send it to the device, then send it right back to the host for further processing. It would make more sense to convert the GPU-formatted columnar data on the host to row-based data on the host directly, but we could optimize that in the future.

My primary concern is whether the scenario will work at all or whether we'll end up with a Spark ColumnarToRowExec trying to parse data from a GPU exchange.
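
To make that concern concrete, here is a minimal diagnostic sketch (not plugin code): it walks a physical plan and flags any Spark ColumnarToRowExec sitting directly on a GPU operator, which would try to read GPU-formatted batches as if they were Spark's own columnar format. Recognizing GPU nodes by a "Gpu" nodeName prefix is an assumption made for illustration.

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Collect every Spark ColumnarToRowExec whose child is a GPU operator: the
// hazard described above, where a CPU transition is asked to parse data
// produced in the GPU's columnar format. The "Gpu" prefix check is an
// illustrative assumption, not how the plugin actually identifies its nodes.
def findRowParseHazards(plan: SparkPlan): Seq[SparkPlan] = plan.collect {
  case c2r @ ColumnarToRowExec(child) if child.nodeName.startsWith("Gpu") => c2r
}
```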

@sameerz added the "bug" label (Something isn't working) on Jan 26, 2022
sperlingxx (Collaborator, Author) commented Jan 27, 2022

> I verified this fixes the query 5 crash, so this is definitely better than where we are now. However I do wonder about the case where Spark tries to reuse a GpuBroadcastExec when the subsequent AdaptiveSparkPlan will not be on the GPU and it's not because of DPP. I suspect Spark will inject a ColumnarToRow to try to smooth that transition, but will the plugin realize it needs to replace that with a GpuColumnarToRow transition to make that actually work? And in that case, it's a bit unfortunate we would transfer the shuffle to the host, send it to the device, then send it right back to the host for further processing. It would make more sense to convert the GPU-formatted columnar data on the host to row-based data on the host directly, but we could optimize that in the future.
>
> My primary concern is whether the scenario will work at all or whether we'll end up with a Spark ColumnarToRowExec trying to parse data from a GPU exchange.

Hi @jlowe, I checked the Spark code on the main branch. So far, a BroadcastExchange is introduced under two conditions:

  1. BroadcastJoins (including BroadcastHashJoin and BroadcastNestedLoopJoin), which require BroadcastDistributions.
  2. DynamicPruningFilters, which can reuse an existing BroadcastExchange.

I don't think we need to worry about the first condition, because BroadcastJoinMetas will fall back the build side once the join cannot run on the GPU.
As for the second condition, the AdaptiveSparkPlan that holds the BroadcastExchange is guaranteed to be the child of a SubqueryBroadcastExec. Therefore, we can prevent the C2R injection by making the AdaptiveSparkPlan a columnar plan if necessary. The risk is whether we can capture all occurrences of the pattern (SubqueryBroadcast(AdaptiveSparkPlan(BroadcastExchange))) before the adaptive subquery executes; a sketch of such a check follows below.
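
A minimal sketch of that capture, assuming Spark 3.1+ class names (BroadcastExchangeLike, AdaptiveSparkPlanExec.inputPlan); the predicate name is hypothetical:

```scala
import org.apache.spark.sql.execution.{SparkPlan, SubqueryBroadcastExec}
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec
import org.apache.spark.sql.execution.exchange.BroadcastExchangeLike

// Hypothetical predicate for the pattern discussed above:
// SubqueryBroadcast(AdaptiveSparkPlan(BroadcastExchange)). Plans matching it
// must be found before the adaptive subquery executes, so the enclosed
// AdaptiveSparkPlanExec can be treated as a columnar (GPU) plan and Spark
// does not inject a ColumnarToRow transition over a GPU exchange.
def isAdaptiveBroadcastSubquery(plan: SparkPlan): Boolean = plan match {
  case s: SubqueryBroadcastExec => s.child match {
    case a: AdaptiveSparkPlanExec => a.inputPlan.isInstanceOf[BroadcastExchangeLike]
    case _ => false
  }
  case _ => false
}
```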

@sperlingxx merged commit 30096e7 into NVIDIA:branch-22.02 on Jan 27, 2022
@sperlingxx deleted the hotfix_reused_subquerybroadcast branch on January 27, 2022 02:49
jlowe (Member) commented Jan 27, 2022

> I don't think we need to worry about the first condition, because BroadcastJoinMetas will fall back the build side once the join cannot run on the GPU.

I'm worried about reuse in light of AQE. Are you saying it's impossible for a GpuBroadcastExchange to be reused as a BroadcastExchange in the plan? If so, doesn't that imply the plan to compute the broadcast ends up getting re-executed in that case (and thus somewhat relates to #3709)?

Successfully merging this pull request may close these issues:

[BUG] NDS query 5 fails with AdaptiveSparkPlanExec assertion (#4625)