
[SPARK-34081][SQL] Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join #31145


wangyum (Member) commented on Jan 12, 2021:

What changes were proposed in this pull request?

We should not push LeftSemi/LeftAnti joins down through an Aggregate in some cases. For example:

spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.saveAsTable("t1")
spark.range(40000000L).selectExpr("id % 8000 as c", "id % 8000 as d").write.saveAsTable("t2")
spark.sql("SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM t2").explain

Before this PR:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[a#16L, b#17L], functions=[])
   +- HashAggregate(keys=[a#16L, b#17L], functions=[])
      +- HashAggregate(keys=[a#16L, b#17L], functions=[])
         +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#72]
            +- HashAggregate(keys=[a#16L, b#17L], functions=[])
               +- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L)], LeftSemi
                  :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) ASC NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS FIRST], false, 0
                  :  +- Exchange hashpartitioning(coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, [id=#65]
                  :     +- FileScan parquet default.t1[a#16L,b#17L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
                  +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) ASC NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS FIRST], false, 0
                     +- Exchange hashpartitioning(coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), ENSURE_REQUIREMENTS, [id=#66]
                        +- HashAggregate(keys=[c#18L, d#19L], functions=[])
                           +- Exchange hashpartitioning(c#18L, d#19L, 5), ENSURE_REQUIREMENTS, [id=#61]
                              +- HashAggregate(keys=[c#18L, d#19L], functions=[])
                                 +- FileScan parquet default.t2[c#18L,d#19L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:bigint,d:bigint>

After this PR:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[a#16L, b#17L], functions=[])
   +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#74]
      +- HashAggregate(keys=[a#16L, b#17L], functions=[])
         +- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L)], LeftSemi
            :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) ASC NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, [id=#67]
            :     +- HashAggregate(keys=[a#16L, b#17L], functions=[])
            :        +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#61]
            :           +- HashAggregate(keys=[a#16L, b#17L], functions=[])
            :              +- FileScan parquet default.t1[a#16L,b#17L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
            +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) ASC NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), ENSURE_REQUIREMENTS, [id=#68]
                  +- HashAggregate(keys=[c#18L, d#19L], functions=[])
                     +- Exchange hashpartitioning(c#18L, d#19L, 5), ENSURE_REQUIREMENTS, [id=#63]
                        +- HashAggregate(keys=[c#18L, d#19L], functions=[])
                           +- FileScan parquet default.t2[c#18L,d#19L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:bigint,d:bigint>

Why are the changes needed?

  1. Pushing a LeftSemi/LeftAnti join down through an Aggregate can hurt performance when the join cannot be planned as a broadcast join (see the configuration note after this list).
  2. It removes the user-added DISTINCT operator, e.g. in TPC-DS q38 and q87.
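
For context, whether the join can be planned as a broadcast join is driven by Spark's usual size-based threshold. A minimal illustration (standard Spark configuration, not code from this PR):

  // spark.sql.autoBroadcastJoinThreshold defaults to 10 MB; raising it makes
  // broadcast planning (and hence, after this PR, the pushdown) more likely.
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
  spark.sql("SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM t2").explain()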

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and benchmark tests.

| SQL  | Before this PR (seconds) | After this PR (seconds) |
|------|--------------------------|-------------------------|
| q14a | 660                      | 594                     |
| q14b | 660                      | 600                     |
| q38  | 55                       | 29                      |
| q87  | 66                       | 35                      |

Before this PR: [benchmark screenshot]

After this PR: [benchmark screenshot]

github-actions bot added the SQL label on Jan 12, 2021
SparkQA commented Jan 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38544/

SparkQA commented Jan 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38544/

SparkQA commented Jan 12, 2021

Test build #133957 has finished for PR 31145 at commit fc06d46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

This also changes the plans of q14a and q14b; does it cause a perf regression?

I agree it's unclear whether pushing the left semi/anti join down through the aggregate is beneficial, since both operators can reduce the data volume. I have a simple heuristic: look at the size metrics and check whether the join can be planned as a broadcast join. If it can, then pushing it down is very likely beneficial.

if aggs.forall(_.deterministic) && groups.nonEmpty &&
    !aggs.exists(ScalarSubquery.hasCorrelatedScalarSubquery) &&
    !(cond.nonEmpty && groups.equals(aggs) &&
      cond.forall(e => splitConjunctivePredicates(e).forall(_.isInstanceOf[EqualNullSafe]))) =>
Contributor review comment:

We can add a new method to JoinSelectionHelper:

  // Returns true if either hint-based (hintOnly = true) or size-based
  // (hintOnly = false) planning would pick a broadcast build side.
  def canPlanAsBroadcastHashJoin(join: Join, conf: SQLConf): Boolean = {
    getBroadcastBuildSide(join.left, join.right, join.joinType,
      join.hint, hintOnly = true, conf).isDefined ||
    getBroadcastBuildSide(join.left, join.right, join.joinType,
      join.hint, hintOnly = false, conf).isDefined
  }

and then use it here:
if ... && canPlanAsBroadcastHashJoin(join, conf)

wangyum changed the title from "[SPARK-24081][SQL] Should not pushdown LeftSemi/LeftAnti over Aggregate for some cases" to "[SPARK-24081][SQL] Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join" on Jan 13, 2021
wangyum (Member, Author) commented on Jan 13, 2021:

> This also changes the plans of q14a and q14b; does it cause a perf regression?

No.

wangyum changed the title from "[SPARK-24081][SQL] Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join" to "[SPARK-34081][SQL] Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join" on Jan 13, 2021
@@ -334,7 +334,7 @@ Results [3]: [brand_id#13, class_id#14, category_id#15]

Contributor review comment:

We don't need to change this file.

@@ -319,7 +319,7 @@ Results [3]: [brand_id#13, class_id#14, category_id#15]

Contributor review comment:

We don't need to change this file.

@@ -174,7 +174,7 @@ Results [3]: [c_last_name#17, c_first_name#16, d_date#14]

Contributor review comment:

Ditto.

@@ -174,7 +174,7 @@ Results [3]: [c_last_name#17, c_first_name#16, d_date#14]

Contributor review comment:

Ditto.

@@ -319,7 +319,7 @@ Results [3]: [brand_id#13, class_id#14, category_id#15]

Contributor review comment:

Ditto.

cloud-fan (Contributor) commented:

Since this changes the final plans of 4 TPC-DS queries, can we put benchmark results for all 4 of these queries in the PR description, even if some of them show no perf change?

wangyum (Member, Author) commented on Jan 13, 2021:

> Since this changes the final plans of 4 TPC-DS queries, can we put benchmark results for all 4 of these queries in the PR description, even if some of them show no perf change?

Yes, I have added the benchmark results to the PR description.

SparkQA commented Jan 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38602/

SparkQA commented Jan 13, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38602/

SparkQA commented Jan 13, 2021

Test build #134015 has finished for PR 31145 at commit 6f0ba0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

thanks, merging to master!

cloud-fan closed this in d3ea308 on Jan 14, 2021
wangyum deleted the SPARK-34081 branch on January 14, 2021