[SPARK-31809][SQL] Infer IsNotNull from some special equality join keys #28642

wangyum · 2020-05-26T09:52:56Z

What changes were proposed in this pull request?

We can infer IsNotNull from some special equality join keys. For example:

CREATE TABLE t1(a string, b string, c string) using parquet;
CREATE TABLE t2(a string, b decimal(38, 18), c string) using parquet;
SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.a, t1.b)=t2.a; -- case 1
SELECT t1.* FROM t1 JOIN t2 ON CAST(t1.a AS DOUBLE)=CAST(t2.b AS DOUBLE); -- case 2

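Expressions such as coalesce(t1.a, t1.b) or CAST(t1.a AS DOUBLE) may produce many null values. Null join keys never match in an inner equi-join, but they are all shuffled to the same partition, which leads to join skew.
After this PR, the plan for case 1 filters those rows out before the shuffle: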

== Physical Plan ==
*(5) Project [a#5, b#6, c#7]
+- *(5) SortMergeJoin [coalesce(a#5, b#6)], [a#8], Inner
   :- *(2) Sort [coalesce(a#5, b#6) ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(coalesce(a#5, b#6), 200), true, [id=#44]
   :     +- *(1) Filter isnotnull(coalesce(a#5, b#6))
   :        +- Scan hive default.t1 [a#5, b#6, c#7], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5, b#6, c#7], Statistics(sizeInBytes=8.0 EiB)
   +- *(4) Sort [a#8 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(a#8, 200), true, [id=#52]
         +- *(3) Filter isnotnull(a#8)
            +- Scan hive default.t2 [a#8], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#8, b#9, c#10], Statistics(sizeInBytes=8.0 EiB)
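The skew argument can be illustrated with a small, self-contained Python model. This is only a sketch: `partition_of` is a toy partitioner invented for the example (not Spark's hash function), the partition count and the data are made up, and the only property it relies on is that equal keys, including NULL, always land in the same partition.

```python
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical; the plan above uses 200


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Toy partitioner: every None (NULL) key lands in partition 0."""
    if key is None:
        return 0
    return sum(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions=NUM_PARTITIONS):
    """Count how many rows each partition receives after the shuffle."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def inner_join(left_keys, right_keys):
    """Inner equi-join on keys with SQL semantics: NULL never equals anything."""
    right = Counter(k for k in right_keys if k is not None)
    return sorted(
        k for k in left_keys if k is not None for _ in range(right[k])
    )


# Made-up data: most left-side join keys evaluate to NULL.
left = [None] * 1000 + ["a", "b", "c", "d"]
right = ["a", "b", "x", None]
left_filtered = [k for k in left if k is not None]  # the inferred IsNotNull filter

skewed = partition_sizes(left)             # one partition holds the NULL rows
balanced = partition_sizes(left_filtered)  # rows are spread evenly

# The filter changes the shuffle sizes but not the join result.
same_result = inner_join(left, right) == inner_join(left_filtered, right)
```

In this model the unfiltered shuffle sends 1001 of 1004 rows to a single partition, while the filtered shuffle sends one row to each partition and the join output is unchanged.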

Why are the changes needed?

Avoid join skew in some cases.
Hive already supports this optimization.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and benchmark tests:
Case 1: [figure: before this PR vs. after this PR]

Case 2: [figure: before this PR vs. after this PR]

SparkQA · 2020-05-26T12:14:26Z

Test build #123119 has finished for PR 28642 at commit d657299.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-06-05T06:58:27Z

...rc/test/scala/org/apache/spark/sql/catalyst/optimizer/InferFiltersFromConstraintsSuite.scala

+    testConstraintsAfterJoin(
+      testRelation.subquery('left),
+      testRelation.subquery('right),
+      testRelation.where(IsNotNull(Coalesce(Seq('a, 'b)))).subquery('left),


hive> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.a, t1.b)=t2.a;
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:t1
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:t1
          TableScan
            alias: t1
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: COALESCE(a,b) is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: a (type: string), b (type: string), c (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 COALESCE(_col0,_col1) (type: string)
                    1 _col0 (type: string)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t2
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: a is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: a (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 COALESCE(_col0,_col1) (type: string)
                    1 _col0 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

wangyum · 2020-06-05T06:59:40Z

...rc/test/scala/org/apache/spark/sql/catalyst/optimizer/InferFiltersFromConstraintsSuite.scala

+
+  test("Should not infer IsNotNull for non null-intolerant child from same table") {
+    comparePlans(Optimize.execute(testRelation.where(Coalesce(Seq('a, 'b)) === 'c).analyze),
+      testRelation.where(Coalesce(Seq('a, 'b)) === 'c && IsNotNull('c)).analyze)


hive> EXPLAIN SELECT t1.* FROM t1 WHERE coalesce(t1.a, t1.b)=t1.c;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: t1
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Filter Operator
            predicate: (COALESCE(a,b) = c) (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: a (type: string), b (type: string), c (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              ListSink
Time taken: 4.026 seconds, Fetched: 20 row(s)
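A short Python model may help show why the predicate coalesce(a, b) = c implies IsNotNull(c) but says nothing about a or b individually. It is a sketch with made-up rows: `coalesce` and `sql_equals` are hypothetical helpers that mimic SQL's COALESCE and three-valued equality, with None standing in for NULL.

```python
def coalesce(*values):
    """Return the first non-None argument, or None if every argument is None."""
    for v in values:
        if v is not None:
            return v
    return None


def sql_equals(x, y):
    """Three-valued equality: the result is None (NULL) if either side is None."""
    if x is None or y is None:
        return None
    return x == y


# Made-up rows covering the interesting NULL combinations.
rows = [
    {"a": None, "b": "x", "c": "x"},   # a is NULL, yet the predicate is true
    {"a": "y", "b": None, "c": "y"},   # b is NULL, yet the predicate is true
    {"a": None, "b": None, "c": "z"},  # coalesce is NULL, so the row is dropped
    {"a": "q", "b": "q", "c": None},   # c is NULL, so the row is dropped
]

# A WHERE clause keeps a row only when the predicate evaluates to true.
kept = [r for r in rows if sql_equals(coalesce(r["a"], r["b"]), r["c"]) is True]
```

Every surviving row has a non-NULL c, while rows with a NULL a or a NULL b can still survive, so Coalesce cannot be treated as null-intolerant with respect to its children.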

wangyum · 2020-06-05T07:08:15Z

retest this please

SparkQA · 2020-06-05T10:52:05Z

Test build #123553 has finished for PR 28642 at commit a5f52a8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-07T20:08:21Z

Test build #123607 has finished for PR 28642 at commit 65cd324.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-06-08T03:36:11Z

sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala

@@ -1039,7 +1039,7 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
    val pythonEvals = collect(joinNode.get) {
      case p: BatchEvalPythonExec => p
    }
-    assert(pythonEvals.size == 2)
+    assert(pythonEvals.size == 4)


@HyukjinKwon I'm not sure whether this change can optimize Python UDFs.

Yeah, I don't think it's more efficient to have more BatchEvalPythonExec nodes. They require more Python executions, which aren't trivial.

I quickly checked:

== Physical Plan ==
*(3) Project [a#225, b#226, c#236, d#237]
+- *(3) BroadcastHashJoin [cast(pythonUDF0#256 as int)], [cast(pythonUDF0#257 as int)], Inner, BuildRight
   :- BatchEvalPython [udf(cast(a#225 as string))], [pythonUDF0#256]
   :  +- *(1) Project [_1#220 AS a#225, _2#221 AS b#226]
   :     +- *(1) Project [_1#220, _2#221]
   :        +- *(1) Filter isnotnull(cast(pythonUDF0#254 as int))
   :           +- BatchEvalPython [udf(cast(_1#220 as string))], [pythonUDF0#254]
   :              +- LocalTableScan [_1#220, _2#221]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(cast(input[2, string, true] as int) as bigint))), [id=#140]
      +- BatchEvalPython [udf(cast(c#236 as string))], [pythonUDF0#257]
         +- *(2) Project [_1#231 AS c#236, _2#232 AS d#237]
            +- *(2) Project [_1#231, _2#232]
               +- *(2) Filter isnotnull(cast(pythonUDF0#255 as int))
                  +- BatchEvalPython [udf(cast(_1#231 as string))], [pythonUDF0#255]
                     +- LocalTableScan [_1#231, _2#232]

We should probably avoid inferring the is-not-null filter in this case.
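The cost concern can be made concrete with a rough Python model of one join side. It is a sketch under simplifying assumptions: `CountingUdf` is a hypothetical stand-in for the Python UDF, the rows are made up, and batching and serialization costs are ignored. It only counts how often the UDF runs when the inferred filter evaluates it once and the join key evaluates it again.

```python
class CountingUdf:
    """Stand-in for a Python UDF that records how many times it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        # Return None (NULL) for inputs that are not plain digits.
        return value if value.isdigit() else None


def join_keys_without_filter(rows, udf):
    """One evaluation node: the UDF runs once per input row."""
    return [udf(r) for r in rows]


def join_keys_with_inferred_filter(rows, udf):
    """Two evaluation nodes: one below the filter and one for the join key."""
    survivors = [r for r in rows if udf(r) is not None]
    return [udf(r) for r in survivors]


rows = ["1", "2", "oops", "4"]  # made-up input for one side of the join

plain = CountingUdf()
keys_plain = join_keys_without_filter(rows, plain)

filtered = CountingUdf()
keys_filtered = join_keys_with_inferred_filter(rows, filtered)
```

With these four rows the UDF runs 4 times without the filter and 7 times with it (4 for the filter plus 3 for the surviving rows), which matches the doubling of BatchEvalPythonExec nodes asserted in JoinSuite.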

HyukjinKwon · 2020-06-15T05:38:46Z

retest this please

SparkQA · 2020-06-15T07:05:02Z

Test build #124032 has finished for PR 28642 at commit 65cd324.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-06-15T07:10:43Z

retest this please

HyukjinKwon · 2020-06-15T08:50:01Z

cc @cloud-fan FYI

SparkQA · 2020-06-15T12:21:57Z

Test build #124038 has finished for PR 28642 at commit 65cd324.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions · 2020-09-24T00:47:55Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

SparkQA · 2021-08-10T09:14:31Z

Test build #142271 has finished for PR 28642 at commit 65cd324.

This patch fails Python style tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2021-08-10T10:50:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46779/

SparkQA · 2021-08-10T10:59:43Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46779/

SparkQA · 2021-08-10T14:58:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46784/

SparkQA · 2021-08-10T18:29:21Z

Test build #142276 has finished for PR 28642 at commit b100902.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-11T00:13:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46806/

SparkQA · 2021-08-11T00:50:58Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46806/

SparkQA · 2021-08-11T04:09:24Z

Test build #142299 has finished for PR 28642 at commit 3643e3f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-16T06:27:51Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46988/

SparkQA · 2021-08-16T07:08:22Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46988/

SparkQA · 2021-10-27T12:54:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49129/

SparkQA · 2021-10-27T13:38:25Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49129/

wangyum · 2021-10-27T13:39:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+  private def resultMayBeNull(e: Expression): Boolean = e match {
+    case Cast(child, dataType, _, _) => !Cast.canUpCast(child.dataType, dataType)
+    case _: Coalesce => true
+    case _ => false
+  }


@cloud-fan @HyukjinKwon This will not infer IsNotNull for all equality join keys. For example:

| Infer | Will not infer |
| --- | --- |
| cast(strCol AS double) = doubleCol | upper(strCol) = upperStrCol |

SparkQA · 2021-10-27T15:01:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49130/

SparkQA · 2021-10-27T15:46:55Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49130/

SparkQA · 2021-10-27T16:17:49Z

Test build #144659 has finished for PR 28642 at commit a9eb7de.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-27T22:31:30Z

Test build #144661 has finished for PR 28642 at commit 7796e5c.

This patch fails from timeout after a configured wait of 500m.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-28T01:07:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49149/

SparkQA · 2021-10-28T02:06:42Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49149/

SparkQA · 2021-10-28T05:28:21Z

Test build #144680 has finished for PR 28642 at commit c88566a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-28T07:22:16Z

Test build #144703 has finished for PR 28642 at commit 919492e.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-28T09:20:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49172/

SparkQA · 2021-10-28T10:18:02Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49172/

cloud-fan · 2021-10-29T09:13:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+  private def resultMayBeNull(exp: Expression): Boolean = exp match {
+    case e if !e.nullable => false
+    case Cast(child: Attribute, dataType, _, _) => !Cast.canUpCast(child.dataType, dataType)
+    case c: Coalesce if c.children.forall(_.isInstanceOf[Attribute]) => true


Can't we rely on the NullIntolerant interface?

We can already infer IsNotNull through NullIntolerant expressions. For example:

spark.sql("create table t1 (id string, value int) using parquet")
spark.sql("create table t2 (id int, value int) using parquet")
spark.sql("select * from t1 join t2 on t1.id = t2.id").explain("extended")

== Optimized Logical Plan ==
Join Inner, (cast(id#0 as int) = id#2)
:- Filter isnotnull(id#0)
:  +- Relation default.t1[id#0,value#1] parquet
+- Filter isnotnull(id#2)
   +- Relation default.t2[id#2,value#3] parquet

Cast is NullIntolerant, so we can already infer IsNotNull(t1.id). But I also want to infer isnotnull(cast(t1.id as int)), because t1.id may contain many strings that cannot be cast to int.

cloud-fan · 2021-10-29T09:14:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -1215,6 +1215,15 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan]
    }
  }

+  // Whether the result of this expression may be null. For example: CAST(strCol AS double)
+  // We will infer an IsNotNull expression for this expression to avoid skew join.


Is it better to infer IsNotNull(col) instead of IsNotNull(CAST(col AS other_type))?

We can already infer IsNotNull(col). For example:

spark.sql("create table t1 (id string, value int) using parquet")
spark.sql("create table t2 (id int, value int) using parquet")
spark.sql("select * from t1 join t2 on t1.id = t2.id").explain("extended")

Before this PR:

== Optimized Logical Plan ==
Join Inner, (cast(id#0 as int) = id#2)
:- Filter isnotnull(id#0)
:  +- Relation default.t1[id#0,value#1] parquet
+- Filter isnotnull(id#2)
   +- Relation default.t2[id#2,value#3] parquet

After this PR:

== Optimized Logical Plan ==
Join Inner, (cast(id#0 as int) = id#2)
:- Filter (isnotnull(id#0) AND isnotnull(cast(id#0 as int)))
:  +- Relation default.t1[id#0,value#1] parquet
+- Filter isnotnull(id#2)
   +- Relation default.t2[id#2,value#3] parquet

Inferring isnotnull(cast(t1.id as int)) may filter out many strings that cannot be cast to int.
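The difference between the two filters can be sketched in plain Python. This is a model, not Spark code: `cast_to_int` is a hypothetical helper that assumes a cast which returns NULL (None) for strings it cannot parse, and Python's `int` parsing rules differ from Spark's in corner cases. The ids are made up.

```python
def cast_to_int(value):
    """Model of a lenient string-to-int cast: NULL in, NULL out; bad input gives NULL."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


# Made-up t1.id values: a mix of valid numbers, NULL, and unparseable strings.
ids = ["1", "2", None, "abc", "", "x9", "3"]

# Filter inferred before the PR: isnotnull(id)
after_isnotnull_col = [v for v in ids if v is not None]

# Additional filter inferred by the PR: isnotnull(cast(id as int))
after_isnotnull_cast = [v for v in ids if cast_to_int(v) is not None]

# Rows that pass the first filter but still produce a NULL join key.
still_null_keys = [v for v in after_isnotnull_col if cast_to_int(v) is None]
```

Here isnotnull(id) removes only the one NULL row, leaving three rows whose join key is still NULL after the cast, whereas isnotnull(cast(id as int)) keeps only "1", "2", and "3".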

github-actions · 2022-02-10T00:12:57Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

probot-autolabeler bot added the SQL label May 26, 2020

probot-autolabeler bot added the PYTHON label Jun 5, 2020

wangyum changed the title ~~[SPARK-31809][SQL] Infer IsNotNull for all children of NullIntolerant expression~~ [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition Jun 5, 2020

apache deleted a comment from SparkQA Jun 5, 2020

github-actions bot added the Stale label Sep 24, 2020

wangyum closed this Sep 24, 2020

wangyum reopened this Aug 10, 2021

wangyum removed the Stale label Aug 10, 2021

wangyum changed the title ~~[SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition~~ [SPARK-31809][SQL] Infer IsNotNull from join condition Aug 10, 2021

[SPARK-31809][SQL] Infer IsNotNull from join condition

a9eb7de

wangyum changed the title ~~[SPARK-31809][SQL] Infer IsNotNull from join condition~~ [SPARK-31809][SQL] Infer IsNotNull from some special equality join keys Oct 27, 2021

Fix

7796e5c

Fix test in JoinSuit

c88566a

tanelk reviewed Oct 28, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

nullability

919492e

cloud-fan reviewed Oct 29, 2021

wangyum mentioned this pull request Nov 5, 2021

[SPARK-36290][SQL] Pull out join condition #33522

Closed

github-actions bot added the Stale label Feb 10, 2022

github-actions bot closed this Feb 11, 2022

wangyum mentioned this pull request Oct 2, 2022

[SPARK-36290][SQL] Pull out complex join condition #38071

Closed
