[SPARK-37199][SQL] Add deterministic field to QueryPlan #34470

somani · 2021-11-02T18:16:22Z

What changes were proposed in this pull request?

We have a deterministic field in Expressions to check if an expression is deterministic, but we do not have a similar field in QueryPlan.

We sometimes need such a check on a QueryPlan, for example in InlineCTE.

This proposal is to add a deterministic field to QueryPlan.

More details in this document: https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#heading=h.4cz970y1mk93
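As a rough illustration of the proposal, here is a minimal sketch of a plan-level deterministic flag derived from a node's expressions and its children. The types below (Expr, Column, RandCall, Plan, Scan, Project) are hypothetical stand-ins rather than Catalyst's real classes, and the recursion over children is an assumption based on the doc comment quoted later in this thread, since the quoted diff line is cut off.

```scala
// Toy stand-ins for Catalyst's Expression and QueryPlan; every name here is hypothetical.
trait Expr { def deterministic: Boolean }
case class Column(name: String) extends Expr { val deterministic = true }
case object RandCall extends Expr { val deterministic = false }

trait Plan {
  def expressions: Seq[Expr]
  def children: Seq[Plan]
  // True only if every expression in this node and in every descendant node is deterministic.
  // Assumption: children are folded in recursively, as the doc comment in the diff suggests.
  lazy val deterministic: Boolean =
    expressions.forall(_.deterministic) && children.forall(_.deterministic)
}

case class Scan(table: String) extends Plan {
  val expressions: Seq[Expr] = Nil
  val children: Seq[Plan] = Nil
}
case class Project(expressions: Seq[Expr], child: Plan) extends Plan {
  val children: Seq[Plan] = Seq(child)
}

// A projection over plain columns, and one whose child projects a random value.
val pure = Project(Seq(Column("c1")), Scan("t1"))
val impure = Project(Seq(Column("c1")), Project(Seq(RandCall), Scan("t1")))
```

In this toy model, a single non-deterministic expression anywhere below a node makes that node report false, which is the property a rule such as InlineCTE would presumably want to check before duplicating a subtree.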

Why are the changes needed?

We sometimes need such a check on a QueryPlan, for example in InlineCTE.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests

somani · 2021-11-02T18:29:19Z

cc @cloud-fan @sigmod

HyukjinKwon · 2021-11-03T05:24:28Z

add to whitelist

cloud-fan · 2021-11-03T05:59:43Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

@@ -1931,18 +1931,29 @@ class SubquerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
        sql(
          """
            |SELECT c1, s, s * 10 FROM (
-            |  SELECT c1, (SELECT FIRST(c2) FROM t2 WHERE t1.c1 = t2.c1) s FROM t1)
+            |  SELECT c1, (SELECT MIN(c2) FROM t2 WHERE t1.c1 = t2.c1) s FROM t1)


what's the error if we don't make this change?

This one is fine, but the one below fails with:
Failed to analyze query: org.apache.spark.sql.AnalysisException: nondeterministic expression sum(scalarsubquery(t1.c1)) should not appear in the arguments of an aggregate function.;

Just a side note - I have been arguing, that first/last should be deterministic functions, but it has not gotten any attention - #29810.

Can we not change the test query and assert the error instead?

> Just a side note - I have been arguing, that first/last should be deterministic functions

+1 even though FIRST/LAST are not truly deterministic during execution.

The purpose of this field is for determining the eligibility of query rewrites. Postgres has a nice categorization of those:
https://www.postgresql.org/docs/8.3/xfunc-volatility.html

SUM and AVG are not completely deterministic either (when run in a distributed fashion), but we can still do query optimizations over them, and I think it'd be fine for LAST/FIRST too. In contrast, rand() has to be marked as non-deterministic because we don't want query rewrites to move, duplicate, or dedup it.
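To make the point about duplication concrete, here is a small sketch that mirrors the shape of the test query (SELECT s, s * 10 over an inner query that defines s). It uses a hypothetical toy expression model (Expr, Lit, Rand, Times) and hypothetical helper functions, not Spark's classes. Inlining the inner query evaluates s once per reference, which is harmless for a deterministic expression but breaks the relationship between the two columns for a rand()-like one.

```scala
import scala.util.Random

// Toy expression model; all names are hypothetical.
sealed trait Expr {
  def deterministic: Boolean
  def eval(rng: Random): Double
}
case class Lit(value: Double) extends Expr {
  val deterministic = true
  def eval(rng: Random): Double = value
}
case object Rand extends Expr {
  val deterministic = false
  def eval(rng: Random): Double = rng.nextDouble()
}
case class Times(left: Expr, right: Expr) extends Expr {
  val deterministic: Boolean = left.deterministic && right.deterministic
  def eval(rng: Random): Double = left.eval(rng) * right.eval(rng)
}

// Unrewritten plan: the inner query computes s once and both output columns reuse it.
def evalOnce(s: Expr, rng: Random): (Double, Double) = {
  val v = s.eval(rng)
  (v, v * 10)
}

// Rewritten plan: the inner query is inlined, so s is evaluated once per reference.
def evalInlined(s: Expr, rng: Random): (Double, Double) =
  (s.eval(rng), Times(s, Lit(10)).eval(rng))

// A rewrite guarded by the flag duplicates only deterministic expressions.
def evalGuarded(s: Expr, rng: Random): (Double, Double) =
  if (s.deterministic) evalInlined(s, rng) else evalOnce(s, rng)
```

With Lit as s, evalOnce and evalInlined agree; with Rand, evalInlined draws two different random numbers, so the second column is no longer ten times the first. The guard in evalGuarded is the kind of check the deterministic flag is meant to enable.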

SparkQA · 2021-11-03T06:25:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49341/

SparkQA · 2021-11-03T07:10:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49341/

SparkQA · 2021-11-03T10:23:19Z

Test build #144871 has finished for PR 34470 at commit 18f7f17.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-11-03T16:29:10Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

            |""".stripMargin),
        correctAnswer)
      checkAnswer(
        sql(
          """
            |SELECT c1, s, s * 10 FROM (
-            |  SELECT c1, SUM((SELECT FIRST(c2) FROM t2 WHERE t1.c1 = t2.c1)) s


Does this query also fail in other databases like pgsql?

Isn't this subquery semantically the same as SELECT c1, SUM((SELECT c2 FROM t2 WHERE t1.c1 = t2.c1 LIMIT 1)) s FROM t1 GROUP BY c1? Spark currently does not support LIMIT to be on the correlation path, but this subquery, according to the current logic, is deterministic.

> this subquery, according to the current logic, is deterministic

It seems fine to mark first/last deterministic? #34470 (comment)

#29810 has been merged, @somani can you restore the original test?

Done. Thanks @cloud-fan!

SparkQA · 2021-11-05T11:43:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49401/

SparkQA · 2021-11-05T12:40:56Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49401/

SparkQA · 2021-11-05T16:01:46Z

Test build #144930 has finished for PR 34470 at commit 5ed47eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-11-08T04:08:51Z

thanks, merging to master!

HyukjinKwon · 2021-12-15T01:09:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

+   * Returns true when the all the expressions in the current node as well as all of its children
+   * are deterministic
+   */
+  lazy val deterministic: Boolean = expressions.forall(_.deterministic) &&


qq: should we mark all non-deterministic plans as such, e.g. Sample?

HyukjinKwon · 2021-12-15T01:13:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

+   * Returns true when the all the expressions in the current node as well as all of its children
+   * are deterministic
+   */
+  lazy val deterministic: Boolean = expressions.forall(_.deterministic) &&


Wait .. why is this in query plan? What about physical plans vs logical plans? should both be marked?

I think we should move this to logical plan only, since it doesn't make sense for physical plans to have different determinism.

physical plan can override this lazy val if it has custom logic, right?

Can a physical plan have different determinism from its logical plan?

e.g., Sample is non-deterministic. I think the physical plans of Sample should always be non-deterministic. Otherwise, the output would be inconsistent depending on which physical plan is used. The same holds for the opposite case.

yea, if we override this lazy val in a logical plan, we should do it in the corresponding physical plan as well.

Moving this to logical plan is also OK, if we don't need it in physical plan at all. cc @maryannxue
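The override being discussed could look roughly like the sketch below. It is a simplified, hypothetical hierarchy (Plan, LogicalNode, PhysicalNode, ToySample, ToySampleExec are made-up names, not Spark's Sample and SampleExec), and it follows the thread's assumption that sampling should be treated as non-deterministic.

```scala
// Hypothetical, simplified plan hierarchy; not Spark's actual classes.
trait Plan {
  def children: Seq[Plan]
  // Default: a node is deterministic when all of its children are.
  lazy val deterministic: Boolean = children.forall(_.deterministic)
}
trait LogicalNode extends Plan
trait PhysicalNode extends Plan

case class Relation(name: String) extends LogicalNode {
  val children: Seq[Plan] = Nil
}
case class RelationExec(name: String) extends PhysicalNode {
  val children: Seq[Plan] = Nil
}

// The thread treats sampling as non-deterministic, so the logical node overrides the default.
case class ToySample(fraction: Double, child: LogicalNode) extends LogicalNode {
  val children: Seq[Plan] = Seq(child)
  override lazy val deterministic: Boolean = false
}

// The matching physical node repeats the override so both levels report the same answer.
case class ToySampleExec(fraction: Double, child: PhysicalNode) extends PhysicalNode {
  val children: Seq[Plan] = Seq(child)
  override lazy val deterministic: Boolean = false
}
```

The duplication between ToySample and ToySampleExec is the cost raised in this thread: keeping the flag on the shared base type means every overriding logical node needs a matching override on its physical counterpart, whereas defining it on logical plans only would avoid that.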

So if we optimize something, that should always happen in optimizer with logical plans ... right?

If we can do something with physical plans, we will have to add another argument to every non-deterministic plan, e.g.:

  case class Sample(
      lowerBound: Double,
      upperBound: Double,
      withReplacement: Boolean,
      seed: Long,
+     deterministic: Boolean,
      child: LogicalPlan) extends UnaryNode {

  case class SampleExec(
      lowerBound: Double,
      upperBound: Double,
      withReplacement: Boolean,
      seed: Long,
+     deterministic: Boolean,
      child: SparkPlan) extends UnaryExecNode with CodegenSupport {

which is quite different from how we do it in Expression.

Otherwise, we will have to recalculate it for each plan, etc.

…lan (#879)

* [SPARK-37199][SQL] Add deterministic field to QueryPlan

### What changes were proposed in this pull request?

We have a deterministic field in Expressions to check if an expression is deterministic, but we do not have a similar field in QueryPlan.

We have a need for such a check in the QueryPlan sometimes, like in InlineCTE

This proposal is to add a deterministic field to QueryPlan.

More details in this document: https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#heading=h.4cz970y1mk93

### Why are the changes needed?

We have a need for such a check in the QueryPlan sometimes, like in InlineCTE

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests

Closes #34470 from somani/isDeterministic.

Authored-by: Abhishek Somani <abhishek.somani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fe41d18)

* Fix test error

Co-authored-by: Abhishek Somani <abhishek.somani@databricks.com>

somani added 2 commits November 1, 2021 12:09

Init commit: add deterministic to query plan

8adad2d

Added jira id to test

18f7f17

github-actions bot added the SQL label Nov 2, 2021

HyukjinKwon changed the title ~~[SPARK-37199][SQL]: Add deterministic field to QueryPlan~~ [SPARK-37199][SQL] Add deterministic field to QueryPlan Nov 3, 2021

cloud-fan reviewed Nov 3, 2021

somani added 2 commits November 5, 2021 06:43

Merge branch 'master' into isDeterministic

5c73849

Restore test

5ed47eb

cloud-fan approved these changes Nov 5, 2021

cloud-fan approved these changes Nov 8, 2021

cloud-fan closed this in fe41d18 Nov 8, 2021

HyukjinKwon reviewed Dec 15, 2021

WangGuangxin mentioned this pull request Feb 23, 2022

[SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend #35460

Closed
