[SPARK-32940][SQL] Collect, first and last should be deterministic aggregate functions #29810

tanelk · 2020-09-19T18:22:57Z

What changes were proposed in this pull request?

Collect, first and last have mistakenly been marked as non-deterministic. They are actually deterministic iff their child expression is deterministic.

For example, collect was marked as non-deterministic in #14749. The reasoning was that its output depends on the actual order of input rows. It is correct that these aggregators depend on the order of input rows, but that does not make them non-deterministic.

In the EliminateSorts optimizer rule, there is a method, isOrderIrrelevantAggs, that lists all aggregators that do not depend on their input row order. Collect, first and last are correctly not listed there.
An aggregator would be non-deterministic if its output for a group depended on previous groups it has aggregated - I can't think of any practical examples of this kind of aggregator in Spark.

An analogous aggregator is sum on the float and double data types - its result does depend on the order of its inputs, but it is deterministic. Other similar aggregates are max_by and min_by - deterministic functions that can return different results when the order of rows changes.
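To make the distinction concrete, here is a minimal pure-Python sketch (not Spark code) of an order-dependent but deterministic aggregate. The helper names `left_fold_sum` and `first` are hypothetical, and the left-to-right fold is only an assumed stand-in for a running-sum buffer.

```python
from functools import reduce


def left_fold_sum(values):
    """Add floats strictly left to right, like a simple running-sum buffer."""
    return reduce(lambda acc, x: acc + x, values, 0.0)


def first(values):
    """Return the first value seen, mimicking a `first` aggregate."""
    return values[0]


rows = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

# Same input in the same order gives the same output (deterministic).
assert left_fold_sum(rows) == left_fold_sum(list(rows))
assert first(rows) == first(list(rows))

# A different input order can give a different output (order-dependent).
print(left_fold_sum(rows))       # 0.0, because 1e16 + 1.0 rounds back to 1e16
print(left_fold_sum(reordered))  # 1.0
```

Both functions are pure functions of their input sequence; only the sequence itself changes between the two calls.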

Why are the changes needed?

The optimizer rule PushPredicateThroughNonJoin can work in more cases.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests (UT).


tanelk · 2020-09-19T18:28:46Z

...c/test/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparisonSuite.scala

-    Seq(positiveInt, negativeInt).foreach (v => {
-      val e = Cast(First(f, ignoreNulls = true), IntegerType) <=> v
+    Seq(positiveLong, negativeLong).foreach (v => {
+      val e = Cast(SparkPartitionID(), LongType) <=> v
      assertEquivalent(e, e, evaluate = false)
-      val e2 = Cast(Literal(30.toShort), IntegerType) >= v
+      val e2 = Cast(Literal(30), LongType) >= v
      assertEquivalent(e2, e2, evaluate = false)


There was no other non-deterministic expression that can return a short, so I had to change this test a bit.

tanelk · 2020-09-19T18:32:09Z

@cloud-fan, you reviewed the related pull request (although years back).

dongjoon-hyun · 2020-09-20T01:09:46Z

ok to test

HyukjinKwon · 2020-09-20T01:18:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala

@@ -63,9 +63,6 @@ case class First(child: Expression, ignoreNulls: Boolean)

  override def nullable: Boolean = true

-  // First is not a deterministic function.
-  override lazy val deterministic: Boolean = false


I think you may need to update the note above to say something like "The function can be non-deterministic because its results depend on the order of input rows, which is usually non-deterministic after a shuffle." You might also need to update functions.py, functions.R and functions.scala.

SparkQA · 2020-09-20T03:57:10Z

Test build #128898 has finished for PR 29810 at commit b9fd2f1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-20T07:05:01Z

Test build #128910 has finished for PR 29810 at commit b0919a2.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-21T07:59:37Z

> In EliminateSorts optimizer rule, there is a method isOrderIrrelevantAggs, that lists all aggregators that do not depend on their input row order. Collect, first and last are correctly not listed there.

Hmm, it would be pretty weird if we listed first/last there. Removing the sort would definitely change the query result, wouldn't it?

tanelk · 2020-09-21T09:34:35Z

> > In EliminateSorts optimizer rule, there is a method isOrderIrrelevantAggs, that lists all aggregators that do not depend on their input row order. Collect, first and last are correctly not listed there.
>
> hmm, it's pretty weird if we list first/last there. Removing sort will definitely change the query result, doesn't it?

Sorry if I didn't word it clearly - these are not listed there. I was trying to illustrate the difference between deterministic and order-irrelevant.

hvanhovell · 2020-09-21T11:41:57Z

Maybe I am missing something here. AFAIK the problem with the First/Last/CollectList methods is that we can't control how results are merged. This depends on how the shuffle fetches results, and that is not deterministic.

tanelk · 2020-09-21T12:57:37Z

> Maybe I am missing something here. AFAIK the problem with First/Last/CollectList methods is that we can't control how results are merged. This depends on how we shuffle fetches results and this is not deterministic.

You are 100% correct. As a user, this is how I would also understand the term deterministic.
But internally, deterministic has a different meaning. By the user-facing understanding, Sum would also have to be non-deterministic when its input type is float or double.

I'll copy our internal definition:

   * Note that this means that an expression should be considered as non-deterministic if:
   * - it relies on some mutable internal state, or
   * - it relies on some implicit input that is not part of the children expression list.
   * - it has non-deterministic child or children.
   * - it assumes the input satisfies some certain condition via the child operator.
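
As an illustration of the "non-deterministic child or children" rule, here is a minimal Python model of how such a flag can propagate up an expression tree. The `Expr` class and its `inherently_nondeterministic` field are hypothetical and only mimic the idea; they are not Spark's classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expr:
    """Hypothetical expression node; not Spark's Expression class."""
    name: str
    children: List["Expr"] = field(default_factory=list)
    # True for nodes that rely on hidden mutable state or implicit input.
    inherently_nondeterministic: bool = False

    @property
    def deterministic(self) -> bool:
        # A node is deterministic only if it is not inherently
        # non-deterministic and all of its children are deterministic.
        if self.inherently_nondeterministic:
            return False
        return all(child.deterministic for child in self.children)


col_a = Expr("a")
rand = Expr("rand", inherently_nondeterministic=True)
first_of_a = Expr("first", [col_a])
first_of_rand = Expr("first", [rand])

print(first_of_a.deterministic)     # True
print(first_of_rand.deterministic)  # False
```

Under this model, an aggregate such as first is deterministic exactly when its child expression is.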

For aggregation expressions, the internal state part can introduce extra confusion. Of course all of them keep some internal state about the current group they are aggregating (running count, largest value seen so far, etc.), but they do not "remember" the previous groups they have aggregated.

There is a separate optimizer rule, EliminateSorts, that keeps track of aggregators that do not depend on input order (max, count, etc.). But these are only a subset of all deterministic aggregators.

For context, why this is relevant:
A snippet from PushPredicateThroughNonJoin

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Lines 1138 to 1141 in c336ddf

    case filter @ Filter(condition, aggregate: Aggregate)
      if aggregate.aggregateExpressions.forall(_.deterministic)
        && aggregate.groupingExpressions.nonEmpty =>
      val aliasMap = getAliasMap(aggregate)

In short, this case pushes the filter below the aggregation, so groups are filtered out before their values are aggregated. Within one group the aggregator still sees the same rows in the same order, but it no longer sees the groups that were filtered out. This would change the output of an aggregator that remembers previous groups (non-deterministic), but it would not change the output of an aggregator that only cares about the current group (deterministic, but possibly order-relevant).
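
The effect of pushing the filter below the aggregation can be sketched with a small pure-Python simulation. Everything here is hypothetical and for illustration only: `first_agg` stands in for a per-group aggregate like first, and `leaky_agg` is a made-up aggregator that remembers how many groups it has already processed.

```python
def group_rows(rows):
    """Group (key, value) rows by key, preserving row order within each group."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return groups


def first_agg(values, state):
    """Order-dependent but per-group: only looks at the current group."""
    return values[0]


def leaky_agg(values, state):
    """Remembers earlier groups through shared state (non-deterministic
    in the internal sense discussed in this thread)."""
    state["groups_seen"] = state.get("groups_seen", 0) + 1
    return values[0] + state["groups_seen"]


def filter_after(rows, agg, keep):
    """Aggregate every group, then filter on the grouping key."""
    state = {}
    out = {k: agg(v, state) for k, v in group_rows(rows).items()}
    return {k: v for k, v in out.items() if keep(k)}


def filter_before(rows, agg, keep):
    """Filter rows on the grouping key first, then aggregate what is left."""
    state = {}
    kept = [(k, v) for k, v in rows if keep(k)]
    return {k: agg(v, state) for k, v in group_rows(kept).items()}


rows = [("x", 10), ("y", 20), ("x", 11), ("y", 21)]


def keep_y(key):
    return key == "y"


# Pushing the filter down is safe for the per-group aggregate ...
assert filter_after(rows, first_agg, keep_y) == filter_before(rows, first_agg, keep_y)
# ... but changes the result for the aggregate that remembers previous groups.
assert filter_after(rows, leaky_agg, keep_y) != filter_before(rows, leaky_agg, keep_y)
```

In this sketch the row order inside group "y" is identical in both plans, which is why the order-dependent `first_agg` is unaffected.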


SparkQA · 2020-12-30T21:50:18Z

Test build #133545 has finished for PR 29810 at commit a080b53.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-31T02:52:13Z

Test build #133547 has finished for PR 29810 at commit dc6e7c0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-06-24T16:17:24Z

Test build #140262 has finished for PR 29810 at commit e5e9a04.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-09-30T08:33:09Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48272/

SparkQA · 2021-09-30T08:54:40Z

Test build #143761 has finished for PR 29810 at commit 56fbf15.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-09-30T09:32:01Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48272/

SparkQA · 2021-10-19T09:20:58Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48881/

SparkQA · 2021-10-19T10:02:13Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48881/

SparkQA · 2021-10-19T12:45:03Z

Test build #144407 has finished for PR 29810 at commit 0d40311.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class HistogramPlotBase(NumericPlotBase):
class KdePlotBase(NumericPlotBase):
new_class = type(NameTypeHolder.short_name, (NameTypeHolder,),
class Database(NamedTuple):
class Table(NamedTuple):
class Column(NamedTuple):
class Function(NamedTuple):
class SparkUpgradeException(CapturedException):
protected class YarnSchedulerEndpoint(override val rpcEnv: RpcEnv)
public class ExpressionImplUtils
class IndexAlreadyExistsException(message: String, cause: Option[Throwable] = None)
class NoSuchIndexException(message: String, cause: Option[Throwable] = None)
trait ExtractValue extends Expression with NullIntolerant
case class AesEncrypt(input: Expression, key: Expression, child: Expression)
case class AesDecrypt(input: Expression, key: Expression, child: Expression)
case class SetCatalogAndNamespace(child: LogicalPlan) extends UnaryCommand
case class CreateFunction(
case class CreateView(
case class SetCatalogCommand(catalogName: String) extends LeafRunnableCommand
case class SetNamespaceCommand(namespace: Seq[String]) extends LeafRunnableCommand
case class ShowCatalogsCommand(pattern: Option[String]) extends LeafRunnableCommand
case class HashedRelationBroadcastMode(key: Seq[Expression], isNullAware: Boolean = false)

cloud-fan · 2021-11-04T17:35:03Z

There is inevitable randomness in the input of aggregate functions, because the shuffle reader may produce data in a random order, and we are not able to completely eliminate this randomness. For example, even sum has randomness, as adding up floating-point values in different orders can lead to slightly different results.

I don't think first/last/collect has a significant difference and +1 to mark them as deterministic.

cloud-fan · 2021-11-04T17:35:32Z

retest this please

cloud-fan · 2021-11-04T17:37:42Z

...c/test/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparisonSuite.scala

-    Seq(positiveInt, negativeInt).foreach(v => {
-      val e = Cast(First(f, ignoreNulls = true), IntegerType) <=> v
+   Seq(positiveLong, negativeLong).foreach (v => {
+     val e = Cast(SparkPartitionID(), LongType) <=> v


can we use Rand?

cloud-fan · 2021-11-04T17:38:22Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateDistinctSuite.scala

    CollectSet(_: Expression)
  ).foreach {
    aggBuilder =>
      val agg = aggBuilder('a)
-      test(s"Eliminate Distinct in ${agg.prettyName}") {
+      test(s"Eliminate Distinct in ${agg.toString}") {


nit: just $agg

SparkQA · 2021-11-04T18:28:29Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49375/

SparkQA · 2021-11-04T19:21:30Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49375/

AmplabJenkins · 2021-11-04T19:40:45Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49375/

SparkQA · 2021-11-04T22:45:41Z

Test build #144906 has finished for PR 29810 at commit 0d40311.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class HistogramPlotBase(NumericPlotBase):
class KdePlotBase(NumericPlotBase):
new_class = type(NameTypeHolder.short_name, (NameTypeHolder,),
class Database(NamedTuple):
class Table(NamedTuple):
class Column(NamedTuple):
class Function(NamedTuple):
class SparkUpgradeException(CapturedException):
protected class YarnSchedulerEndpoint(override val rpcEnv: RpcEnv)
public class ExpressionImplUtils
class IndexAlreadyExistsException(message: String, cause: Option[Throwable] = None)
class NoSuchIndexException(message: String, cause: Option[Throwable] = None)
trait ExtractValue extends Expression with NullIntolerant
case class AesEncrypt(input: Expression, key: Expression, child: Expression)
case class AesDecrypt(input: Expression, key: Expression, child: Expression)
case class SetCatalogAndNamespace(child: LogicalPlan) extends UnaryCommand
case class CreateFunction(
case class CreateView(
case class SetCatalogCommand(catalogName: String) extends LeafRunnableCommand
case class SetNamespaceCommand(namespace: Seq[String]) extends LeafRunnableCommand
case class ShowCatalogsCommand(pattern: Option[String]) extends LeafRunnableCommand
case class HashedRelationBroadcastMode(key: Seq[Expression], isNullAware: Boolean = false)

AmplabJenkins · 2021-11-04T22:47:50Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144906/

SparkQA · 2021-11-05T07:54:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49393/

SparkQA · 2021-11-05T08:54:23Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49393/

AmplabJenkins · 2021-11-05T08:54:26Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49393/

cloud-fan · 2021-11-05T09:04:47Z

thanks, merging to master!

SparkQA · 2021-11-05T11:50:46Z

Test build #144921 has finished for PR 29810 at commit e4ed57c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2021-11-05T11:53:02Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144921/

tanelk added 9 commits September 17, 2020 22:46

Bitwise operations are commutative

cf6c7e9

Experiment with SQLQueryTestSuite

317f313

Optimizer rules

281ed68

Collect, first and last should be deterministic aggregate functions

326ec05

Fix test, that required non-deterministic expression

ab2901e

Revert "Optimizer rules"

e8badd1

This reverts commit 281ed68

Revert "Experiment with SQLQueryTestSuite"

f1e6711

This reverts commit 317f313

Revert "Bitwise operations are commutative"

c17f2ef

This reverts commit cf6c7e9

Merge branch 'master' into SPARK-32940

3e98b13

# Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparisonSuite.scala

probot-autolabeler bot added the SQL label Sep 19, 2020

Fix test, that required non-deterministic expression

b9fd2f1

tanelk commented Sep 19, 2020

HyukjinKwon reviewed Sep 20, 2020

tanelk added 2 commits September 20, 2020 08:10

Fix test, that required non-deterministic aggregator

9898c56

Improve docstrings

b0919a2

probot-autolabeler bot added PYTHON R labels Sep 20, 2020

Merge branch 'master' into SPARK-32940_deterministic_agg

a080b53

# Conflicts: # python/pyspark/sql/functions.py # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparisonSuite.scala

github-actions bot added the CORE label Dec 30, 2020

Fix merge

dc6e7c0

Merge branch 'master' into SPARK-32940

56fbf15

Merge remote-tracking branch 'upstream/master' into SPARK-32940

0d40311

tanelk mentioned this pull request Nov 3, 2021

[SPARK-37199][SQL] Add deterministic field to QueryPlan #34470

Closed

cloud-fan reviewed Nov 4, 2021

cloud-fan approved these changes Nov 4, 2021

Address comments

e4ed57c

cloud-fan closed this in 58e07e0 Nov 5, 2021

tanelk deleted the SPARK-32940 branch November 5, 2021 10:52

nartal1 mentioned this pull request Feb 2, 2022

Make Collect, first and last as deterministic aggregate functions for Spark-3.3 NVIDIA/spark-rapids#4677

Merged
