Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33442][SQL] Change Combine Limit to Eliminate limit using max row #30368

Closed
wants to merge 17 commits into from

Conversation

ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Nov 13, 2020

What changes were proposed in this pull request?

Change CombineLimits name to EliminateLimits and add check if Limit child max row <= limit.

Why are the changes needed?

In Add-hoc scene, we always add limit for the query if user have no special limit value, but not all limit is nesessary.

A general negative example is

select count(*) from t limit 100000;

It will be great if we can eliminate limit at Spark side.

Also, we make a benchmark for this case

runBenchmark("Sort and Limit") {
  val N = 100000
  val benchmark = new Benchmark("benchmark sort and limit", N)

  benchmark.addCase("TakeOrderedAndProject", 3) { _ =>
    spark.range(N).toDF("c").repartition(200).sort("c").take(200000)
  }

  benchmark.addCase("Sort And Limit", 3) { _ =>
    withSQLConf("spark.sql.execution.topKSortFallbackThreshold" -> "-1") {
      spark.range(N).toDF("c").repartition(200).sort("c").take(200000)
    }
  }

  benchmark.addCase("Sort", 3) { _ =>
    spark.range(N).toDF("c").repartition(200).sort("c").collect()
  }
  benchmark.run()
}

and the result is

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.15.6
Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
benchmark sort and limit:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject                              1833           2259         382          0.1       18327.1       1.0X
Sort And Limit                                     1417           1658         285          0.1       14167.5       1.3X
Sort                                               1324           1484         225          0.1       13238.3       1.4X

It shows that it makes sense to replace TakeOrderedAndProjectExec with Sort + Project.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add test.

@github-actions github-actions bot added the SQL label Nov 13, 2020
@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35659/

@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Test build #131054 has finished for PR 30368 at commit 67fd737.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35659/

@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35670/

@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35670/

@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35672/

@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35672/

@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Test build #131066 has finished for PR 30368 at commit 0a89334.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2020

Test build #131068 has finished for PR 30368 at commit 2c6b171.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35695/

@SparkQA
Copy link

SparkQA commented Nov 14, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35695/

@SparkQA
Copy link

SparkQA commented Nov 14, 2020

Test build #131092 has finished for PR 30368 at commit c167a0d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1,5 +1,5 @@
== Physical Plan ==
TakeOrderedAndProject (34)
* Sort (34)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It changes from TakeOrderedAndProject to Sort seems because Limit after Sort is removed?

It might have additional shuffle for global Sort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This q92 sql:

SELECT sum(ws_ext_discount_amt) AS `Excess Discount Amount `
FROM web_sales, item, date_dim
WHERE i_manufact_id = 350
  AND i_item_sk = ws_item_sk
  AND d_date BETWEEN '2000-01-27' AND (cast('2000-01-27' AS DATE) + INTERVAL 90 days)
  AND d_date_sk = ws_sold_date_sk
  AND ws_ext_discount_amt >
  (
    SELECT 1.3 * avg(ws_ext_discount_amt)
    FROM web_sales, date_dim
    WHERE ws_item_sk = i_item_sk
      AND d_date BETWEEN '2000-01-27' AND (cast('2000-01-27' AS DATE) + INTERVAL 90 days)
      AND d_date_sk = ws_sold_date_sk
  )
ORDER BY sum(ws_ext_discount_amt)
LIMIT 100

yes, Limit after Sort is a special case, we will convert to TakeOrderedAndProject, but it seems not necessary to do both sort and limit if child maxRow == 1. Maybe we can do an another check seems like if sort.child.maxRow <= 1 then remove sort ?

Copy link
Contributor Author

@ulysses-you ulysses-you Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other thought, we can add this pattern

case sort @ Sort(order, true, child)
          if sort.maxRow < conf.topKSortFallbackThreshold =>
        TakeOrderedAndProjectExec()

In this way, we can infer the TakeOrderedAndProjectExec from Sort which has not Limit after.

What do you think about this? @maropu @viirya @cloud-fan

Copy link
Member

@maropu maropu Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, @viirya , nice catch! Yea, how about simply excluding the case in the EliminateLimits rule? @ulysses-you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@SparkQA
Copy link

SparkQA commented Nov 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35707/

@SparkQA
Copy link

SparkQA commented Nov 15, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35707/

@SparkQA
Copy link

SparkQA commented Nov 15, 2020

Test build #131104 has finished for PR 30368 at commit ff1ab01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131160 has finished for PR 30368 at commit 4d45c0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35763/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35763/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35784/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35784/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Test build #131182 has finished for PR 30368 at commit 2f8f139.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35800/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35800/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35802/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35802/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Test build #131199 has finished for PR 30368 at commit 45ab8b5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Test build #131197 has finished for PR 30368 at commit 9fb4039.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35828/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35828/

@SparkQA
Copy link

SparkQA commented Nov 17, 2020

Test build #131225 has finished for PR 30368 at commit 2f55ec0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Hi @ulysses-you , can you put the conclusion of #30368 (comment) in PR description, to mention that we may end up replacing TakeOrderedAndProjectExec with Sort + Project, which is faster?

@ulysses-you
Copy link
Contributor Author

@cloud-fan updated the description.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 21b1350 Nov 19, 2020
@ulysses-you
Copy link
Contributor Author

thanks for merging!

@ulysses-you ulysses-you deleted the SPARK-33442 branch March 3, 2021 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants