[SPARK-33442][SQL] Change Combine Limit to Eliminate limit using max row #30368
Conversation
```diff
@@ -1,5 +1,5 @@
 == Physical Plan ==
-TakeOrderedAndProject (34)
+* Sort (34)
```
It changes from `TakeOrderedAndProject` to `Sort`, seemingly because the `Limit` after `Sort` is removed? It might introduce an additional shuffle for the global `Sort`.
This is the q92 SQL:

```sql
SELECT sum(ws_ext_discount_amt) AS `Excess Discount Amount `
FROM web_sales, item, date_dim
WHERE i_manufact_id = 350
  AND i_item_sk = ws_item_sk
  AND d_date BETWEEN '2000-01-27' AND (cast('2000-01-27' AS DATE) + INTERVAL 90 days)
  AND d_date_sk = ws_sold_date_sk
  AND ws_ext_discount_amt >
    (
      SELECT 1.3 * avg(ws_ext_discount_amt)
      FROM web_sales, date_dim
      WHERE ws_item_sk = i_item_sk
        AND d_date BETWEEN '2000-01-27' AND (cast('2000-01-27' AS DATE) + INTERVAL 90 days)
        AND d_date_sk = ws_sold_date_sk
    )
ORDER BY sum(ws_ext_discount_amt)
LIMIT 100
```
Yes, `Limit` after `Sort` is a special case; we convert it to `TakeOrderedAndProject`. But it seems unnecessary to do both the sort and the limit if the child's max row is at most 1. Maybe we can add another check: if `sort.child.maxRow <= 1`, then remove the `Sort`?
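For illustration only, the idea can be sketched with a toy plan model (hypothetical class names, not Spark's actual `LogicalPlan`/`Sort` API): a `Sort` whose child yields at most one row is a no-op and can be dropped.

```scala
// Toy plan model with hypothetical names; not Spark's real optimizer classes.
sealed trait Plan { def maxRows: Option[Long] }
case class Leaf(maxRows: Option[Long]) extends Plan
case class Sort(child: Plan) extends Plan {
  // Sorting does not change the number of rows produced.
  def maxRows: Option[Long] = child.maxRows
}

// Drop a Sort whose child can produce at most one row:
// ordering a single row (or an empty result) changes nothing.
def removeRedundantSort(plan: Plan): Plan = plan match {
  case Sort(child) if child.maxRows.exists(_ <= 1) => child
  case other => other
}
```

When the child's row count is unknown (`None`), the sort is conservatively kept.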
Another thought: we can add this pattern:

```scala
case sort @ Sort(order, true, child)
    if sort.maxRow < conf.topKSortFallbackThreshold =>
  TakeOrderedAndProjectExec()
```

In this way, we can infer the `TakeOrderedAndProjectExec` from a `Sort` which has no `Limit` after it.
What do you think about this? @maropu @viirya @cloud-fan
Ah, @viirya, nice catch! Yea, how about simply excluding the case in the `EliminateLimits` rule? @ulysses-you
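A sketch of that exclusion, again with toy classes (hypothetical names, not the real `EliminateLimits` code): the rule skips the `Limit`-over-`Sort` shape so it can still be planned as `TakeOrderedAndProject`, and removes a limit only when the child cannot produce more rows than the limit.

```scala
// Toy plan model with hypothetical names; not Spark's actual rule.
sealed trait Plan { def maxRows: Option[Long] }
case class Scan(maxRows: Option[Long]) extends Plan
case class Sort(child: Plan) extends Plan {
  def maxRows: Option[Long] = child.maxRows
}
case class Limit(n: Long, child: Plan) extends Plan {
  def maxRows: Option[Long] = Some(child.maxRows.fold(n)(math.min(_, n)))
}

def eliminateLimit(plan: Plan): Plan = plan match {
  // Keep Limit over a global Sort: it may become TakeOrderedAndProject later.
  case Limit(_, _: Sort) => plan
  // Otherwise the limit is redundant when the child cannot exceed it.
  case Limit(n, child) if child.maxRows.exists(_ <= n) => child
  case other => other
}
```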
Updated.
LGTM
Hi @ulysses-you, can you put the conclusion of #30368 (comment) in the PR description, to mention that we may end up replacing `TakeOrderedAndProject` with `Sort + Project`?
@cloud-fan updated the description.
thanks, merging to master!
thanks for merging!
What changes were proposed in this pull request?

Change the `CombineLimits` name to `EliminateLimits` and add a check whether the `Limit` child's max row <= limit.

Why are the changes needed?
In the ad-hoc scene, we always add a limit to the query if the user gives no specific limit value, but not all limits are necessary.
A general negative example is:
It would be great if we could eliminate the limit on the Spark side.
Also, we made a benchmark for this case,
and the result is:
It shows that it makes sense to replace `TakeOrderedAndProjectExec` with `Sort + Project`.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add test.
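As a hedged illustration of the "child max row <= limit" check (toy classes with hypothetical names, not the actual Spark implementation): a global aggregate, i.e. one without grouping keys, produces at most one row, so any `LIMIT n` above it is redundant and can be eliminated.

```scala
// Toy plan model, illustrative only; not Spark's Catalyst classes.
sealed trait Plan { def maxRows: Option[Long] }
case class Scan(maxRows: Option[Long]) extends Plan
case class Aggregate(groupingKeys: Seq[String], child: Plan) extends Plan {
  // A global aggregate (no grouping keys) always returns exactly one row.
  def maxRows: Option[Long] = if (groupingKeys.isEmpty) Some(1L) else child.maxRows
}
case class Limit(n: Long, child: Plan) extends Plan {
  def maxRows: Option[Long] = Some(child.maxRows.fold(n)(math.min(_, n)))
}

// Remove a Limit whose child cannot produce more rows than the limit.
def eliminateLimit(plan: Plan): Plan = plan match {
  case Limit(n, child) if child.maxRows.exists(_ <= n) => child
  case other => other
}
```

For example, `SELECT count(*) FROM t LIMIT 100` corresponds to `Limit(100, Aggregate(Nil, ...))`, where the limit is eliminated; a grouped aggregate with unknown cardinality keeps its limit.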