[FEA] Batch sort before expand for some distinct aggregations #10560

Open
revans2 opened this issue Mar 7, 2024 · 0 comments
Labels
performance A performance related task/issue

Comments

revans2 commented Mar 7, 2024

Is your feature request related to a problem? Please describe.
When doing more than one distinct aggregation, Spark currently inserts an ExpandExec followed by two aggregation passes. This lets the distinct aggregations run at the same time as the non-distinct aggregations, but it gets a little complicated. In some cases our HashAggregate actually ends up being a sort-based aggregation inside cuDF, because hash aggregations only work on a small set of aggregation/type combinations. It would be really great if HashAggregate had a way to indicate that the aggregations it is going to do would result in a sort-based aggregation. Upstream operators, like GpuExpandExec, could then recognize this and optionally sort each input batch (not the full input data) so that it satisfies the desired ordering. The upstream operator could then signal to GpuHashAggregate that the data is sorted within each batch, which would let it avoid doing the sort in cuDF altogether.
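
To make the proposed handshake concrete, here is a minimal plain-Python model of the idea. It is only a sketch under stated assumptions: `Batch`, `sorted_by_key`, `expand_and_sort`, `aggregate_batch`, and `merge_partials` are hypothetical names invented for illustration, not part of the plugin or of cuDF, and a simple per-key sum stands in for the real aggregations.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (grouping key, value); hypothetical row shape


@dataclass
class Batch:
    rows: List[Row]
    sorted_by_key: bool = False  # hypothetical "sorted within this batch" flag


def expand_and_sort(batch: Batch, agg_needs_sort: bool) -> Batch:
    """Stand-in for the upstream operator: sort one batch only if asked to."""
    if not agg_needs_sort:
        return batch
    return Batch(sorted(batch.rows, key=itemgetter(0)), sorted_by_key=True)


def aggregate_batch(batch: Batch) -> List[Row]:
    """Sum values per key within a single batch."""
    rows = batch.rows
    if not batch.sorted_by_key:
        # Fallback: the aggregation has to sort the batch itself.
        rows = sorted(rows, key=itemgetter(0))
    # Equal keys are now contiguous, so one pass over the runs is enough.
    return [(key, sum(value for _, value in run))
            for key, run in groupby(rows, key=itemgetter(0))]


def merge_partials(partials: Iterable[List[Row]]) -> Dict[str, int]:
    """Batches are sorted individually, so a key can appear in several of them."""
    totals: Dict[str, int] = {}
    for part in partials:
        for key, value in part:
            totals[key] = totals.get(key, 0) + value
    return totals


batches = [Batch([("b", 2), ("a", 1), ("b", 3)]), Batch([("a", 4), ("c", 5)])]
presorted = [expand_and_sort(b, agg_needs_sort=True) for b in batches]
result = merge_partials(aggregate_batch(b) for b in presorted)
```

Because only individual batches are sorted, the model still needs a merge step across batches. The saving is limited to the per-batch sort that the aggregation would otherwise perform itself.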

revans2 added the "? - Needs Triage" (Need team to review and classify) and "performance" (A performance related task/issue) labels on Mar 7, 2024
mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Mar 7, 2024