[FEA] Batch sort before expand for some distinct aggregations #10560

revans2 · 2024-03-07T14:57:02Z

Is your feature request related to a problem? Please describe.
When a query has more than one distinct aggregation, Spark currently inserts an ExpandExec followed by two aggregation passes. This lets us do the distinct aggregations at the same time as the non-distinct aggregations, but it gets a little complicated. In some cases our HashAggregate actually ends up being a sort aggregation inside of CUDF, because hash aggregations only work on a small set of aggregation/type combinations. It would be really great if HashAggregate had a way to indicate that the aggregations it is going to do would result in a sort-based aggregation. Upstream operators, like GpuExpandExec, could then recognize this and optionally sort each input batch (not the full input data) so that it satisfies the desired ordering. They would also need a way to signal to GpuHashAggregate that the data is sorted within each batch, which would let it avoid doing the sort in CUDF altogether.
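To make the idea concrete, here is a minimal sketch in plain Python rather than spark-rapids or CUDF code. All names (`sort_batch`, `aggregate_batch`, `merge_partials`, the `already_sorted` flag) are hypothetical and chosen for illustration only. It ignores the expand step itself and only shows the contract being proposed: the upstream operator sorts each batch independently, and the aggregation skips its own sort when told the batch is already sorted.

```python
from itertools import groupby


def _key_of(key_idx):
    # Build a function that extracts the grouping key tuple from a row.
    return lambda row: tuple(row[i] for i in key_idx)


def sort_batch(batch, key_idx):
    """Upstream step: sort ONE batch by the grouping keys.

    Nothing is implied about ordering across batches.
    """
    return sorted(batch, key=_key_of(key_idx))


def aggregate_batch(batch, key_idx, val_idx, already_sorted):
    """Sort-based partial aggregation (sum) over a single batch.

    If the caller signals the batch is already sorted by the keys,
    the sort is skipped and groups are read off as contiguous runs.
    """
    key = _key_of(key_idx)
    rows = batch if already_sorted else sorted(batch, key=key)
    return [(k, sum(r[val_idx] for r in grp)) for k, grp in groupby(rows, key=key)]


def merge_partials(partials):
    """Final pass: combine per-batch partial sums into one result."""
    totals = {}
    for part in partials:
        for k, v in part:
            totals[k] = totals.get(k, 0) + v
    return totals


# Two unsorted input batches of (key, value) rows.
batches = [[("b", 1), ("a", 2), ("b", 3)], [("a", 4), ("c", 5)]]

# Proposed path: sort per batch upstream, then aggregate without sorting.
presorted = [sort_batch(b, [0]) for b in batches]
fast = merge_partials(aggregate_batch(b, [0], 1, True) for b in presorted)

# Existing path: the aggregation sorts each batch itself.
slow = merge_partials(aggregate_batch(b, [0], 1, False) for b in batches)
```

Both paths produce the same totals; the difference is only where the per-batch sort happens. The sketch leaves open the harder question of which ordering still holds after the expand projections null out columns.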

revans2 added the "? - Needs Triage" and "performance" labels Mar 7, 2024

mattahrens removed the "? - Needs Triage" label Mar 7, 2024

winningsix mentioned this issue May 13, 2024

[FEA] Optimize count distinct performance optimization with null columns reuse and post expand coalesce #10799

Closed
