This is pure tech debt and is a follow-on from #3178 (comment). The aggregation code is horribly complicated and handles lots of special cases that the open source Spark code does not. I am not 100% sure why we have done all of this special-case work. A lot of it feels like setupReferences is part of the issue, and we need to do a deeper dive into it.
As a high-level explanation of how this is supposed to work:
Each aggregation comes with a mode. Each mode tells the aggregation what to do as a part of that stage. Originally the code assumed that there would only ever be one mode for all of the aggregations. I thought we had ripped that all out and each aggregation does the right thing.
To successfully do an aggregation there are a few steps:

1. Initial projection to get the input data set up properly.
2. Initial aggregation to produce partial result(s).
3. Merge aggregation to combine partial results (this requires that the input schema and the output schema be the same).
4. Final projection to take the combined partial results and produce a final result.
In general the steps follow the pattern 1, 2, 3*, 4: steps 1, 2 and 4 are required, and step 3 can be repeated as often as needed because its input and output schemas are the same.
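The four-step pipeline above can be modeled with a toy `sum(x)` grouped by `k`. This is an illustrative sketch, not the plugin's actual code; all function names here are hypothetical, and the key property to notice is that the merge step's input and output have the same shape, so it can be applied any number of times.

```python
# Illustrative model of the four-step aggregation pipeline for sum(x) group by k.
# All names are hypothetical; this is not the real plugin code.

def initial_projection(rows):
    # Step 1: shape the input for the aggregation (here, just cast x to float).
    return [(k, float(x)) for k, x in rows]

def initial_aggregation(rows):
    # Step 2: produce partial results per key within one batch.
    out = {}
    for k, x in rows:
        out[k] = out.get(k, 0.0) + x
    return list(out.items())

def merge_aggregation(partials_a, partials_b):
    # Step 3: combine partial results. Input and output schemas are the
    # same (key, partial_sum), so this step can be repeated as needed.
    out = dict(partials_a)
    for k, s in partials_b:
        out[k] = out.get(k, 0.0) + s
    return list(out.items())

def final_projection(partials):
    # Step 4: turn merged partials into the final result rows.
    return sorted(partials)

batch1 = [("a", 1), ("b", 2)]
batch2 = [("a", 3)]
p1 = initial_aggregation(initial_projection(batch1))
p2 = initial_aggregation(initial_projection(batch2))
merged = merge_aggregation(p1, p2)   # Step 3 could run again on more batches
result = final_projection(merged)
print(result)  # [('a', 4.0), ('b', 2.0)]
```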
Step 4 requires that all of the data for a given group-by key is on the same task and has been merged into a single output row. There are several different ways to do this, which is why we end up with several aggregation modes.
Partial mode means that we do Step 1 and Step 2. Then we can do Step 3 as many times as needed depending on how we are doing memory management, and how many batches are needed.
PartialMerge mode means we can do Step 3 at least once and possibly more times depending on how we are doing memory management and how many batches are needed.
Final mode means that we do the same steps as with PartialMerge but do Step 4 when we are done doing the partial merges.
Complete mode is something only Databricks does, but it essentially means we do Step 1, Step 2, Step 3* (depending on memory management requirements), and Step 4 all at once.
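The mode descriptions above can be summarized as a mapping from each Spark aggregate mode to the pipeline steps it runs. This table is a sketch derived from the text; the step names and the helper are illustrative, not the plugin's actual API.

```python
# Hypothetical summary of which pipeline steps each aggregate mode runs,
# per the descriptions above. "merge*" means Step 3 repeats as needed
# (memory management, number of batches).

MODE_STEPS = {
    "Partial":      ["project", "aggregate", "merge*"],
    "PartialMerge": ["merge", "merge*"],
    "Final":        ["merge", "merge*", "finalize"],
    # Complete (seen on Databricks): all four steps in one operator.
    "Complete":     ["project", "aggregate", "merge*", "finalize"],
}

def ends_with_final_result(mode):
    # Only modes whose last step is the final projection emit final rows.
    return MODE_STEPS[mode][-1] == "finalize"

print(ends_with_final_result("Final"))    # True
print(ends_with_final_result("Partial"))  # False
```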
The majority of the complexity appears to be in setupReferences where it tries to match up the data coming into the AggregateExec with what each aggregation wants at a given stage.
As @abellina put it: "I think the main ask is to not ... assum[e] ... that a hash aggregate exec has a certain shape. If this function could decide per aggregate expression mode what the right binding should be, it should be more robust to new aggregate exec setups that mix and match modes (if we encounter new ones). ..."
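One way to read that suggestion is as a per-expression dispatch instead of a per-exec assumption. The sketch below is hypothetical and heavily simplified (the real setupReferences deals with actual attribute binding in Scala); it only illustrates choosing a binding target from each aggregate expression's own mode, so an exec that mixes modes needs no special case.

```python
# Hypothetical, simplified sketch of per-expression binding: each aggregate
# expression's mode decides what its input is bound against at this stage.
# Schema names below are illustrative, not the plugin's real identifiers.

def bind_references(agg_exprs):
    """agg_exprs: list of (name, mode) pairs. Returns which schema each
    expression's input should be bound against."""
    bindings = []
    for name, mode in agg_exprs:
        if mode in ("Partial", "Complete"):
            # These modes start from raw input rows (Steps 1 and 2).
            bindings.append((name, "input-schema"))
        elif mode in ("PartialMerge", "Final"):
            # These modes consume partial-result buffers (Steps 3 and 4).
            bindings.append((name, "buffer-schema"))
        else:
            raise ValueError(f"unknown aggregate mode: {mode}")
    return bindings

# A mixed-mode exec no longer needs a special case:
print(bind_references([("sum(x)", "PartialMerge"), ("count(y)", "Partial")]))
# [('sum(x)', 'buffer-schema'), ('count(y)', 'input-schema')]
```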