Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Simplify the aggregation logic #3194

Closed
revans2 opened this issue Aug 11, 2021 · 2 comments · Fixed by #3910
Closed

[FEA] Simplify the aggregation logic #3194

revans2 opened this issue Aug 11, 2021 · 2 comments · Fixed by #3910
Assignees
Labels
task Work required that improves the product but is not user facing

Comments

@revans2
Copy link
Collaborator

revans2 commented Aug 11, 2021

This is pure tech debt and is a follow on from #3178 (comment). The aggregation code is horribly complicated and is doing lots of special cases that the open source Spark code is not doing. I am not 100% sure why we have done all of this special case work. A lot of it feels like setupReferences is part of the issue and we need to do a deeper dive into it.

As a high level explanation of how this is supposed to work.

Each aggregation comes with a mode. Each mode tells the aggregation what to do as a part of that stage. Originally the code assumed that there would only ever be one mode for all of the aggregations. I thought we had ripped that all out and each aggregation does the right thing.

To successfully do an aggregation there are a few steps used.

  1. Initial projection to get input data setup properly.
  2. Initial aggregation to produce partial result(s)
  3. Merge aggregation to combine partial results (This requires that the input schema and the output schema be the same)
  4. Final projection to take the combined partial results and produce a final result.

In general the steps take the pattern 1, 2, 3*, 4. Which means 1, 2 and 4 are required and step 3 can be done as often as needed because the input and output schemas are the same.

Step 4 requires that all of the data for a given group by key is on the same task and has been merged into a single output row. There are several different ways to do this, which is why we end up with several aggregation modes.

  • Partial mode means that we do Step 1 and Step 2. Then we can do Step 3 as many times as needed depending on how we are doing memory management, and how many batches are needed.
  • PartialMerge mode means we can do Step 3 at least once and possibly more times depending on how we are doing memory management and how many batches are needed.
  • Final mode means that we do the same steps as with PartialMerge but do Step 4 when we are done doing the partial merges.
  • Complete mode is something only Databricks does, but it essentially means we do Step 1, Step 2, Step 3* (depending on memory management requirements), and Step 4 all at once.

The majority of the complexity appears to be in setupReferences where it tries to match up the data coming into the AggregateExec with what each aggregation wants at a given stage.

I think @abellina said it well

I think the main ask is to not ... assum{e}... that a hash aggregate exec has a certain shape. If this function could decide per aggregate expression mode what the right binding should be, it should be more robust to new aggregate exec setups that mix and match modes (if we encounter new ones). ...

@revans2 revans2 added ? - Needs Triage Need team to review and classify task Work required that improves the product but is not user facing labels Aug 11, 2021
@abellina abellina self-assigned this Aug 11, 2021
@Salonijain27 Salonijain27 removed the ? - Needs Triage Need team to review and classify label Aug 17, 2021
@abellina
Copy link
Collaborator

Related issue: #3442

@abellina
Copy link
Collaborator

abellina commented Sep 14, 2021

PR #3417 helped move this issue forward, at least in figuring out the direction we want to take, see: #3417 (comment).

I am working on the positional changes here: https://github.com/abellina/spark-rapids/tree/agg/pre_post_steps_positional, and will merge up with the setupReferences cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants