
[FEA] provide typical aggregation patterns for different spark version/flavor #3437

Open
sperlingxx opened this issue Sep 10, 2021 · 2 comments
Labels
P1 Nice to have for release task Work required that improves the product but is not user facing

Comments

@sperlingxx
Collaborator

Is your feature request related to a problem? Please describe.
We refactored hashAggReplaceMode in #3368, extending its ability to express complicated aggregation patterns. However, the change made these patterns harder to understand. As @abellina suggested, it would be helpful to list all of the typical aggregation patterns for the different Spark versions/flavors, along with descriptions and illustrations.

@sperlingxx sperlingxx added feature request New feature or request ? - Needs Triage Need team to review and classify labels Sep 10, 2021
@sameerz
Collaborator

sameerz commented Sep 14, 2021

This should be resolved when #3194 is resolved. If #3194 does not get resolved in a timely fashion, we should come back to this and address it.

@sameerz sameerz added task Work required that improves the product but is not user facing and removed feature request New feature or request ? - Needs Triage Need team to review and classify labels Sep 14, 2021
@abellina
Collaborator

abellina commented Sep 14, 2021

> This should be resolved when #3194 is resolved. If #3194 does not get resolved in a timely fashion, we should come back to this and address it.

This issue is orthogonal to #3194. The patterns that @sperlingxx is talking about here are patterns to denote what an aggregate exec will look like, for the tests only. For example, Databricks may use the Complete mode in some cases where Apache Spark treats the aggregate differently, and if we are trying to test that the GPU aggregate can consume CPU input and produce CPU-compatible output, we need to be able to address each flavor of the hash aggregate plans. For example (taken from one of the tests @sperlingxx had):

```python
_replace_modes_single_distinct = [
    # Spark: CPU -> CPU -> GPU(PartialMerge) -> GPU(Partial)
    # Databricks runtime: CPU(Final and Complete) -> GPU(PartialMerge)
    'partial|partialMerge',
    # Spark: GPU(Final) -> GPU(PartialMerge&Partial) -> CPU(PartialMerge) -> CPU(Partial)
    # Databricks runtime: GPU(Final&Complete) -> CPU(PartialMerge)
    'final|partialMerge&partial|final&complete',
]
```
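Inferring from the examples above, the pattern strings seem to follow a simple grammar: `|` separates the modes of successive aggregate stages, and `&` joins modes that appear together within a single stage. A small hypothetical helper (not part of the plugin; `parse_replace_mode` is an illustrative name) could make that structure explicit:

```python
def parse_replace_mode(pattern):
    """Split a hashAggReplaceMode pattern into stages and modes.

    Hypothetical sketch assuming the grammar inferred from the examples:
    '|' separates aggregate stages, '&' joins modes within one stage.
    """
    return [stage.split('&') for stage in pattern.split('|')]

# 'partial|partialMerge' -> [['partial'], ['partialMerge']]
print(parse_replace_mode('final|partialMerge&partial|final&complete'))
```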

So in this case, we want to keep on the GPU:

  • First case: PartialMerge (Databricks) and Partial (Apache Spark)
  • Second case: Final or PartialMerge&Partial (Apache Spark), and Final&Complete (Databricks).

The rest of the aggregate executes on the CPU, which is great: the tests can then show we stay compatible when part of the plan needs to execute on the CPU due to some operation we don't support yet.

The patterns are a bit convoluted here, and you have to go through the comments to understand what's going on. The proposal is to at least try to associate each pattern with a flavor of Spark, but ideally we can find some common patterns that can be prebaked and documented, so we don't have to read a bunch of comments each time.
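One way the "prebaked" idea could look (a hypothetical sketch, not an existing API; the dictionary, names, and `replace_mode` helper are all illustrative) is to give each recurring pattern a descriptive name so tests reference the name instead of repeating a commented mode string:

```python
# Hypothetical mapping of descriptive names to raw hashAggReplaceMode
# strings, using the two patterns from the test above. Documentation for
# each name would live next to this table rather than in every test.
NAMED_REPLACE_MODES = {
    'single_distinct_partial_on_gpu': 'partial|partialMerge',
    'single_distinct_final_on_gpu': 'final|partialMerge&partial|final&complete',
}

def replace_mode(name):
    """Look up a documented replace-mode pattern by name (illustrative)."""
    return NAMED_REPLACE_MODES[name]

print(replace_mode('single_distinct_partial_on_gpu'))
```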

@GaryShen2008 GaryShen2008 added the P1 Nice to have for release label Nov 10, 2021