
[FEA] provide typical aggregation patterns for different spark version/flavor #3437

Open
sperlingxx opened this issue Sep 10, 2021 · 2 comments
Labels
P1 Nice to have for release task Work required that improves the product but is not user facing

Comments

@sperlingxx
Collaborator

Is your feature request related to a problem? Please describe.
We refactored hashAggReplaceMode in #3368, extending its ability to express complicated aggregation patterns. However, the change made these patterns harder to understand. As @abellina suggested, it would be helpful to list all of the typical aggregation patterns for the different Spark versions/flavors, along with descriptions and illustrations.

@sperlingxx sperlingxx added feature request New feature or request ? - Needs Triage Need team to review and classify labels Sep 10, 2021
@sameerz
Collaborator

sameerz commented Sep 14, 2021

This should be resolved when #3194 is resolved. If #3194 does not get resolved in a timely fashion, we should come back to this and address it.

@sameerz sameerz added task Work required that improves the product but is not user facing and removed feature request New feature or request ? - Needs Triage Need team to review and classify labels Sep 14, 2021
@abellina
Collaborator

abellina commented Sep 14, 2021

> This should be resolved when #3194 is resolved. If #3194 does not get resolved in a timely fashion, we should come back to this and address it.

This issue is orthogonal to #3194. The patterns that @sperlingxx is talking about here are patterns to denote what an aggregate exec will look like, for the tests only. For example, Databricks may use the Complete mode in some cases where Apache Spark treats the aggregate differently, and if we are trying to test that the GPU aggregate can consume CPU input and produce CPU-compatible output, we need to be able to address each flavor of the hash aggregate plans. For example (taken from one of the tests @sperlingxx had):

```python
_replace_modes_single_distinct = [
    # Spark: CPU -> CPU -> GPU(PartialMerge) -> GPU(Partial)
    # Databricks runtime: CPU(Final and Complete) -> GPU(PartialMerge)
    'partial|partialMerge',
    # Spark: GPU(Final) -> GPU(PartialMerge&Partial) -> CPU(PartialMerge) -> CPU(Partial)
    # Databricks runtime: GPU(Final&Complete) -> CPU(PartialMerge)
    'final|partialMerge&partial|final&complete',
]
```
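Inferring from the examples above, the pattern strings seem to follow a simple grammar: `|` separates the modes of successive aggregate stages, and `&` joins modes that appear together within a single stage. A small hypothetical helper (not part of the plugin; `parse_replace_mode` is an illustrative name) could make that structure explicit:

```python
def parse_replace_mode(pattern):
    """Split a hashAggReplaceMode pattern into stages and modes.

    Hypothetical sketch assuming the grammar inferred from the examples:
    '|' separates aggregate stages, '&' joins modes within one stage.
    """
    return [stage.split('&') for stage in pattern.split('|')]

# 'partial|partialMerge' -> [['partial'], ['partialMerge']]
print(parse_replace_mode('final|partialMerge&partial|final&complete'))
```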

So in this case, we want to keep on the GPU:

  • First case: PartialMerge (Databricks) and Partial (Apache Spark)
  • Second case: Final or PartialMerge&Partial (Apache Spark), and Final&Complete (Databricks).

The rest of the aggregate executes on the CPU, which is great: the tests can then show we stay compatible when part of the plan needs to execute on the CPU due to some operation we don't support yet.

The patterns are a bit convoluted here, and you have to go through the comments to understand what's going on. The proposal is to at least try to associate each pattern with a flavor of Spark, but ideally we can find some common patterns that can be prebaked and documented, so we don't have to read a bunch of comments each time.
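One way the "prebaked" idea could look (a hypothetical sketch, not an existing API; the dictionary, names, and `replace_mode` helper are all illustrative) is to give each recurring pattern a descriptive name so tests reference the name instead of repeating a commented mode string:

```python
# Hypothetical mapping of descriptive names to raw hashAggReplaceMode
# strings, using the two patterns from the test above. Documentation for
# each name would live next to this table rather than in every test.
NAMED_REPLACE_MODES = {
    'single_distinct_partial_on_gpu': 'partial|partialMerge',
    'single_distinct_final_on_gpu': 'final|partialMerge&partial|final&complete',
}

def replace_mode(name):
    """Look up a documented replace-mode pattern by name (illustrative)."""
    return NAMED_REPLACE_MODES[name]

print(replace_mode('single_distinct_partial_on_gpu'))
```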

@GaryShen2008 GaryShen2008 added the P1 Nice to have for release label Nov 10, 2021