[FEA] Support HashAggregate on struct and nested struct #2877

viadea · 2021-07-07T01:31:34Z

Is your feature request related to a problem? Please describe.
Support HashAggregate on struct and nested struct.

Currently a group-by on a struct column cannot run on the GPU:

  !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
    @Partitioning <HashPartitioning> could run on GPU
      @Expression <AttributeReference> name#16 could run on GPU
    !Exec <HashAggregateExec> cannot run on GPU because Nested types in grouping expressions are not supported
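
For anyone who wants to see the same kind of fallback report, here is a minimal configuration sketch. It assumes a spark-shell session that already has the RAPIDS Accelerator plugin loaded and relies on the plugin's `spark.rapids.sql.explain` setting; the exact wording of the report may differ between plugin versions.

```scala
// Assumption: `spark` is the SparkSession provided by spark-shell and the
// RAPIDS Accelerator plugin is enabled for this session.
// NOT_ON_GPU asks the plugin to report only the operators that stay on the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

// Any query planned after this point should have its CPU fallbacks reported
// in the driver output, in the "!Exec <...> cannot run on GPU because ..." form.
```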

Describe the solution you'd like
GpuHashAggregate should work on struct and nested struct grouping keys.

Describe alternatives you've considered

Additional context


revans2 · 2021-07-07T13:17:57Z

I would like to get some clarification on the requirements.

Is this only for the group-by keys, or are there specific aggregations that also need to support nested structs?
What types are allowed in the structs? Specifically, are Lists/Arrays allowed inside the structs? If so, this is going to be much harder, because cudf will likely push back on that.
Do you need a single-level struct, or do you need multiple levels of nesting?
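
To make the last two questions concrete, here is a rough sketch of how they could be phrased as checks over a key type. `ColType`, `Primitive`, `ListOf`, `StructOf`, and `KeyTypeChecks` are hypothetical names invented for illustration; they are a simplified model and not the Spark, plugin, or cudf type APIs.

```scala
// Hypothetical, simplified model of column types (not Spark's or cudf's API).
sealed trait ColType
case object Primitive extends ColType
final case class ListOf(element: ColType) extends ColType
final case class StructOf(fields: List[ColType]) extends ColType

object KeyTypeChecks {
  // True if a list/array appears anywhere inside the type.
  def containsList(t: ColType): Boolean = t match {
    case Primitive        => false
    case ListOf(_)        => true
    case StructOf(fields) => fields.exists(containsList)
  }

  // Number of struct levels: 0 for a primitive, 1 for a flat struct,
  // 2 for a struct that contains a flat struct, and so on.
  def structDepth(t: ColType): Int = t match {
    case Primitive        => 0
    case ListOf(element)  => structDepth(element)
    case StructOf(fields) => 1 + fields.map(structDepth).foldLeft(0)(_ max _)
  }
}
```

Under this model, a struct of three strings has depth 1 and contains no list, while a struct that wraps another struct has depth 2.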

viadea · 2021-07-07T17:14:25Z

I would like to get some clarification on the requirements.

Is this only for the group-by keys, or are there specific aggregations that also need to support nested structs?
What types are allowed in the structs? Specifically, are Lists/Arrays allowed inside the structs? If so, this is going to be much harder, because cudf will likely push back on that.
Do you need a single-level struct, or do you need multiple levels of nesting?

Here are two scenarios that trigger HashAggregate on a struct and/or a nested struct.

Union on a single-level struct and on a nested (two-level) struct
In the 21.08 snapshot, GpuUnion works on both struct and nested struct columns per my tests.
However, because UNION de-duplicates its output, a HashAggregate runs after the GpuUnion, and that HashAggregate falls back to the CPU.
So the first use case is Union, and the minimal reproduction is:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
    Row(Row("Adam ","","Green"),"1","M",1000),
    Row(Row("Bob ","Middle","Green"),"2","M",2000),
    Row(Row("Cathy ","","Green"),"3","F",3000)
)

val schema = (new StructType()
  .add("name",new StructType()
    .add("firstname",StringType)
    .add("middlename",StringType)
    .add("lastname",StringType)) 
  .add("id",StringType)
  .add("gender",StringType)
  .add("salary",IntegerType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")

val df2 = spark.read.parquet("/tmp/testparquet")
val df3 = spark.read.parquet("/tmp/testparquet")

df2.createOrReplaceTempView("df2")
df3.createOrReplaceTempView("df3")

val nested_struct_df=spark.sql("select struct(name, struct(name.firstname, name.lastname) as newname) as col,name from df2")
nested_struct_df.createOrReplaceTempView("nested_struct_df")

//Single-level union falls back on CPU due to HashAggregate
val single_level_struct_union_df=spark.sql("select name from df2 union select name from df3")
single_level_struct_union_df.show
single_level_struct_union_df.explain

//Two-level union falls back on CPU due to HashAggregate
val two_level_struct_union_df=spark.sql("select col from nested_struct_df union select col from nested_struct_df")
two_level_struct_union_df.show
two_level_struct_union_df.explain
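
As a rough illustration of why the union ends up in a HashAggregate, the sketch below models de-duplication as a group-by on the whole row with no aggregate functions. It is plain Scala rather than Spark code: the case classes `Name`, `NewName`, and `Col` only mirror the shape of the columns in the repro, and `UnionAsGroupBy` is a hypothetical helper, not part of Spark or the plugin.

```scala
// Plain-Scala stand-ins for the struct columns in the repro (not Spark rows).
final case class Name(firstname: String, middlename: String, lastname: String)
final case class NewName(firstname: String, lastname: String)
final case class Col(name: Name, newname: NewName) // two-level struct

object UnionAsGroupBy {
  // UNION with de-duplication behaves like UNION ALL followed by a group-by
  // on every output column, keeping one row per group. The grouping key is
  // the struct itself, so the key must be hashable and comparable as a whole.
  def unionDistinct[K](left: Seq[K], right: Seq[K]): Set[K] =
    (left ++ right).groupBy(identity).keySet
}
```

Case classes compare and hash structurally, field by field and recursively, which is the same property the GPU hash aggregate needs from struct keys.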

HashAggregate on a single-level struct with an aggregate function such as count
The minimal reproduction is:

spark.sql("select count(*) from df2 group by name").show
spark.sql("select count(*) from df2 group by name").explain
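
For reference, the result this query is expected to produce can be sketched in plain Scala, with the struct used directly as the grouping key. `Name`, `Employee`, and `CountByStruct` are hypothetical names that mirror the repro schema; this is a model of the semantics, not of how the plugin or cudf implements the aggregation.

```scala
// Plain-Scala stand-ins for the repro schema (not Spark rows).
final case class Name(firstname: String, middlename: String, lastname: String)
final case class Employee(name: Name, id: String, gender: String, salary: Int)

object CountByStruct {
  // Equivalent in spirit to: select name, count(*) from df2 group by name
  def countByName(rows: Seq[Employee]): Map[Name, Long] =
    rows.groupBy(_.name).map { case (name, group) => name -> group.size.toLong }
}
```

With the three rows from the repro, each distinct name struct should map to a count of 1.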

mythrocks · 2021-08-09T17:14:10Z

I'm working on the libcudf side of this right now. I hope to have a PR for this later this week.

sameerz · 2021-08-27T05:02:09Z

Related libcudf PR: rapidsai/cudf#9024

viadea added the "feature request" and "? - Needs Triage" labels Jul 7, 2021

viadea changed the title ~~[FEA] Support HashAggregate on nested struct~~ [FEA] Support HashAggregate on struct and nested struct Jul 7, 2021

sameerz added the cudf_dependency label Jul 13, 2021

mythrocks self-assigned this Jul 14, 2021

Salonijain27 removed the "? - Needs Triage" label Jul 27, 2021

sperlingxx self-assigned this Aug 3, 2021

sperlingxx mentioned this issue Aug 31, 2021

Support HashAggregate on struct and nested struct #3354

Merged

revans2 closed this as completed in #3354 Sep 1, 2021

NVnavkumar mentioned this issue Feb 1, 2022

Allow GpuWindowExec to partition on structs #4673

Merged
