Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support HashAggregate on struct and nested struct #2877

Closed
viadea opened this issue Jul 7, 2021 · 4 comments · Fixed by #3354
Closed

[FEA] Support HashAggregate on struct and nested struct #2877

viadea opened this issue Jul 7, 2021 · 4 comments · Fixed by #3354
Assignees
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request

Comments

@viadea
Copy link
Collaborator

viadea commented Jul 7, 2021

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I wish the RAPIDS Accelerator for Apache Spark would [...]
Support HashAggregate on struct and nested struct.

Currently a group-by on struct can not work on GPU:

  !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
    @Partitioning <HashPartitioning> could run on GPU
      @Expression <AttributeReference> name#16 could run on GPU
    !Exec <HashAggregateExec> cannot run on GPU because Nested types in grouping expressions are not supported

Describe the solution you'd like
A clear and concise description of what you want to happen.
GpuHashAggregate can work on nested struct.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context, code examples, or references to existing implementations about the feature request here.

@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Jul 7, 2021
@viadea viadea changed the title [FEA] Support HashAggregate on nested struct [FEA] Support HashAggregate on struct and nested struct Jul 7, 2021
@revans2
Copy link
Collaborator

revans2 commented Jul 7, 2021

I would like to get some clarification on the requirements.

Is this only for the group by keys, or are there requirements for specific aggregations that also need to support nested structs.
What types are allowed in the structs? Specifically are Lists/Arrays allowed to be in the structs? If so then this is going to be much harder, because cudf will likely push back on that.
Do you need a single level for a struct or do you need multiple level deep?

@viadea
Copy link
Collaborator Author

viadea commented Jul 7, 2021

I would like to get some clarification on the requirements.

Is this only for the group by keys, or are there requirements for specific aggregations that also need to support nested structs.
What types are allowed in the structs? Specifically are Lists/Arrays allowed to be in the structs? If so then this is going to be much harder, because cudf will likely push back on that.
Do you need a single level for a struct or do you need multiple level deep?

Here are 2 scenarios to trigger this HashAggregate on struct and/or nested struct.

  1. Union on single-level struct + nested struct(2-levels)
    Currently in 21.08 snapshot, GpuUnion does work on struct and also nested struct per my tests.
    However since there is a Union will do de-duplicate, so HashAggregate after the GpuUnion will fallback on CPU.
    So the firstly use case is for Union, and the the minimum reproduce is:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
    Row(Row("Adam ","","Green"),"1","M",1000),
    Row(Row("Bob ","Middle","Green"),"2","M",2000),
    Row(Row("Cathy ","","Green"),"3","F",3000)
)

val schema = (new StructType()
  .add("name",new StructType()
    .add("firstname",StringType)
    .add("middlename",StringType)
    .add("lastname",StringType)) 
  .add("id",StringType)
  .add("gender",StringType)
  .add("salary",IntegerType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")

val df2 = spark.read.parquet("/tmp/testparquet")
val df3 = spark.read.parquet("/tmp/testparquet")

df2.createOrReplaceTempView("df2")
df3.createOrReplaceTempView("df3")

val nested_struct_df=spark.sql("select struct(name, struct(name.firstname, name.lastname) as newname) as col,name from df2")
nested_struct_df.createOrReplaceTempView("nested_struct_df")

//Single-level union falls back on CPU due to HashAggregate
val single_level_struct_union_df=spark.sql("select name from df2 union select name from df3")
single_level_struct_union_df.show
single_level_struct_union_df.explain

//Two-level union falls back on CPU due to HashAggregate
val two_level_struct_union_df=spark.sql("select col from nested_struct_df union select col from nested_struct_df")
two_level_struct_union_df.show
two_level_struct_union_df.explain
  1. HashAggregate on the single-level struct with a function as count
    Minimum reproduce is:
spark.sql("select count(*) from df2 group by name").show
spark.sql("select count(*) from df2 group by name").explain

@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Jul 13, 2021
@mythrocks mythrocks self-assigned this Jul 14, 2021
@Salonijain27 Salonijain27 removed the ? - Needs Triage Need team to review and classify label Jul 27, 2021
@sperlingxx sperlingxx self-assigned this Aug 3, 2021
@mythrocks
Copy link
Collaborator

I'm working on the libcudf side of this right now. I hope to have a PR for this later this week.

@sameerz
Copy link
Collaborator

sameerz commented Aug 27, 2021

Related libcudf PR: rapidsai/cudf#9024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants