Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support First() in windowing context with Integer type #4005

Closed
viadea opened this issue Nov 2, 2021 · 8 comments
Closed

[FEA] Support First() in windowing context with Integer type #4005

viadea opened this issue Nov 2, 2021 · 8 comments
Assignees
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request P1 Nice to have for release

Comments

@viadea
Copy link
Collaborator

viadea commented Nov 2, 2021

Is your feature request related to a problem? Please describe.
This is a feature request to support first() function in windowing context with double type.

For example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val data = Seq(
    Row(Row("Adam ","","Green"),"1","M",1000, "2019-01-01",List("Java","Scala")),
    Row(Row("Bob ","Middle","Green"),"2","M",2000, "2019-01-02",List("Java","Python")),
    Row(Row("Cathy ","","Green"),"3","F",3000, "2019-01-03",List())
)

val schema = (new StructType()
  .add("name",new StructType()
    .add("firstname",StringType)
    .add("middlename",StringType)
    .add("lastname",StringType)) 
  .add("id",StringType)
  .add("gender",StringType)
  .add("salary",IntegerType)
  .add("birthdayStr",StringType)
  .add("language",ArrayType(StringType))
             )

val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.withColumn("birthday", to_date(col("birthdayStr"))).write.format("parquet").mode("overwrite").save("/tmp/testparquet")
val df2 = spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")
df2.printSchema

sql("""SELECT gender, first(salary,true) OVER (PARTITION BY gender ORDER BY salary) FROM df2""").collect


@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Nov 2, 2021
@Salonijain27 Salonijain27 added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Nov 2, 2021
@res-life res-life self-assigned this Nov 3, 2021
@viadea
Copy link
Collaborator Author

viadea commented Nov 4, 2021

Driver log message:

      !Expression <WindowExpression> first(salary#174, true) windowspecdefinition(gender#173, salary#174 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) cannot run on GPU because the type of orderBy column is not supported in a window range function, found DoubleType

@res-life res-life added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Nov 10, 2021
@res-life
Copy link
Collaborator

res-life commented Nov 10, 2021

Actually not support first function in window context.
CUDF does not support this currently, please see rapidsai/cudf#9643

@res-life
Copy link
Collaborator

@viadea Order by in window only supports int and timestamp, does not supports double float, it's a constraint of CUDF.
Like Count does not support double, you can test:
SELECT gender, count(salary) OVER (PARTITION BY gender ORDER BY salary) FROM df2
So this issue will be: Support First() in windowing context with int and timestamp.

@revans2
Copy link
Collaborator

revans2 commented Nov 10, 2021

All of these are going to require some CUDF changes to work.

nth_element_aggregation is not supported as a rolling_aggregation in cudf so we cannot implement it in a way that will be able to support null checks like the query above wants.

Cudf does not support range queries on floats or doubles yet. We can add this in though.

@viadea
Copy link
Collaborator Author

viadea commented Nov 16, 2021

@res-life I understand there are limitations on cuDF side. That is fine.
But the issue should not be changed because that is based on a customer's log analysis.
In that, we need to support double, that is the feature request.

@res-life
Copy link
Collaborator

res-life commented Jul 25, 2022

Note:
Cudf does not support range queries on floats or doubles yet.
Still depends on cuDF: #6000 (comment)

@res-life res-life changed the title [FEA] Support First() in windowing context with double type [FEA] Support First() in windowing context with Integer type Aug 2, 2022
@res-life
Copy link
Collaborator

res-life commented Aug 2, 2022

Split this issue into 2, one for Integer, one for Double #6192

@res-life
Copy link
Collaborator

res-life commented Aug 2, 2022

#6000 already supported Integer type.

@res-life res-life closed this as completed Aug 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request P1 Nice to have for release
Projects
None yet
Development

No branches or pull requests

4 participants