Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] support ConcatWs sql function #63

Closed
revans2 opened this issue May 29, 2020 · 8 comments · Fixed by #2479
Closed

[FEA] support ConcatWs sql function #63

revans2 opened this issue May 29, 2020 · 8 comments · Fixed by #2479
Assignees
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request P0 Must have for release SQL part of the SQL/Dataframe plugin

Comments

@revans2
Copy link
Collaborator

revans2 commented May 29, 2020

Is your feature request related to a problem? Please describe.
it would be great to support the concat_ws SQL function

rapidsai/cudf#3726 was filed to get support from cudf.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify SQL part of the SQL/Dataframe plugin labels May 29, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Sep 1, 2020
@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Feb 18, 2021
@sameerz
Copy link
Collaborator

sameerz commented Mar 30, 2021

Follow up work on cudf, concatenating arrays of strings: rapidsai/cudf#7727

@razajafri razajafri self-assigned this Apr 9, 2021
@sameerz sameerz added this to the Apr 12 - Apr 23 milestone Apr 9, 2021
@sameerz sameerz added the P0 Must have for release label Apr 27, 2021
@tgravescs tgravescs self-assigned this Apr 28, 2021
@tgravescs
Copy link
Collaborator

tgravescs commented May 13, 2021

The Spark behavior of concatws:

Separator parameter Api differences:

  • SQL can pass column of strings as separator
  • python and scala api's, only string as separator

Null Behavior:

  • if separator is NULL, columns are null
  • if column value is null it leaves off separator and value
  • if passing in array of all nulls, then its left out without separator
  • if array contains a null, it skips it and leaves separator off -> *** this seems different then cudf (https://github.com/rapidsai/cudf/pull/7929/files) but need to see if can work around***
  • if all the values of a row are null, it returns an empty string -> this seems different then cudf behavior

Behavior:

  • 0 input columns with separate specified return column of empty strings, except if separator null, then value is null.
  • 1 column just return column unless Null separator then need column of null match number of rows
  • if column value is empty is puts on separator
  • if passing in empty array, then its left off without separator
  • if passing array with empty strings, separator included
  • 2 or more columns add in separator between column values
  • For SQL if column specified as separator, the separator in the corresponding row is used
  • Empty Array would normally return empty string, but it appears it actually special if joining with something else, it leaves off the separator for that empty array, comment below shows example

@tgravescs
Copy link
Collaborator

tgravescs commented May 13, 2021

Note that concat and concat_ws have different behavior for nulls when all rows are null:

+-------+
|nullcol|
+-------+
|   null|
|notnull|
+-------+

>>> spark.sql("select concat_ws('-', nullcol, nullcol) as res from df").show(truncate=False)
+---------------+
|res            |
+---------------+
|               |
|notnull-notnull|
+---------------+

>>> spark.sql("select concat(nullcol, nullcol) as res from df").show(truncate=False)
+--------------+
|res           |
+--------------+
|null          |
|notnullnotnull|
+--------------+

>>> spark.sql("select concat(null, d) as res from df").show(truncate=False)
+----+
|res |
+----+
|null|
|null|
+----+

@tgravescs
Copy link
Collaborator

tgravescs commented May 14, 2021

similar cudf behavior for arrays with nulls doesn't match Spark, cudf will put null if any elements in array null, spark skips them. if all nulls, then get empty string

>>> res = spark.sql("select concat_ws('-', array(1, 2, null, 4)) as res from df")
>>> res.show()
+-----+
|  res|
+-----+
|1-2-4|
|1-2-4|
+-----+

>>> spark.sql("select concat_ws('-', array(null, null)) as res from df").show(truncate=False)
+---+
|res|
+---+
|   |
|   |
+---+

array handling for concat is different:

>>> spark.sql("select concat(array(null, 1), array(d)) as res from df").show(truncate=False)
+---------------+
|res            |
+---------------+
|[null, 1, 123] |
|[null, 1, 1234]|
+---------------+

@tgravescs
Copy link
Collaborator

I discovered another weird case with the CPU where is you are concatenating an empty array and then another value, it leaves off the separator:

+--------------+
|      arrnames|
+--------------+
|            []|
|[Alice2, Bob2]|
+--------------+

>>> spark.conf.set("spark.rapids.sql.enabled", "false")
>>> res = df.select(concat_ws('-', df.arrnames, lit('z')))
>>> res.show()
+-------------------------+
|concat_ws(-, arrnames, z)|
+-------------------------+
|                        z|
|            Alice2-Bob2-z|
+-------------------------+

@tgravescs
Copy link
Collaborator

another example of arrays with null in middle:

+--------------+
|      arrnames|
+--------------+
|[a, null, ccc]|
|[Alice2, Bob2]|
+--------------+

res = df.select(concat_ws('***',df.arrnames).alias('s'))

+-------------+
|            s|
+-------------+
|      a***ccc|
|Alice2***Bob2|
+-------------+

@tgravescs
Copy link
Collaborator

tgravescs commented May 19, 2021

example of spark with array of nulls, doesn't matter how many nulls in array, as long as all nulls, spark skips it and leaves off separator.

+-------+
|   name|
+-------+
|beatles|
|  romeo|
+-------+

dfnew.select(concat_ws("-", array(lit(null)), col("name"), lit('a')).alias("s")).show()
+---------+
|        s|
+---------+
|beatles-a|
|  romeo-a|
+---------+

@tgravescs
Copy link
Collaborator

cudf Java layer PR: rapidsai/cudf#8289

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request P0 Must have for release SQL part of the SQL/Dataframe plugin
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants