[FEA] concatenate array of strings #7727

Closed
revans2 opened this issue Mar 25, 2021 · 5 comments · Fixed by #7929
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Mar 25, 2021

Is your feature request related to a problem? Please describe.
We would like to support Spark's concat_ws function, which accepts any combination of strings and arrays of strings.

Describe the solution you'd like
cudf already offers a number of string concat APIs that can take a table of strings and concat them. What I would like is the ability to take a single column whose rows are arrays of strings and concatenate the strings within each row, just as the table APIs do across columns. With that and the existing table APIs, we should be able to build concat_ws.

Describe alternatives you've considered
There is no good alternative. The arrays can be variable length, so we cannot use any of the existing APIs, all of which assume a fixed number of inputs.

Additional context
Ideally we would want APIs that can either take a scalar string as the separator or a column_view of strings as the separator. If we could only get one of them, then the column_view version would be better.
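To make the two separator options concrete, here is a minimal host-side sketch in plain C++ (standard library only, not libcudf). The names `list_rows`, `join_row`, `join_lists_scalar_sep`, and `join_lists_column_sep` are hypothetical and exist only for illustration; nulls are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: a "list column of strings" is a vector of rows,
// each row being a variable-length vector of strings. Nulls are not modeled.
using list_rows = std::vector<std::vector<std::string>>;

// Join the strings of one row, placing `sep` between neighbours.
std::string join_row(std::vector<std::string> const& row, std::string const& sep)
{
  std::string out;
  for (std::size_t i = 0; i < row.size(); ++i) {
    if (i > 0) { out += sep; }
    out += row[i];
  }
  return out;
}

// Scalar-separator variant: the same separator is used for every row.
std::vector<std::string> join_lists_scalar_sep(list_rows const& rows, std::string const& sep)
{
  std::vector<std::string> result;
  result.reserve(rows.size());
  for (auto const& row : rows) { result.push_back(join_row(row, sep)); }
  return result;
}

// Column-separator variant: row i is joined with seps[i].
std::vector<std::string> join_lists_column_sep(list_rows const& rows,
                                               std::vector<std::string> const& seps)
{
  assert(seps.size() == rows.size());  // one separator per row
  std::vector<std::string> result;
  result.reserve(rows.size());
  for (std::size_t i = 0; i < rows.size(); ++i) {
    result.push_back(join_row(rows[i], seps[i]));
  }
  return result;
}
```

The scalar variant is the column variant with the same separator repeated for every row, which may be part of why the column_view version is the more useful one to have.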

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Mar 25, 2021
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Mar 25, 2021
@kkraus14
Collaborator

@revans2 any chance you could write out a basic example of what this operation does? I'm not quite following if this is an elementwise concatenation of the list elements that returns the same number of rows as the input, or similar to a Table concatenation.

@davidwendt
Contributor

Also, does this relate to #4728?

@ttnghia ttnghia changed the title [FEA] concatonate array of strings [FEA] concatenate array of strings Mar 26, 2021
@revans2
Contributor Author

revans2 commented Mar 26, 2021

OK, so here is a Spark example for concat_ws:

scala> df.selectExpr("concat_ws('-', str_arr, just_str) as together", "str_arr", "just_str").show
+--------+---------+--------+
|together|  str_arr|just_str|
+--------+---------+--------+
| a-b-c-1|[a, b, c]|       1|
|     a-2|      [a]|       2|
+--------+---------+--------+

The first parameter to concat_ws is the separator string. All of the other parameters are concatenated together into an output string. If one of the parameters is an array/list of strings, then its strings are each pulled out (similar to a flat map) and treated like params to a regular concat.
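The flat-map behavior described above can be sketched for a single row in plain C++. This is only a model of the semantics, not Spark or cudf code; `concat_arg` and `concat_ws_row` are hypothetical names, and null handling is left out.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical model of one concat_ws argument for a single row:
// either a plain string or an array of strings.
using concat_arg = std::variant<std::string, std::vector<std::string>>;

// Flatten array arguments in order (like a flat map), then join
// everything with the separator.
std::string concat_ws_row(std::string const& sep, std::vector<concat_arg> const& args)
{
  std::vector<std::string> flat;
  for (auto const& arg : args) {
    if (auto const* s = std::get_if<std::string>(&arg)) {
      flat.push_back(*s);  // a plain string contributes itself
    } else {
      auto const& arr = std::get<std::vector<std::string>>(arg);
      flat.insert(flat.end(), arr.begin(), arr.end());  // an array contributes each element
    }
  }

  std::string out;
  for (std::size_t i = 0; i < flat.size(); ++i) {
    if (i > 0) { out += sep; }
    out += flat[i];
  }
  return out;
}
```

Feeding it the two rows from the table above (`[a, b, c]` with `1`, and `[a]` with `2`) and the separator `-` reproduces the `together` column.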

If you want to update cudf::strings::concatenate to act just like concat_ws, that is fine, but I don't know if others have similar requirements. I would be happy with an API along these lines:

std::unique_ptr<column> concatenate(
  lists_column_view const& strings_list,
  strings_column_view const& separators,
  string_scalar const& separator_narep = string_scalar("", false),
  string_scalar const& col_narep       = string_scalar("", false),
  rmm::mr::device_memory_resource* mr  = rmm::mr::get_current_device_resource());

and possibly

std::unique_ptr<column> concatenate(
  lists_column_view const& strings_list,
  string_scalar const& separator      = string_scalar(""),
  string_scalar const& narep          = string_scalar("", false),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

These would act exactly like the existing cudf::strings::concatenate APIs, except that they take a column of lists of strings as input instead of a table. Then I can walk through all of the parameters passed to concat_ws, convert each list-of-strings parameter into a single strings column using the new API, and finally use the existing APIs to concatenate everything together for the final output.
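As a rough illustration of that two-step plan, here is a host-side sketch in plain C++. Both functions are hypothetical stand-ins: `join_list_elements` plays the role of the requested list API, and `concatenate_columns` plays the role of the existing table-style concatenate. Neither is the libcudf signature, and null or empty rows are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using string_column = std::vector<std::string>;               // one string per row
using list_column   = std::vector<std::vector<std::string>>;  // one list of strings per row

// Step 1 (stand-in for the requested API): collapse each list row into one string.
string_column join_list_elements(list_column const& lists, std::string const& sep)
{
  string_column result;
  result.reserve(lists.size());
  for (auto const& row : lists) {
    std::string out;
    for (std::size_t i = 0; i < row.size(); ++i) {
      if (i > 0) { out += sep; }
      out += row[i];
    }
    result.push_back(out);
  }
  return result;
}

// Step 2 (stand-in for the existing table API): concatenate row-wise across columns.
string_column concatenate_columns(std::vector<string_column> const& columns,
                                  std::string const& sep)
{
  if (columns.empty()) { return {}; }
  std::size_t const num_rows = columns.front().size();
  string_column result(num_rows);
  for (std::size_t c = 0; c < columns.size(); ++c) {
    assert(columns[c].size() == num_rows);  // all columns must have the same row count
    for (std::size_t r = 0; r < num_rows; ++r) {
      if (c > 0) { result[r] += sep; }
      result[r] += columns[c][r];
    }
  }
  return result;
}
```

With `str_arr` and `just_str` from the Spark example, step 1 turns `str_arr` into `["a-b-c", "a"]`, and step 2 joins that with `just_str` to give `["a-b-c-1", "a-2"]`. An empty list row would leave a stray separator in this sketch, so a real implementation would need a rule for empty and null rows.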

#4728 is different, but also kind of related. That is a requirement we have to cast a struct to a string. The oddness there is that Spark has two different modes for this: in one case it inserts a "null" for null values, and in the other it inserts an empty string. @rwlee knows the details of what is needed better than I do. It is related because at some point we will have to be able to cast an array to a string, and we will start to run into similar situations where this request and #4728 overlap.

@davidwendt
Contributor

I'm having trouble mapping the example to the proposed API.
Is the first parameter here supposed to be a vector of lists_column_view perhaps?

std::unique_ptr<column> concatenate(
  lists_column_view const& strings_list,
  string_scalar const& separator      = string_scalar(""),
  string_scalar const& narep          = string_scalar("", false),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

Where is the just_str parameter from the example? Is it another lists_column_view?

@ttnghia
Contributor

ttnghia commented Apr 1, 2021

Hi David. The first parameter is just a lists_column_view, which is a column of lists of strings: each row is a list of multiple strings. Each row of just_str can then be supplied as one more entry in the corresponding list row.

rapids-bot bot pushed a commit that referenced this issue Apr 26, 2021
Given a lists column of strings (each row is a list of strings), this PR facilitates the concatenation of strings within each list.

For example:

```
s = [ {'aa', 'bb', 'cc'}, null, {null, 'dd'}, {'ee', 'ff'} ]
r = strings::concatenate_list_elements(s, '+++') 
r is ['aa+++bb+++cc', null, null, 'ee+++ff']
```

This PR is similar to Spark's `concat_ws`, and closes #7727.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #7929
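The null behavior in the commit's example can be modeled on the host with `std::optional`. This is a sketch under the assumption, inferred only from that example, that a null list row or any null element makes the output row null when no replacement string is given; `concatenate_list_elements_model` and the type aliases are hypothetical names, not the libcudf API.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using opt_string = std::optional<std::string>;              // nullable string
using opt_list   = std::optional<std::vector<opt_string>>;  // nullable list of nullable strings

// Join each list row with `sep`. Assumed rule (from the example only):
// a null row, or a row containing any null element, yields a null result.
std::vector<opt_string> concatenate_list_elements_model(std::vector<opt_list> const& rows,
                                                        std::string const& sep)
{
  std::vector<opt_string> result;
  result.reserve(rows.size());
  for (auto const& row : rows) {
    if (!row) {  // the whole list is null
      result.push_back(std::nullopt);
      continue;
    }
    bool has_null = false;
    std::string out;
    for (std::size_t i = 0; i < row->size(); ++i) {
      auto const& elem = (*row)[i];
      if (!elem) {  // a null element nullifies the row
        has_null = true;
        break;
      }
      if (i > 0) { out += sep; }
      out += *elem;
    }
    if (has_null) {
      result.push_back(std::nullopt);
    } else {
      result.push_back(out);
    }
  }
  return result;
}
```

Run on the four rows of `s` from the example with the separator `+++`, the model returns the same `r`: two joined strings and two nulls.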