[FEA] Support sorting on lists of Strings #10184
Now this looks like a duplicate of #5890.
Yes, this is a special case of #5890.
This issue has been labeled.
Still needed.
This issue has been labeled.
Still needed.
You can do this by calling existing libcudf functions. It appears the sort process described in the description is the same as for sorting strings so I think artificially creating a strings column using the lists column's offsets and calling sort would give the correct results. The basic idea is to build a fake column of strings to sort and use that result to build the output lists column.
The fake strings column can be built from the lists column's offsets and the strings child's chars. Sorting that fake column produces an ordering, and that ordering is then used to build the output lists column.
Assuming a one-level-deep lists column and no null handling (as mentioned in the description), here is some example code that I think achieves this:
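The original snippet did not survive extraction. As a stand-in, here is a hedged plain-Python sketch of the same idea, with the lists column modeled as Python lists rather than libcudf structures (all names below are illustrative, not libcudf APIs):

```python
def sort_lists_of_strings(rows):
    """Sort rows of a one-level-deep lists-of-strings column, no nulls.

    Mirrors the approach above: build one 'fake' string per row by
    concatenating the row's characters (what the lists offsets over the
    strings chars would give you), sort those, then gather the rows
    into the output lists column.
    """
    fake = ["".join(row) for row in rows]                    # fake strings column
    order = sorted(range(len(rows)), key=lambda i: fake[i])  # sorted order
    return [rows[i] for i in order]                          # gather

rows = [["b", "c"], ["a"], ["a", "b"]]
print(sort_lists_of_strings(rows))  # [['a'], ['a', 'b'], ['b', 'c']]
```

Note that this sketch leaves ties unresolved between rows whose characters concatenate identically (e.g. `["ab"]` vs `["a", "b"]`).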
I believe this was solved by #11129.
Is your feature request related to a problem? Please describe.
This is from #10181.
Spark's sorting is similar to what we currently do for sorting structs. It also has the same problem in that Spark lets you define at the top level if NULLs come first or last, but then ignores it for nested types. It is okay if the null ordering is consistent at all levels, just like with structs. But that means we will fall back to the CPU if the user requests an ordering that we cannot guarantee will be sorted properly.
So, ignoring the null ordering, Spark first compares the elements to each other in list order, from the first element up to the minimum length of the two lists being compared. As soon as it finds a pair of elements that are not equal, that pair determines the ordering of the two lists. If all elements are equal up to the minimum length, the list lengths decide, and longer lists sort after shorter ones.
For reference, the following code is used for doing this comparison. Please note that in Spark a LIST is called an Array: https://github.com/apache/spark/blob/66b9087233bc0cb90cf3af07ec34ba74b9c32d5b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ArrayType.scala#L116-L149
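As a hedged illustration (not Spark's actual code), the comparison described above can be sketched in Python:

```python
from functools import cmp_to_key

def compare_arrays(a, b):
    """Compare element-wise up to the shorter length, then by length."""
    for x, y in zip(a, b):          # only up to the min length
        if x != y:
            return -1 if x < y else 1
    # All shared elements equal: shorter lists sort before longer ones.
    return (len(a) > len(b)) - (len(a) < len(b))

rows = [["b"], ["a", "c"], [], ["a"]]
rows.sort(key=cmp_to_key(compare_arrays))
print(rows)  # [[], ['a'], ['a', 'c'], ['b']]
```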
For example, the following data is sorted ascending (note that [] is an empty list):
and the same data descending:
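The example rows themselves did not survive extraction, but a small hypothetical dataset shows both orderings. Python's built-in list comparison happens to match the ordering described above (element-wise, then by length), so plain `sorted()` is enough to demonstrate it:

```python
# Hypothetical example data; the original issue's rows were lost.
data = [["b"], [], ["a", "c", "d"], ["a"], ["a", "c"]]

ascending = sorted(data)
descending = sorted(data, reverse=True)

print(ascending)   # [[], ['a'], ['a', 'c'], ['a', 'c', 'd'], ['b']]
print(descending)  # [['b'], ['a', 'c', 'd'], ['a', 'c'], ['a'], []]
```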
Describe the solution you'd like
Ideally we would like to both sort the data and merge-sort already-sorted data. Merge sorting is lower priority, and we can work around its absence.