
[FEA] support ascii function #9585

Closed
nvliyuan opened this issue Oct 31, 2023 · 13 comments · Fixed by #10054
Labels
feature request New feature or request

Comments

@nvliyuan
Collaborator

I wish we could support the ascii function.

e.g.

scala> spark.sql("select ascii(col4) from datavalid3").show
23/10/31 09:34:43 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
    @Expression <Alias> cast(ascii(col4#85) as string) AS ascii(col4)#94 could run on GPU
      !Expression <Cast> cast(ascii(col4#85) as string) cannot run on GPU because Only UTC zone id is supported. Actual default zone id: Africa/Cairo; Only UTC zone id is supported. Actual session local zone id: Africa/Cairo
        ! <Ascii> ascii(col4#85) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.Ascii
          @Expression <AttributeReference> col4#85 could run on GPU
    *Exec <FileSourceScanExec> will run on GPU
@nvliyuan nvliyuan added feature request New feature or request ? - Needs Triage Need team to review and classify labels Oct 31, 2023
@revans2
Collaborator

revans2 commented Oct 31, 2023

ASCII is an interesting one. It returns the numeric value of the first character of a string, so it works for more than just ASCII encoded values, but whatever. The Spark code is fairly simple, with only a few corner cases: null => null, empty string => 0.

https://github.com/apache/spark/blob/102daf9d1490d12b812be4432c77ce102e82c3bb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2333-L2341
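For reference, a minimal sketch of the CPU semantics described above (illustrative only, not the Spark source linked here): null maps to null, an empty string to 0, and anything else to the Unicode code point of its first character.

// Rough sketch of the CPU behavior described above; not the actual Spark implementation.
def cpuAscii(s: String): Option[Int] =
  Option(s).map(str => if (str.isEmpty) 0 else str.codePointAt(0))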

The problem is on the cudf side.

Doing the substring is simple enough:

https://github.com/rapidsai/cudf/blob/b4746d8064198d6287c83aeeaa470b499ddd2e10/cpp/include/cudf/strings/slice.hpp#L61

But converting the remaining byte array to an int, meaning the equivalent of codePointAt(0), is something I don't know how to do with existing cudf functionality.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-

https://github.com/rapidsai/cudf/blob/b4746d8064198d6287c83aeeaa470b499ddd2e10/cpp/include/cudf/strings/detail/utf8.hpp#L155-L177

That is the code we would need to do it, but it is not exposed as a standalone kernel, and if we have to write a new kernel to support this, then we might as well write one that does everything we need, unless cudf has a need for something similar.

@revans2
Collaborator

revans2 commented Oct 31, 2023

I guess we could do our own UTF-8 to code point conversion using bit shifts, casts, comparisons, etc. to make it work, but that feels really, really slow and potentially error prone.
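For illustration, a hedged sketch of what that manual conversion could look like: decoding the first code point from a UTF-8 byte array with masks and shifts, assuming well-formed input and doing no bounds or validity checking.

// Decode the first Unicode code point from UTF-8 bytes using bit masks and shifts.
// Illustrative only; assumes well-formed UTF-8 and does no error handling.
def firstCodePoint(bytes: Array[Byte]): Int = {
  val b0 = bytes(0) & 0xFF
  if (b0 < 0x80) b0                                        // 1-byte sequence: 0xxxxxxx
  else if ((b0 & 0xE0) == 0xC0)                            // 2-byte sequence: 110xxxxx 10xxxxxx
    ((b0 & 0x1F) << 6) | (bytes(1) & 0x3F)
  else if ((b0 & 0xF0) == 0xE0)                            // 3-byte sequence: 1110xxxx ...
    ((b0 & 0x0F) << 12) | ((bytes(1) & 0x3F) << 6) | (bytes(2) & 0x3F)
  else                                                     // 4-byte sequence: 11110xxx ...
    ((b0 & 0x07) << 18) | ((bytes(1) & 0x3F) << 12) |
      ((bytes(2) & 0x3F) << 6) | (bytes(3) & 0x3F)
}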

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Oct 31, 2023
@davidwendt

Take a look at cudf::strings::code_points
Here is the test code for it: https://github.com/rapidsai/cudf/blob/b4746d8064198d6287c83aeeaa470b499ddd2e10/cpp/tests/strings/attrs_tests.cpp#L31
Perhaps after slicing the first character, this API could be used to get the code-point value you need?

@revans2
Collaborator

revans2 commented Nov 6, 2023

@davidwendt Thanks, I missed that one.

https://github.com/rapidsai/cudf/blob/70c4283dbd6700bba43440e07b293b2294ea6634/cpp/include/cudf/strings/attributes.hpp#L86

So it looks like we just need to put in some cudf JNI APIs for the code_points API.
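Roughly, the plugin-side usage might then look like the sketch below, assuming a JNI binding named codePoints() is added to ai.rapids.cudf.ColumnVector (the name is illustrative and does not exist at the time of writing), and ignoring corner cases such as empty strings.

// Hypothetical plugin-side sketch: slice the first character, then ask cudf for
// its code point value. codePoints() is an assumed JNI wrapper over
// cudf::strings::code_points, not an existing API.
import ai.rapids.cudf.ColumnVector

def gpuAsciiSketch(input: ColumnVector): ColumnVector = {
  val firstChar = input.substring(0, 1)   // first character of each string
  try {
    firstChar.codePoints()                // assumed binding, returns integer values
  } finally {
    firstChar.close()
  }
}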

@thirtiseven
Collaborator

code_points returns the UTF-8 encoding of each character as an integer, while Spark on CPU / Java's codePointAt returns Unicode code points. They are mismatched for any multi-byte character, so it seems that code_points cannot be used directly to fully match ascii.

I think fully matching it needs a custom kernel, but if we only want to support ASCII and Latin-1, we can support that quickly in the plugin.

In Databricks's doc, it says that:

If the first character is not an ASCII character or part of the Latin-1 Supplement range of UTF-16, the result is undefined.

In my tests, the behavior of Apache Spark and Databricks matches, which is Java codePointAt's behavior.

Here is a workaround to support ascii 0~255 only, without a custom kernel (a code sketch follows the table):

ascii      utf-8 dec      workaround
0~127      0~127          cudf::code_points
128~191    49792~49855    code_points - 49,664
192~255    50048~50111    code_points - 49,856
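A minimal sketch of that correction in code, mapping the integer that code_points would produce for the first character back to the 0~255 code point; the helper name is made up for illustration and is not a plugin API.

// Map the integer value of the first character's UTF-8 bytes (as code_points
// would report it) back to the 0~255 code point. Illustrative sketch only.
def utf8IntToAscii(v: Int): Int =
  if (v >= 0 && v <= 127) v                       // 1-byte UTF-8: value equals code point
  else if (v >= 49792 && v <= 49855) v - 49664    // 0xC280~0xC2BF -> 128~191
  else if (v >= 50048 && v <= 50111) v - 49856    // 0xC380~0xC3BF -> 192~255
  else throw new IllegalArgumentException(s"value $v is outside the supported 0~255 range")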

@thirtiseven
Collaborator

@nvliyuan, is it good enough for customers to just support ASCII 0~255 (ASCII and Latin-1 Supplement) as a first step?

@thirtiseven thirtiseven self-assigned this Dec 5, 2023
@revans2
Collaborator

revans2 commented Dec 5, 2023

@nvliyuan, is it good enough for customers to just support ASCII 0~255 (ASCII and Latin-1 Supplement) as a first step?

I am fine if we restrict the range of values supported, but it has to be off by default unless we can match the result in all cases. Databricks can say that some ranges are undefined behavior, but Apache Spark does not:

https://spark.apache.org/docs/latest/api/sql/index.html#ascii

And neither does Java.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-

So what we are saying is that Java's codePointAt API has undefined behavior for any character not in the range specified by Databricks?

@nvliyuan
Collaborator Author

nvliyuan commented Dec 6, 2023

@nvliyuan, is it good enough for customers to just support ASCII 0~255 (ASCII and Latin-1 Supplement) as a first step?

Confirmed with customers: ASCII 0~255 is good enough; for other characters we need to fall back rather than crash.

@thirtiseven
Collaborator

So what we are saying is that Java's codePointAt API has undefined behavior for any character not in the range specified by Databricks?

I think the doc is explaining that the result of ascii is real ASCII for 0~255, but outside that range it is not real ASCII (it is UTF-16 code points). Java's codePointAt should work fine even inside Databricks.

@nvliyuan
Collaborator Author

nvliyuan commented Dec 6, 2023

I believe a GPU job should always return the same result as a CPU run; otherwise it should be regarded as a bug? @revans2

@revans2
Collaborator

revans2 commented Dec 6, 2023

Confirmed with customers: ASCII 0~255 is good enough; for other characters we need to fall back rather than crash.

We cannot fall back for the others. We don't have enough information by the time that we start planning to be able to fall back. If we knew what the characters were at planning time we could just replace it with the result up front.

I believe a GPU job should always return the same result as a CPU run; otherwise it should be regarded as a bug?

If the ASCII operator is on by default, then it must do the same thing as the CPU. If it is off by default and a user has to opt into using it, then we can have some limitations, but I would really prefer for us to make it work exactly the same, even if we have to have a workaround like in #9585 (comment). We can improve the speed with a custom kernel in the future if we see that it is needed.

@nvliyuan
Collaborator Author

nvliyuan commented Dec 7, 2023

Thanks for the explanation, I will double-confirm and update.

@nvliyuan
Collaborator Author

nvliyuan commented Dec 8, 2023

@thirtiseven confirmed ASCII 0~255 is good enough, thx
