
[FEA] support ascii function #9585

Closed
nvliyuan opened this issue Oct 31, 2023 · 13 comments · Fixed by #10054
Labels
feature request New feature or request

Comments

@nvliyuan
Collaborator

I wish we could support the ascii function.

e.g.

scala> spark.sql("select ascii(col4) from datavalid3").show
23/10/31 09:34:43 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
    @Expression <Alias> cast(ascii(col4#85) as string) AS ascii(col4)#94 could run on GPU
      !Expression <Cast> cast(ascii(col4#85) as string) cannot run on GPU because Only UTC zone id is supported. Actual default zone id: Africa/Cairo; Only UTC zone id is supported. Actual session local zone id: Africa/Cairo
        ! <Ascii> ascii(col4#85) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.Ascii
          @Expression <AttributeReference> col4#85 could run on GPU
    *Exec <FileSourceScanExec> will run on GPU
@nvliyuan nvliyuan added feature request New feature or request ? - Needs Triage Need team to review and classify labels Oct 31, 2023
@revans2
Collaborator

revans2 commented Oct 31, 2023

ASCII is an interesting one. It returns the numeric value of the first character of a string, so it works for more than just ASCII encoded values, but whatever. The Spark code is fairly simple, with only a few corner cases: null => null, empty string => 0.

https://github.com/apache/spark/blob/102daf9d1490d12b812be4432c77ce102e82c3bb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2333-L2341
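For reference, a minimal sketch of the CPU semantics described above (illustrative only, not the Spark source linked here): null maps to null, an empty string to 0, and anything else to the Unicode code point of its first character.

// Rough sketch of the CPU behavior described above; not the actual Spark implementation.
def cpuAscii(s: String): Option[Int] =
  Option(s).map(str => if (str.isEmpty) 0 else str.codePointAt(0))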

The problem is on the cudf side.

Doing the substring is simple enough:

https://github.com/rapidsai/cudf/blob/b4746d8064198d6287c83aeeaa470b499ddd2e10/cpp/include/cudf/strings/slice.hpp#L61

But converting the remaining byte array to an int, meaning the equivalent of codePointAt(0), is something I don't know how to do with existing cudf functionality.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-

https://github.com/rapidsai/cudf/blob/b4746d8064198d6287c83aeeaa470b499ddd2e10/cpp/include/cudf/strings/detail/utf8.hpp#L155-L177

That is the code we would need to do it, but it is not exposed as a standalone kernel, and if we have to write a new kernel to support this, then we might as well write one that does everything we need, unless cudf has a need for something similar.

@revans2
Collaborator

revans2 commented Oct 31, 2023

I guess we could do our own UTF-8 to code point conversion using bit shifts, casts, comparisons, etc. to make it work, but that feels really, really slow and potentially error prone.
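For illustration, a hedged sketch of what that manual conversion could look like: decoding the first code point from a UTF-8 byte array with masks and shifts, assuming well-formed input and doing no bounds or validity checking.

// Decode the first Unicode code point from UTF-8 bytes using bit masks and shifts.
// Illustrative only; assumes well-formed UTF-8 and does no error handling.
def firstCodePoint(bytes: Array[Byte]): Int = {
  val b0 = bytes(0) & 0xFF
  if (b0 < 0x80) b0                                        // 1-byte sequence: 0xxxxxxx
  else if ((b0 & 0xE0) == 0xC0)                            // 2-byte sequence: 110xxxxx 10xxxxxx
    ((b0 & 0x1F) << 6) | (bytes(1) & 0x3F)
  else if ((b0 & 0xF0) == 0xE0)                            // 3-byte sequence: 1110xxxx ...
    ((b0 & 0x0F) << 12) | ((bytes(1) & 0x3F) << 6) | (bytes(2) & 0x3F)
  else                                                     // 4-byte sequence: 11110xxx ...
    ((b0 & 0x07) << 18) | ((bytes(1) & 0x3F) << 12) |
      ((bytes(2) & 0x3F) << 6) | (bytes(3) & 0x3F)
}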

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Oct 31, 2023
@davidwendt

Take a look at cudf::strings::code_points
Here is the test code for it: https://github.com/rapidsai/cudf/blob/b4746d8064198d6287c83aeeaa470b499ddd2e10/cpp/tests/strings/attrs_tests.cpp#L31
Perhaps after slicing the first character, this API could be used to get the code-point value you need?

@revans2
Collaborator

revans2 commented Nov 6, 2023

@davidwendt Thanks, I missed that one.

https://github.com/rapidsai/cudf/blob/70c4283dbd6700bba43440e07b293b2294ea6634/cpp/include/cudf/strings/attributes.hpp#L86

So it looks like we just need to put in some cudf JNI APIs for the code_points API.
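Roughly, the plugin-side usage might then look like the sketch below, assuming a JNI binding named codePoints() is added to ai.rapids.cudf.ColumnVector (the name is illustrative and does not exist at the time of writing), and ignoring corner cases such as empty strings.

// Hypothetical plugin-side sketch: slice the first character, then ask cudf for
// its code point value. codePoints() is an assumed JNI wrapper over
// cudf::strings::code_points, not an existing API.
import ai.rapids.cudf.ColumnVector

def gpuAsciiSketch(input: ColumnVector): ColumnVector = {
  val firstChar = input.substring(0, 1)   // first character of each string
  try {
    firstChar.codePoints()                // assumed binding, returns integer values
  } finally {
    firstChar.close()
  }
}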

@thirtiseven
Collaborator

code_points returns the UTF-8 encoding of each character as an integer, while Spark on CPU / Java's codePointAt returns Unicode code points. They are mismatched for any multi-byte character, so it seems that code_points cannot be used directly to fully match ascii.

I think fully matching it needs a custom kernel, but if we only want to support ASCII and Latin-1, we can support that quickly in the plugin.

In Databricks's doc, it says that:

If the first character is not an ASCII character or part of the Latin-1 Supplement range of UTF-16, the result is undefined.

In my tests, the behavior of Apache Spark and Databricks matches, which is Java codePointAt's behavior.

Here is a workaround to support ascii 0~255 only, without a custom kernel (a code sketch follows the table):

ascii      utf-8 dec      workaround
0~127      0~127          cudf::code_points
128~191    49792~49855    code_points - 49,664
192~255    50048~50111    code_points - 49,856
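A minimal sketch of that correction in code, mapping the integer that code_points would produce for the first character back to the 0~255 code point; the helper name is made up for illustration and is not a plugin API.

// Map the integer value of the first character's UTF-8 bytes (as code_points
// would report it) back to the 0~255 code point. Illustrative sketch only.
def utf8IntToAscii(v: Int): Int =
  if (v >= 0 && v <= 127) v                       // 1-byte UTF-8: value equals code point
  else if (v >= 49792 && v <= 49855) v - 49664    // 0xC280~0xC2BF -> 128~191
  else if (v >= 50048 && v <= 50111) v - 49856    // 0xC380~0xC3BF -> 192~255
  else throw new IllegalArgumentException(s"value $v is outside the supported 0~255 range")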

@thirtiseven
Collaborator

@nvliyuan, is it good enough for customers to just support ASCII 0~255 (ASCII and Latin-1 Supplement) as a first step?

@thirtiseven thirtiseven self-assigned this Dec 5, 2023
@revans2
Collaborator

revans2 commented Dec 5, 2023

@nvliyuan, is it good enough for customers to just support ASCII 0~255 (ASCII and Latin-1 Supplement) as a first step?

I am fine if we restrict the range of values supported, but it has to be off by default unless we can match the result in all cases. Databricks can say that some ranges are undefined behavior, but Apache Spark does not:

https://spark.apache.org/docs/latest/api/sql/index.html#ascii

And neither does Java.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-

So what we are saying is that Java's codePointAt API has undefined behavior for any character not in the range specified by Databricks?

@nvliyuan
Collaborator Author

nvliyuan commented Dec 6, 2023

@nvliyuan, is it good enough for customers to just support ASCII 0~255 (ASCII and Latin-1 Supplement) as a first step?

Confirmed with customers: ASCII 0~255 is good enough; for other characters we need to fall back rather than crash.

@thirtiseven
Collaborator

So what we are saying is that Java's codePointAt API has undefined behavior for any character not in the range specified by Databricks?

I think the doc is explaining that the result of ascii is real ASCII for 0~255, but outside that range it is not real ASCII (it is UTF-16 code points). Java's codePointAt should work fine even inside Databricks.

@nvliyuan
Collaborator Author

nvliyuan commented Dec 6, 2023

I believe a GPU job should always return the same result as a CPU run; otherwise it should be regarded as a bug? @revans2

@revans2
Collaborator

revans2 commented Dec 6, 2023

Confirmed with customers: ASCII 0~255 is good enough; for other characters we need to fall back rather than crash.

We cannot fall back for the others. We don't have enough information by the time that we start planning to be able to fall back. If we knew what the characters were at planning time we could just replace it with the result up front.

I believe a GPU job should always return the same result as a CPU run; otherwise it should be regarded as a bug?

If the ASCII operator is on by default, then it must do the same thing as the CPU. If it is off by default and a user has to opt into using it, then we can have some limitations, but I would really prefer for us to make it work exactly the same, even if we have to have a workaround like in #9585 (comment). We can improve the speed with a custom kernel in the future if we see that it is needed.

@nvliyuan
Collaborator Author

nvliyuan commented Dec 7, 2023

Thanks for the explanation, I will double-confirm and update.

@nvliyuan
Collaborator Author

nvliyuan commented Dec 8, 2023

@thirtiseven confirmed ASCII 0~255 is good enough, thx
