[FEA] Support format_number #9173

Closed
viadea opened this issue Sep 1, 2023 · 9 comments · Fixed by #9790
Labels
feature request New feature or request

Comments

viadea (Collaborator) commented Sep 1, 2023

I wish we could support the format_number function.
e.g.:

select format_number(ss_wholesale_cost, 5) from store_sales limit 5;

      ! <FormatNumber> format_number(ss_wholesale_cost#95, 5) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.FormatNumber
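For reference, a quick spark-shell check of the CPU behavior to be matched (the expected output below is the example given in the Spark SQL documentation for format_number):

// Expected CPU result for the integer-scale form, per the Spark SQL docs for format_number.
spark.sql("SELECT format_number(12332.123456, 4)").collect()
// Array([12,332.1235])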
viadea added the 'feature request' and '? - Needs Triage' labels on Sep 1, 2023
revans2 (Collaborator) commented Sep 5, 2023

This is fun because we essentially have to match the Java DecimalFormat code:

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/text/DecimalFormat.html

because that is what Spark uses. Happily, it looks like the formatting is hard-coded to Locale.US, so we don't have to worry about other locales and other ways of formatting numbers.
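For reference, a rough sketch (not the plugin's actual code) of the CPU behavior in question, pinning DecimalFormat to Locale.US symbols; the pattern string and helper name here are illustrative:

import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

// Sketch of the CPU reference for format_number(x, d): group every three digits,
// keep d digits after the decimal point, using US symbols regardless of locale.
def cpuFormatNumber(x: Double, d: Int): String = {
  val pattern =
    if (d > 0) "#,###,###,###,###,###,##0." + ("0" * d)
    else "#,###,###,###,###,###,##0"
  new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US)).format(x)
}

// cpuFormatNumber(12332.123456, 4) == "12,332.1235"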

mattahrens removed the '? - Needs Triage' label on Sep 6, 2023
thirtiseven (Collaborator) commented Sep 7, 2023

format_number supports both an integer and a string (Spark SQL only) as the second parameter. In the string case the input numbers are formatted with a custom pattern. The logic that handles the pattern in the Java DecimalFormat code is quite complicated, and we may need to write a CUDA kernel for it. But I think the integer case can be supported more easily in the plugin.

@viadea Is it sufficient, as a first step for the customer's request, that the second parameter be a literal integer? We can fully support the string pattern later (maybe in cuDF or spark-rapids-jni).
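For context, the string-pattern form looks like this (the pattern and expected result are the documented example from the Spark SQL function reference; Spark SQL only):

spark.sql("SELECT format_number(12332.123456, '##################.###')").collect()
// Array([12332.124])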

thirtiseven (Collaborator) commented Sep 8, 2023

Since much of the complexity is in parsing the format string in the Java DecimalFormat code, I think it is possible to implement this on the plugin side with many substr/concat operations, as long as the format string is a literal.

Most of the logic for parsing the format string can follow or call DecimalFormat, but we need to rewrite the logic that formats the input numbers to fit the cuDF API.

I plan to work on supporting an integer as the 2nd parameter first to verify my solution; most of the code can be reused for the string-format case. A rough host-side sketch of the idea is below.
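As an illustration only: a host-side, single-value sketch of the split/round/concat idea for a literal integer scale d. The helper name is hypothetical, HALF_EVEN is assumed to mirror DecimalFormat's default rounding, and the real implementation would do this column-wise with cuDF string operations (with the float/double-to-string caveats discussed later in this thread).

// Host-side sketch of format_number(value, d) for a literal integer d:
// round to d places, split integer and fraction parts, insert grouping commas.
def formatNumberHost(value: Double, d: Int): String = {
  val rounded =
    BigDecimal(value).setScale(d, BigDecimal.RoundingMode.HALF_EVEN).toPlainString
  val (intPart, fracPart) = rounded.span(_ != '.')
  val (sign, digits) = if (intPart.startsWith("-")) ("-", intPart.tail) else ("", intPart)
  // insert a ',' every three digits, counting from the right
  val grouped = digits.reverse.grouped(3).mkString(",").reverse
  sign + grouped + (if (d > 0) fracPart else "")
}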

viadea (Collaborator, Author) commented Sep 11, 2023

Above is just my example.
The real customer use case is:

<FormatNumber> format_number(somecol, 5) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.FormatNumber

where somecol is a double column.

thirtiseven (Collaborator) commented:

Unfortunately, I think the float/double type for the first parameter also cannot be fully supported on the plugin side, because casting float/double to string doesn't match Spark/Java's result; see the compatibility doc.

Normally we would first convert a float/double to a string correctly before formatting it, and I also haven't found a workaround that gives me enough information.

I will create a PR soon to support the other types and partially support float/double. We may need a custom kernel for float-to-string casting; see #4204.
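To illustrate the kind of CPU reference involved (assuming the CPU side follows java.lang.Double.toString, which switches notation and emits the shortest digits that round-trip):

// Java's Double.toString uses plain notation for 1e-3 <= |x| < 1e7 and
// scientific notation outside that range; a GPU cast would have to reproduce this.
java.lang.Double.toString(0.001)   // "0.001"
java.lang.Double.toString(1.0e-4)  // "1.0E-4"
java.lang.Double.toString(1.23e7)  // "1.23E7"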

thirtiseven (Collaborator) commented:

Hi @viadea, do we have more information about the range/precision of the doubles the customer will use? It could be difficult to exactly match Spark's behavior for doubles at high precision, but it will be easy to match them at limited precision.

viadea (Collaborator, Author) commented Oct 20, 2023

> Hi @viadea, do we have more information about the range/precision of the doubles the customer will use? It could be difficult to exactly match Spark's behavior for doubles at high precision, but it will be easy to match them at limited precision.

Let me check on that and I will update you internally once I get the answer.

thirtiseven (Collaborator) commented:

To implement the float-to-string part of format_number, the solution from cuDF (draft PR NVIDIA/spark-rapids-jni#1508) does not look good enough because of round-off errors at high precision.

Instead I found https://github.com/ulfjack/ryu, which is a popular float-to-string solution with a C implementation under the Apache license. If we can get approval to use some of its code, the custom kernel development will be easier.

However, it also can't fully match Java's results, and the mismatches are cases where Java's result is not good enough. (Java actually switched to a new float-to-string algorithm in later JDK versions to fix those issues.) Since we can't use Java's code because of its license, it will be difficult to match Java's bugs perfectly.

@revans2 Do you think it's OK to use this solution for the float-to-string casting kernel?

revans2 (Collaborator) commented Oct 30, 2023

@thirtiseven I am fine with a different implementation for float/double-to-string and string-to-float/double, so long as:

  1. It is documented exactly what we are doing and when the results might differ. For example: the result you get back from Java differs based on the JDK version used, but ours does not, and here is why.
  2. It is self-consistent, in that we get the same result back for the same input every time.

It would also be nice if we could be consistent in how we round-trip the data: float -> string -> float and double -> string -> double each produce the same value as the input (bit for bit). But that is not a requirement in any way.
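For what it's worth, that round-trip property can be stated as a simple check; the sketch below uses Java's own conversions as the reference, which do guarantee the round trip for finite values, and the function name is just illustrative:

// Round-trip check: double -> string -> double reproduces the original bits (finite inputs).
def roundTripsExactly(d: Double): Boolean = {
  val s = java.lang.Double.toString(d)        // double -> string
  val back = java.lang.Double.parseDouble(s)  // string -> double
  java.lang.Double.doubleToLongBits(back) == java.lang.Double.doubleToLongBits(d)
}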
