
Implement all the casting cases that GPU can support for ORC reading. #6149

Open
Tracked by #5895
firestarman opened this issue Jul 29, 2022 · 4 comments
Labels
task Work required that improves the product but is not user facing

Comments


firestarman commented Jul 29, 2022

There will be more than 100 cases. We may need multiple sub issues for this.
Click to see the full type casting list that CPU ORC supports.

@firestarman firestarman changed the title from "Implement all the casting cases that GPU can support. There will be more than 100 cases. We may need multiple sub issues for this. (Here is full type casting list CPU ORC supports.)" to "Implement all the casting cases that GPU can support." Jul 29, 2022
@firestarman firestarman changed the title from "Implement all the casting cases that GPU can support." to "Implement all the casting cases that GPU can support for ORC reading." Jul 29, 2022
@sameerz sameerz added the ? - Needs Triage label Jul 29, 2022

jlowe commented Aug 2, 2022

Note that for the CHAR type, casting to a string requires stripping the trailing whitespace from the value to match the CPU behavior. See #6188 (comment). It would be nice if we could ask libcudf to load the CHAR column with trailing whitespace stripped instead of added, so we don't have to perform a post-processing step on the CHAR columns.
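Until libcudf can do that, a minimal sketch of the post-processing step could look like the following, assuming cuDF's Java `rstrip()` string API; the helper name is hypothetical and this is not the plugin's actual code:

```scala
import ai.rapids.cudf.{ColumnVector, DType}

// Hypothetical post-processing step: drop the trailing blanks that ORC pads
// onto CHAR(n) values so the GPU output matches the CPU's CHAR -> string cast.
def stripCharPadding(charCol: ColumnVector): ColumnVector = {
  require(charCol.getType == DType.STRING, "CHAR columns are loaded as STRING")
  charCol.rstrip() // remove trailing whitespace only, keep leading spaces
}
```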

@sameerz sameerz added the task label and removed the ? - Needs Triage label Aug 2, 2022

sinkinben commented Aug 10, 2022

I divide these castings into subcategories, according to the source type.

  • timestamp -> {integer types, float32, double, string, date}

Two special cases:

  • decimal <-> {bool, integer types, float, double, string, timestamp}
  • binary <-> string

As mentioned in the comment above, the trailing whitespace of char/varchar/string values needs special attention.


sinkinben commented Aug 17, 2022

A Summary of Implementation Details

Casting from Integer Types

| Casting | Implementation Description |
| --- | --- |
| bool -> float/double | Based on `ColumnVector.castTo` in cuDF. |
| bool -> string | Call `castTo`, then convert the results to upper-case `TRUE`/`FALSE` (as the CPU code does). |
| int8/16/32/64 -> float/double/string | Call `castTo`. |
| bool/int8/16/32 -> timestamp | The original value is in seconds; convert it to microseconds. Since timestamp is stored as int64, there is no integer overflow. |
| int64 -> timestamp | 1. From spark311 until spark320 (including 311, 312, 313, 314), the integers are treated as milliseconds when casting to timestamp. 2. For spark320+ (including spark320), they are treated as seconds. 3. In both cases, convert them to microseconds. See the sketch below. |
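A minimal sketch of the version-dependent int64 -> timestamp scaling, assuming the cuDF Java API; the helper and the `isSpark320Plus` flag are hypothetical, not the plugin's actual code:

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Hypothetical helper: scale int64 values to microseconds according to the
// Spark version, then reinterpret them as TIMESTAMP_MICROSECONDS.
def int64ToTimestampMicros(longCol: ColumnVector, isSpark320Plus: Boolean): ColumnVector = {
  // Pre-3.2.0 Spark treats the integers as milliseconds (x1000 to micros);
  // 3.2.0+ treats them as seconds (x1000000 to micros).
  val factor = Scalar.fromLong(if (isSpark320Plus) 1000000L else 1000L)
  try {
    val micros = longCol.mul(factor) // can overflow INT64 for extreme inputs
    try {
      micros.castTo(DType.TIMESTAMP_MICROSECONDS)
    } finally {
      micros.close()
    }
  } finally {
    factor.close()
  }
}
```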

Casting from Float Types

| Casting | Implementation Description |
| --- | --- |
| float/double -> {bool, int8/16/32/64} | 1. First replace the rows that cannot fit in a long with nulls. 2. Convert the ColumnVector to LONG. 3. Down-cast the longs to the target integral type. See the sketch below. |
| float <-> double | 1. Call `ColumnView.castTo`. 2. When casting double -> float, if the double value is greater than FLOAT_MAX, mark the value as Infinity. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps more than 10. 2. Added a config item `spark.rapids.sql.format.orc.floatTypesToString.enable` (default `true`) to control whether float/double -> string is allowed while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds. 2. If `ROUND(val * 1000) > LONG_MAX` (e.g. `val = 1e20`), replace the value with null; otherwise keep it and convert the values into a milliseconds vector. 3. Multiply by 1000 to convert to microseconds, paying attention to long (INT64) overflow, since timestamp is stored as INT64. |
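A rough sketch of the float/double -> integral path above, again assuming the cuDF Java API (`greaterOrEqualTo`, `ifElse`, `castTo`); the helper is hypothetical and glosses over NaN details:

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Hypothetical helper: null out values that cannot fit in a long, cast the
// rest to LONG, then down-cast to the requested integral type.
def floatToIntegral(col: ColumnVector, target: DType): ColumnVector = {
  val min = Scalar.fromDouble(Long.MinValue.toDouble)
  val max = Scalar.fromDouble(Long.MaxValue.toDouble)
  val geMin = col.greaterOrEqualTo(min)   // also false for NaN rows
  val leMax = col.lessOrEqualTo(max)
  val fits = geMin.and(leMax)
  val nullVal = Scalar.fromNull(col.getType)
  val cleaned = fits.ifElse(col, nullVal) // out-of-range rows become null
  val asLong = cleaned.castTo(DType.INT64)
  val result = asLong.castTo(target)
  Seq[AutoCloseable](min, max, geMin, leMax, fits, nullVal, cleaned, asLong)
    .foreach(_.close())
  result
}
```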

Casting from string

| Casting | Implementation Description |
| --- | --- |
| string -> bool/int8/16/32/64 | 1. Check the pattern of the input strings with a regular expression; replace invalid strings with null. 2. Follow the CPU ORC conversion: first convert the strings to long, then down-cast the longs to the target integral type (see the sketch below). 3. For string -> bool, the strings "true"/"false" are invalid; they should be "0"/"1". |
| string -> float/double | 1. Check the pattern of the input strings by regex, ignoring leading/trailing spaces; replace invalid strings with null. 2. Call `castTo(FLOAT)` or `castTo(DOUBLE)` in cuDF. |
| string -> date/timestamp | Working on it. |
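A sketch of the string -> integral path, assuming cuDF's Java `matchesRe` string API; the regex is simplified and the helper is hypothetical:

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Hypothetical helper: strings that do not look like integers become null,
// then the survivors are cast to LONG and down-cast to the target type.
def stringToIntegral(strCol: ColumnVector, target: DType): ColumnVector = {
  val valid = strCol.matchesRe("^[+-]?[0-9]+$") // boolean validity mask
  val nullStr = Scalar.fromNull(DType.STRING)
  val cleaned = valid.ifElse(strCol, nullStr)
  val asLong = cleaned.castTo(DType.INT64)
  val result = asLong.castTo(target)
  Seq[AutoCloseable](valid, nullStr, cleaned, asLong).foreach(_.close())
  result
}
```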

Casting from Date types (TODO)

| Casting | Implementation Description |
| --- | --- |
| date -> string | Call `ColumnView.asString()`, which calls `asStrings("%Y-%m-%d")` internally. |
| date -> timestamp | 1. Convert the date columnar vector into INT64. 2. Multiply it by 24 * 60 * 60 * 1e6 (microseconds per day), then convert it into TIMESTAMP_MICROSECONDS. See the sketch below. |
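A sketch of the date -> timestamp step, assuming the cuDF Java API; the helper name is hypothetical:

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Hypothetical helper: a date column holds days since the epoch; widen it to
// INT64, scale by microseconds per day, and view it as a timestamp.
def dateToTimestampMicros(dateCol: ColumnVector): ColumnVector = {
  val asDays = dateCol.castTo(DType.INT64)
  val microsPerDay = Scalar.fromLong(24L * 60 * 60 * 1000000L)
  val micros = asDays.mul(microsPerDay)
  val result = micros.castTo(DType.TIMESTAMP_MICROSECONDS)
  Seq[AutoCloseable](asDays, microsPerDay, micros).foreach(_.close())
  result
}
```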

However, there are still some issues. For more details, see the comments in #6357.

Here is the Code branch.


sinkinben commented Aug 30, 2022

As mentioned in the discussion in apache/orc#1237:

> Both Apache Spark and ORC community recommend to use explicit SQL CAST method instead of depending on data source's Schema Evolution.

That is, we can replace schema evolution with CAST in SQL.

For example, suppose we have an ORC file that contains one column date_str, whose values are strings with the pattern YYYY-mm-dd.
```
// Read `date_str` as type string, without schema evolution
scala> var df = spark.read.schema("date_str string").orc("/tmp/orc/data.orc")
scala> df.show()
+----------+
|  date_str|
+----------+
|2002-01-01|
|2022-08-29|
|2022-08-31|
|2022-01-32|
|9808-02-30|
|2022-06-31|
+----------+

// Cast `date_str` to type date, using SQL CAST
scala> df.registerTempTable("table")
scala> df.sqlContext.sql("select CAST(date_str as date) from table").show()
+----------+
|  date_str|
+----------+
|2002-01-01|
|2022-08-29|
|2022-08-31|
|      null|
|      null|
|      null|
+----------+
```
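For reference, the same cast can also be written with the DataFrame API instead of a temp table; a minimal equivalent, assuming the same `df` as above:

```scala
// Equivalent DataFrame-API form of the SQL CAST above.
df.select(df("date_str").cast("date").alias("date_str")).show()
```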

@HaoYang670 HaoYang670 removed their assignment Jun 5, 2023