
Implement all the casting cases that GPU can support for ORC reading. #6149

Open
Tracked by #5895
firestarman opened this issue Jul 29, 2022 · 4 comments
Labels
task Work required that improves the product but is not user facing

Comments


firestarman commented Jul 29, 2022

There will be more than 100 cases. We may need multiple sub issues for this.
Click to see the full type casting list that CPU ORC supports.

@firestarman firestarman changed the title from "Implement all the casting cases that GPU can support. There will be more than 100 cases. We may need multiple sub issues for this. (Here is full type casting list CPU ORC supports.)" to "Implement all the casting cases that GPU can support." Jul 29, 2022
@firestarman firestarman changed the title from "Implement all the casting cases that GPU can support." to "Implement all the casting cases that GPU can support for ORC reading." Jul 29, 2022
@sameerz sameerz added the ? - Needs Triage label Jul 29, 2022

jlowe commented Aug 2, 2022

Note that for the CHAR type, casting to a string requires stripping the trailing whitespace from the value to match the CPU behavior. See #6188 (comment). It would be nice if we could ask libcudf to load the CHAR column with trailing whitespace stripped instead of added, so we don't have to perform a post-processing step on the CHAR columns.
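Until libcudf can do that, a minimal sketch of the post-processing step could look like the following, assuming cuDF's Java `rstrip()` string API; the helper name is hypothetical and this is not the plugin's actual code:

```scala
import ai.rapids.cudf.{ColumnVector, DType}

// Hypothetical post-processing step: drop the trailing blanks that ORC pads
// onto CHAR(n) values so the GPU output matches the CPU's CHAR -> string cast.
def stripCharPadding(charCol: ColumnVector): ColumnVector = {
  require(charCol.getType == DType.STRING, "CHAR columns are loaded as STRING")
  charCol.rstrip() // remove trailing whitespace only, keep leading spaces
}
```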

@sameerz sameerz added the task label and removed the ? - Needs Triage label Aug 2, 2022

sinkinben commented Aug 10, 2022

I divide these castings into subcategories, according to the source type.

  • timestamp -> {integer types, float32, double, string, date}

Two special cases:

  • decimal <-> {bool, integer types, float, double, string, timestamp}
  • binary <-> string

As mentioned in the comment above, the trailing whitespace of char/varchar/string values needs special attention.


sinkinben commented Aug 17, 2022

A Summary of Implementation Details

Casting from Integer Types

| Casting | Implementation Description |
| --- | --- |
| bool -> float/double | Based on `ColumnVector.castTo` in cuDF. |
| bool -> string | Call `castTo`, then convert the results to upper-case `TRUE`/`FALSE` (as the CPU code does). |
| int8/16/32/64 -> float/double/string | Call `castTo`. |
| bool/int8/16/32 -> timestamp | The original value is in seconds; convert it to microseconds. Since timestamp is stored as int64, there is no integer overflow. |
| int64 -> timestamp | 1. From spark311 until spark320 (including 311, 312, 313, 314), the integers are treated as milliseconds when casting to timestamp. 2. For spark320+ (including spark320), they are treated as seconds. 3. In both cases, convert them to microseconds. See the sketch below. |
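A minimal sketch of the version-dependent int64 -> timestamp scaling, assuming the cuDF Java API; the helper and the `isSpark320Plus` flag are hypothetical, not the plugin's actual code:

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Hypothetical helper: scale int64 values to microseconds according to the
// Spark version, then reinterpret them as TIMESTAMP_MICROSECONDS.
def int64ToTimestampMicros(longCol: ColumnVector, isSpark320Plus: Boolean): ColumnVector = {
  // Pre-3.2.0 Spark treats the integers as milliseconds (x1000 to micros);
  // 3.2.0+ treats them as seconds (x1000000 to micros).
  val factor = Scalar.fromLong(if (isSpark320Plus) 1000000L else 1000L)
  try {
    val micros = longCol.mul(factor) // can overflow INT64 for extreme inputs
    try {
      micros.castTo(DType.TIMESTAMP_MICROSECONDS)
    } finally {
      micros.close()
    }
  } finally {
    factor.close()
  }
}
```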

Casting from Float Types

| Casting | Implementation Description |
| --- | --- |
| float/double -> {bool, int8/16/32/64} | 1. First replace the rows that cannot fit in a long with nulls. 2. Convert the ColumnVector to LONG. 3. Down-cast the longs to the target integral type. See the sketch below. |
| float <-> double | 1. Call `ColumnView.castTo`. 2. When casting double -> float, if the double value is greater than FLOAT_MAX, mark the value as Infinity. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps more than 10. 2. Added a config item `spark.rapids.sql.format.orc.floatTypesToString.enable` (default `true`) to control whether float/double -> string is allowed while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds. 2. If `ROUND(val * 1000) > LONG_MAX` (e.g. `val = 1e20`), replace the value with null; otherwise keep it and convert the values into a milliseconds vector. 3. Multiply by 1000 to convert to microseconds, paying attention to long (INT64) overflow, since timestamp is stored as INT64. |
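A rough sketch of the float/double -> integral path above, again assuming the cuDF Java API (`greaterOrEqualTo`, `ifElse`, `castTo`); the helper is hypothetical and glosses over NaN details:

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Hypothetical helper: null out values that cannot fit in a long, cast the
// rest to LONG, then down-cast to the requested integral type.
def floatToIntegral(col: ColumnVector, target: DType): ColumnVector = {
  val min = Scalar.fromDouble(Long.MinValue.toDouble)
  val max = Scalar.fromDouble(Long.MaxValue.toDouble)
  val geMin = col.greaterOrEqualTo(min)   // also false for NaN rows
  val leMax = col.lessOrEqualTo(max)
  val fits = geMin.and(leMax)
  val nullVal = Scalar.fromNull(col.getType)
  val cleaned = fits.ifElse(col, nullVal) // out-of-range rows become null
  val asLong = cleaned.castTo(DType.INT64)
  val result = asLong.castTo(target)
  Seq[AutoCloseable](min, max, geMin, leMax, fits, nullVal, cleaned, asLong)
    .foreach(_.close())
  result
}
```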

Casting from string

| Casting | Implementation Description |
| --- | --- |
| string -> bool/int8/16/32/64 | 1. Check the pattern of the input strings with a regular expression; replace invalid strings with null. 2. Follow the CPU ORC conversion: first convert the strings to long, then down-cast the longs to the target integral type (see the sketch below). 3. For string -> bool, the strings "true"/"false" are invalid; they should be "0"/"1". |
| string -> float/double | 1. Check the pattern of the input strings by regex, ignoring leading/trailing spaces; replace invalid strings with null. 2. Call `castTo(FLOAT)` or `castTo(DOUBLE)` in cuDF. |
| string -> date/timestamp | Working on it. |
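A sketch of the string -> integral path, assuming cuDF's Java `matchesRe` string API; the regex is simplified and the helper is hypothetical:

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Hypothetical helper: strings that do not look like integers become null,
// then the survivors are cast to LONG and down-cast to the target type.
def stringToIntegral(strCol: ColumnVector, target: DType): ColumnVector = {
  val valid = strCol.matchesRe("^[+-]?[0-9]+$") // boolean validity mask
  val nullStr = Scalar.fromNull(DType.STRING)
  val cleaned = valid.ifElse(strCol, nullStr)
  val asLong = cleaned.castTo(DType.INT64)
  val result = asLong.castTo(target)
  Seq[AutoCloseable](valid, nullStr, cleaned, asLong).foreach(_.close())
  result
}
```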

Casting from Date types (TODO)

| Casting | Implementation Description |
| --- | --- |
| date -> string | Call `ColumnView.asString()`, which calls `asStrings("%Y-%m-%d")` internally. |
| date -> timestamp | 1. Convert the date columnar vector into INT64. 2. Multiply it by 24 * 60 * 60 * 1e6 (microseconds per day), then convert it into TIMESTAMP_MICROSECONDS. See the sketch below. |
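A sketch of the date -> timestamp step, assuming the cuDF Java API; the helper name is hypothetical:

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Hypothetical helper: a date column holds days since the epoch; widen it to
// INT64, scale by microseconds per day, and view it as a timestamp.
def dateToTimestampMicros(dateCol: ColumnVector): ColumnVector = {
  val asDays = dateCol.castTo(DType.INT64)
  val microsPerDay = Scalar.fromLong(24L * 60 * 60 * 1000000L)
  val micros = asDays.mul(microsPerDay)
  val result = micros.castTo(DType.TIMESTAMP_MICROSECONDS)
  Seq[AutoCloseable](asDays, microsPerDay, micros).foreach(_.close())
  result
}
```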

However, there are still some issues. For more details, see the comments in #6357.

Here is the Code branch.


sinkinben commented Aug 30, 2022

As mentioned in the discussion in apache/orc#1237:

> Both Apache Spark and ORC community recommend to use explicit SQL CAST method instead of depending on data source's Schema Evolution.

That is, we can replace schema evolution with CAST in SQL.

For example, suppose we have an ORC file that contains one column date_str, whose values are strings with the pattern YYYY-mm-dd.
```
// Read `date_str` as type string, without schema evolution
scala> var df = spark.read.schema("date_str string").orc("/tmp/orc/data.orc")
scala> df.show()
+----------+
|  date_str|
+----------+
|2002-01-01|
|2022-08-29|
|2022-08-31|
|2022-01-32|
|9808-02-30|
|2022-06-31|
+----------+

// Cast `date_str` to type date, using SQL CAST
scala> df.registerTempTable("table")
scala> df.sqlContext.sql("select CAST(date_str as date) from table").show()
+----------+
|  date_str|
+----------+
|2002-01-01|
|2022-08-29|
|2022-08-31|
|      null|
|      null|
|      null|
+----------+
```
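For reference, the same cast can also be written with the DataFrame API instead of a temp table; a minimal equivalent, assuming the same `df` as above:

```scala
// Equivalent DataFrame-API form of the SQL CAST above.
df.select(df("date_str").cast("date").alias("date_str")).show()
```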

@HaoYang670 HaoYang670 removed their assignment Jun 5, 2023