
Support float/double castings for ORC reading [databricks] #6319

Merged: 14 commits, Sep 7, 2022

Conversation

sinkinben
Contributor

@sinkinben sinkinben commented Aug 15, 2022

Closes #6291 (a sub-issue of #6149).

This PR implements casting float/double to {bool, integer types, double/float, string, timestamp} when reading ORC.

double is also known as float64. Integer types include int8/16/32/64.

Implementation

| Casting | Implementation Description |
| --- | --- |
| float/double -> {bool, int8/16/32/64} | 1. First replace rows that cannot fit in a long with nulls. 2. Convert the ColumnVector to Long type. 3. Downcast the long values to the target integral type. |
| float <-> double | 1. Call ColumnView.castTo. 2. When casting double -> float, if the double value is greater than FLOAT_MAX, mark the value as Infinity. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps more than 10. 2. Added a config item spark.rapids.sql.format.orc.floatTypesToString.enable (default value is true) to control whether float/double -> string is allowed while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds. 2. If ROUND(val * 1000) > LONG_MAX (e.g. val = 1e20), replace the value with null; otherwise convert the values into a milliseconds vector. 3. Multiply by 1000 again to get a microseconds vector, taking care of long (INT64) overflow, since timestamps are stored as INT64. A simplified sketch of this arithmetic follows the table. |
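Below is a simplified, plain-Scala sketch of the float/double -> timestamp arithmetic from the last row. It only mirrors the steps described above, not the actual cuDF-based implementation, and the helper name is illustrative:

```scala
// Plain-Scala illustration of the seconds -> microseconds conversion with overflow handling.
def secondsToMicros(seconds: Array[Double]): Array[Option[Long]] =
  seconds.map { v =>
    val micros = v * 1000d * 1000d
    // A value such as 1e20 seconds does not fit into an INT64 microsecond timestamp,
    // so it is replaced with null (None here); NaN is also treated as null.
    if (micros.isNaN || micros > Long.MaxValue.toDouble || micros < Long.MinValue.toDouble) {
      None
    } else {
      // seconds -> milliseconds (rounded) -> microseconds, staying within INT64
      Some(math.round(v * 1000d) * 1000L)
    }
  }

secondsToMicros(Array(1.5, 1e20))  // Array(Some(1500000), None)
```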

* Impl: float/double -> double/float, bool, int8/16/32/64, timestamp
* There are some precision issues when casting float/double to string
* IT test: needs a float_gen with range [a, b]

Signed-off-by: sinkinben <sinkinben@outlook.com>
@sameerz sameerz added the task Work required that improves the product but is not user facing label Aug 15, 2022
Comment on lines 29 to 30
# TODO: merge test_casting_from_float and test_casting_from_double into one test
# TODO: Need a float_gen with range [a, b], if float/double >= 1e13, then float/double -> timestamp will overflow
Contributor

Do these TODO comments still need addressing in this PR, or require follow-on issues to be filed?

Contributor Author

> Do these TODO comments still need addressing in this PR, or require follow-on issues to be filed?

Yep, I plan to address these TODOs in this PR.

assert_gpu_and_cpu_are_equal_collect(
    lambda spark: spark.read.schema(schema_str).orc(orc_path)
)

Contributor

nit: too many blank lines between tests. I believe the convention is to have 2, although we do not do this consistently in our tests.

Contributor Author

Fixed.

Signed-off-by: sinkinben <sinkinben@outlook.com>
@sinkinben
Contributor Author

sinkinben commented Aug 16, 2022

For the precision problem of float/double -> string, there is a similar operation in SQL cast, e.g. "select cast(float_col as string) from table".

In SQL cast, it is controlled by a conf:

val ENABLE_CAST_FLOAT_TO_STRING = conf("spark.rapids.sql.castFloatToString.enabled")
    .doc("Casting from floating point types to string on the GPU returns results that have " +
      "a different precision than the default results of Spark.")
    .booleanConf
    .createWithDefault(true)

lazy val isCastFloatToStringEnabled: Boolean = get(ENABLE_CAST_FLOAT_TO_STRING)

In GpuCast.scala, the plan tree is checked recursively:

private def recursiveTagExprForGpuCheck(...) {
  ...
  case (_: FloatType | _: DoubleType, _: StringType) if !conf.isCastFloatToStringEnabled =>
    willNotWorkOnGpu("the GPU will use different precision than Java's toString method when " +
        "converting floating point data types to strings and this can produce results that " +
        "differ from the default behavior in Spark.  To enable this operation on the GPU, set" +
        s" ${RapidsConf.ENABLE_CAST_FLOAT_TO_STRING} to true.")
}

I think we can handle the precision problem of float/double->string in a similar way here.
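For the ORC read path, a conf following the same pattern could look roughly like this (just a sketch; the doc wording and the name of the lazy val are illustrative, not the final code):

```scala
val ENABLE_ORC_FLOAT_TYPES_TO_STRING =
  conf("spark.rapids.sql.format.orc.floatTypesToString.enable")
    .doc("When reading an ORC file, casting from float/double to string on the GPU keeps " +
      "a different number of digits after the decimal point than the CPU, so the results " +
      "can differ from Spark's default behavior.")
    .booleanConf
    .createWithDefault(true)

// Illustrative accessor name, mirroring isCastFloatToStringEnabled above.
lazy val isOrcFloatTypesToStringEnable: Boolean = get(ENABLE_ORC_FLOAT_TYPES_TO_STRING)
```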

…ading.

Signed-off-by: sinkinben <sinkinben@outlook.com>
@sinkinben sinkinben marked this pull request as ready for review August 17, 2022 08:19
* Fixed bug when all elements in ColumnVector are null
* Updated IT tests

Signed-off-by: sinkinben <sinkinben@outlook.com>
@sinkinben sinkinben self-assigned this Aug 18, 2022
@sinkinben sinkinben requested a review from revans2 August 29, 2022 09:23
@@ -864,6 +864,16 @@ object RapidsConf {
.booleanConf
.createWithDefault(true)

val ENABLE_ORC_FLOAT_TYPES_TO_STRING =
Collaborator

Please add some docs to this that indicate what will happen if we run into this situation. For most configs when we are in this kind of a situation we fall back to the CPU, but here we will throw an exception and the job will fail.

Contributor Author

I have updated this in config.md.

// The conf 'spark.rapids.sql.format.orc.floatTypesToString.enable' controls whether
// this cast is enabled.
case (DType.FLOAT32 | DType.FLOAT64, DType.STRING) =>
  GpuCast.castFloatingTypeToString(col)
Collaborator

Can you please file a follow-on issue for us to go back and see what we can do to fix this?

Contributor Author

> Can you please file a follow-on issue for us to go back and see what we can do to fix this?

Ok, after merging this, I will file an issue to describe this problem.


# When casting float/double to double/float, we need to compare the GPU values with the
# CPU values approximately.
@pytest.mark.approximate_float
Collaborator

Are we off if we don't do this? It feels odd that we would get a different answer.

Contributor Author

@sinkinben sinkinben Aug 31, 2022

> Are we off if we don't do this? It feels odd that we would get a different answer.

Yep, it's okay to remove approximate_float; we can still pass the test.

But I think we should pay attention to how we compare floating point numbers for equality.

For example,

scala> var k = (3.14).toFloat
var k: Float = 3.14
scala> k.toDouble
val res3: Double = 3.140000104904175

I don't know whether the float -> double conversion on the GPU is the same as on the CPU.

We should check whether two floating point numbers are equal via abs(val1 - val2) < EPSILON, where EPSILON is the allowed accuracy error.
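Something like this (just a rough sketch, not project code) is what I mean:

```scala
// Compare two floating point values with an allowed error instead of strict equality.
def approxEqual(a: Double, b: Double, epsilon: Double = 1e-6): Boolean =
  math.abs(a - b) < epsilon

approxEqual(3.14f.toDouble, 3.14)          // true: 3.140000104904175 is within 1e-6 of 3.14
approxEqual(3.14f.toDouble, 3.14, 1e-9)    // false: the widened float differs by about 1e-7
```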

@sinkinben sinkinben changed the title Support float/double castings for ORC reading. Support float/double castings for ORC reading [databricks] Aug 31, 2022
@revans2
Collaborator

revans2 commented Sep 6, 2022

build

Labels
task Work required that improves the product but is not user facing
Development

Successfully merging this pull request may close these issues.

[FEA] Support float/double -> {bool, integer types, double/float, string, timestamp} for ORC reading.
5 participants