
Support float/double castings for ORC reading [databricks] #6319

Merged: 14 commits, Sep 7, 2022

Conversation

sinkinben
Contributor

@sinkinben sinkinben commented Aug 15, 2022

Closes #6291 (a sub-issue of #6149).

This PR implements casting float/double to {bool, integer types, double/float, string, timestamp} when reading ORC.

double is also known as float64. Integer types include int8/16/32/64.

Implementation

| Casting | Implementation Description |
| --- | --- |
| float/double -> {bool, int8/16/32/64} | 1. First replace rows that cannot fit in a long with nulls. 2. Convert the ColumnVector to Long type. 3. Downcast the long values to the target integral type. |
| float <-> double | 1. Call ColumnView.castTo. 2. When casting double -> float, if the double value is greater than FLOAT_MAX, mark the value as Infinity. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps more than 10. 2. Added a config item spark.rapids.sql.format.orc.floatTypesToString.enable (default value is true) to control whether float/double -> string is allowed while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds. 2. If ROUND(val * 1000) > LONG_MAX (e.g. val = 1e20), replace the value with null; otherwise convert the values into a milliseconds vector. 3. Multiply by 1000 again to get a microseconds vector, taking care of long (INT64) overflow, since timestamps are stored as INT64. A simplified sketch of this arithmetic follows the table. |
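Below is a simplified, plain-Scala sketch of the float/double -> timestamp arithmetic from the last row. It only mirrors the steps described above, not the actual cuDF-based implementation, and the helper name is illustrative:

```scala
// Plain-Scala illustration of the seconds -> microseconds conversion with overflow handling.
def secondsToMicros(seconds: Array[Double]): Array[Option[Long]] =
  seconds.map { v =>
    val micros = v * 1000d * 1000d
    // A value such as 1e20 seconds does not fit into an INT64 microsecond timestamp,
    // so it is replaced with null (None here); NaN is also treated as null.
    if (micros.isNaN || micros > Long.MaxValue.toDouble || micros < Long.MinValue.toDouble) {
      None
    } else {
      // seconds -> milliseconds (rounded) -> microseconds, staying within INT64
      Some(math.round(v * 1000d) * 1000L)
    }
  }

secondsToMicros(Array(1.5, 1e20))  // Array(Some(1500000), None)
```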

* Impl: float/double -> double/float, bool, int8/16/32/64, timestamp
* There are some precision issues when casting float/double to string
* IT test: needs a float_gen with range [a, b]

Signed-off-by: sinkinben <sinkinben@outlook.com>
@sameerz sameerz added the task Work required that improves the product but is not user facing label Aug 15, 2022
Comment on lines 29 to 30
# TODO: merge test_casting_from_float and test_casting_from_double into one test
# TODO: Need a float_gen with range [a, b], if float/double >= 1e13, then float/double -> timestamp will overflow
Contributor

Do these TODO comments still need addressing in this PR, or require follow-on issues to be filed?

Contributor Author

> Do these TODO comments still need addressing in this PR, or require follow-on issues to be filed?

Yep, I plan to address these TODOs in this PR.

assert_gpu_and_cpu_are_equal_collect(
    lambda spark: spark.read.schema(schema_str).orc(orc_path)
)

Contributor

nit: too many blank lines between tests. I believe the convention is to have 2, although we do not do this consistently in our tests.

Contributor Author

Fixed.

Signed-off-by: sinkinben <sinkinben@outlook.com>
@sinkinben
Contributor Author

sinkinben commented Aug 16, 2022

For the precision problem of float/double -> string, there is a similar operation in SQL cast, e.g. "select cast(float_col as string) from table".

In SQL cast, it is controlled by a conf:

val ENABLE_CAST_FLOAT_TO_STRING = conf("spark.rapids.sql.castFloatToString.enabled")
    .doc("Casting from floating point types to string on the GPU returns results that have " +
      "a different precision than the default results of Spark.")
    .booleanConf
    .createWithDefault(true)

lazy val isCastFloatToStringEnabled: Boolean = get(ENABLE_CAST_FLOAT_TO_STRING)

In GpuCast.scala, the plan tree is checked recursively:

private def recursiveTagExprForGpuCheck(...) {
  ...
  case (_: FloatType | _: DoubleType, _: StringType) if !conf.isCastFloatToStringEnabled =>
    willNotWorkOnGpu("the GPU will use different precision than Java's toString method when " +
        "converting floating point data types to strings and this can produce results that " +
        "differ from the default behavior in Spark.  To enable this operation on the GPU, set" +
        s" ${RapidsConf.ENABLE_CAST_FLOAT_TO_STRING} to true.")
}

I think we can handle the precision problem of float/double->string in a similar way here.
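For the ORC read path, a conf following the same pattern could look roughly like this (just a sketch; the doc wording and the name of the lazy val are illustrative, not the final code):

```scala
val ENABLE_ORC_FLOAT_TYPES_TO_STRING =
  conf("spark.rapids.sql.format.orc.floatTypesToString.enable")
    .doc("When reading an ORC file, casting from float/double to string on the GPU keeps " +
      "a different number of digits after the decimal point than the CPU, so the results " +
      "can differ from Spark's default behavior.")
    .booleanConf
    .createWithDefault(true)

// Illustrative accessor name, mirroring isCastFloatToStringEnabled above.
lazy val isOrcFloatTypesToStringEnable: Boolean = get(ENABLE_ORC_FLOAT_TYPES_TO_STRING)
```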

…ading.

Signed-off-by: sinkinben <sinkinben@outlook.com>
@sinkinben sinkinben marked this pull request as ready for review August 17, 2022 08:19
* Fixed bug when all elements in ColumnVector are null
* Updated IT tests

Signed-off-by: sinkinben <sinkinben@outlook.com>
@sinkinben sinkinben self-assigned this Aug 18, 2022
@sinkinben sinkinben requested a review from revans2 August 29, 2022 09:23
@@ -864,6 +864,16 @@ object RapidsConf {
.booleanConf
.createWithDefault(true)

val ENABLE_ORC_FLOAT_TYPES_TO_STRING =
Collaborator

Please add some docs to this that indicate what will happen if we run into this situation. For most configs when we are in this kind of a situation we fall back to the CPU, but here we will throw an exception and the job will fail.

Contributor Author

I have updated this in config.md.

// The conf 'spark.rapids.sql.format.orc.floatTypesToString.enable' controls whether
// this cast is enabled.
case (DType.FLOAT32 | DType.FLOAT64, DType.STRING) =>
  GpuCast.castFloatingTypeToString(col)
Collaborator

Can you please file a follow-on issue for us to go back and see what we can do to fix this?

Contributor Author

> Can you please file a follow-on issue for us to go back and see what we can do to fix this?

Ok, after merging this, I will file an issue to describe this problem.


# When casting float/double to double/float, we need to compare the GPU values with the
# CPU values approximately.
@pytest.mark.approximate_float
Collaborator

Are we off if we don't do this? It feels odd that we would get a different answer.

Contributor Author

@sinkinben sinkinben Aug 31, 2022

> Are we off if we don't do this? It feels odd that we would get a different answer.

Yep, it's okay to remove approximate_float; we can still pass the test.

But I think we should pay attention to how we compare floating point numbers for equality.

For example,

scala> var k = (3.14).toFloat
var k: Float = 3.14
scala> k.toDouble
val res3: Double = 3.140000104904175

I don't know whether the float -> double conversion on the GPU is the same as on the CPU.

We should check whether two floating point numbers are equal via abs(val1 - val2) < EPSILON, where EPSILON is the allowed accuracy error.
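Something like this (just a rough sketch, not project code) is what I mean:

```scala
// Compare two floating point values with an allowed error instead of strict equality.
def approxEqual(a: Double, b: Double, epsilon: Double = 1e-6): Boolean =
  math.abs(a - b) < epsilon

approxEqual(3.14f.toDouble, 3.14)          // true: 3.140000104904175 is within 1e-6 of 3.14
approxEqual(3.14f.toDouble, 3.14, 1e-9)    // false: the widened float differs by about 1e-7
```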

@sinkinben sinkinben changed the title Support float/double castings for ORC reading. Support float/double castings for ORC reading [databricks] Aug 31, 2022
@revans2
Collaborator

revans2 commented Sep 6, 2022

build

Labels
task Work required that improves the product but is not user facing
Development

Successfully merging this pull request may close these issues.

[FEA] Support float/double -> {bool, integer types, double/float, string, timestamp} for ORC reading.
5 participants