[BUG] nullability on some decimal operations is wrong #9108
Comments
Hi, what does the "it is not expected to be null" part mean? Is it because we tell Spark the data is not nullable but there are nulls in the rows? Then Spark would not check the nullability of the data when reading it.
No, it means the input to one of the decimal operations. It turns out this can cause data corruption too :(
If you run it on the CPU you get a null as the output for the rows that would overflow the add, but if you run it on the GPU you get an empty string out. And if you do a collect instead of a show, you get bad results back instead of nulls.
This happens for all of the binary ops in some form.
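For concreteness, a minimal repro sketch along these lines (the exact query from the report is not quoted in this thread, so the values here are illustrative; it assumes a spark-shell session with the RAPIDS plugin enabled):

```scala
// spark.range produces a non-nullable id column, so the whole expression is
// marked non-nullable even though the decimal add can overflow to null
// in non-ANSI mode.
val df = spark.range(10).selectExpr(
  "CAST(id AS DECIMAL(38,0)) + CAST('99999999999999999999999999999999999999' AS DECIMAL(38,0)) AS total")

df.show()    // CPU: null for the overflowing rows; GPU (with this bug): empty strings
df.collect() // GPU (with this bug): corrupt values instead of nulls
```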
Thanks a lot. I also figured out a repro case, but yours looks simpler. And my case here has data corruption for all the Spark versions, which will also be fixed by my PR.
Describe the bug
There are some expressions where we set nullable incorrectly.
`CheckOverflow` always has `nullable` set to `true`, but we have it set to `child.nullable`, as inherited from the unary expression. In https://issues.apache.org/jira/browse/SPARK-39316, which made some changes to how arithmetic operations are calculated for decimals, Spark changed the default `nullable` implementation for `BinaryArithmetic`:
https://github.com/apache/spark/blob/ce12f6dbad2d713c6a2a52ac36d1f7a910399ad4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L253-L263
Ours no longer matches this and still does `left.nullable || right.nullable`. This can result in crashes if we do a `ColumnarToRow` operation and it is not expected to be null. We should probably audit this for all expressions and see if we are setting `nullable` correctly.
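Roughly, the mismatch looks like this (a paraphrase of both sides, not the actual plugin or Spark code; `failOnError` stands in for ANSI mode):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DecimalType

// Paraphrase of Spark's post-SPARK-39316 behavior: decimal arithmetic is
// nullable whenever overflow can yield null (non-ANSI mode), even if both
// children are non-nullable.
def sparkNullable(left: Expression, right: Expression, failOnError: Boolean): Boolean =
  left.nullable || right.nullable ||
    (left.dataType.isInstanceOf[DecimalType] && !failOnError)

// What the plugin still computes, per the description above: it misses the
// overflow-to-null case, so a ColumnarToRow can hit a null it was told
// could never happen.
def pluginNullable(left: Expression, right: Expression): Boolean =
  left.nullable || right.nullable
```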
Steps/Code to reproduce bug
The good news is that it is kind of hard to reproduce. You need to overflow the decimal operation on one of these, and have the input marked as not-nullable. But Spark tends to mark all inputs as nullable unless the input is a constant or comes from `spark.range`, so it would be good to come up with a way to test this generally, but I don't know of one right now.
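One ad-hoc way to check a single case (a sketch, not a general test; it reuses the illustrative overflowing add from the comment above and assumes a spark-shell session):

```scala
// With spark.range the input is non-nullable, so a nullability bug in the
// decimal add shows up directly: Spark should report the result column as
// nullable, because overflow produces null in non-ANSI mode.
val df = spark.range(10).selectExpr(
  "CAST(id AS DECIMAL(38,0)) + CAST('99999999999999999999999999999999999999' AS DECIMAL(38,0)) AS total")

println(df.schema("total").nullable)        // expected: true
println(df.filter("total IS NULL").count()) // CPU: > 0 for the overflowing rows
```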