[BUG] Distinct count of floating point values differs with regular spark #837
Comments
At a minimum we need to document this and probably reuse the hasNans config that we use to restrict min/max floating point aggregations.
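For reference, this is roughly how a user toggles that existing config today (key name as documented in the plugin configs; shown only to illustrate the kind of switch being suggested):

```scala
// Tell the plugin the data contains no NaNs so floating point min/max
// aggregations can stay on the GPU; the same pattern could gate this
// distinct count case.
spark.conf.set("spark.rapids.sql.hasNans", "false")
```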
Thanks for the report. I would think the normalization of NaNs and zeroes code could have taken care of this. Should we target this for 0.3?
When I look at the plan there is no normalization on a reduction. As such I would argue that it is a bug in Spark and we might want to file something against them. I don't think we can fully fix this without support from cuDF. So for now, I would like to see it documented and disabled by default in 0.3, but we need to keep this or another issue open to figure out what the final solution is if it is not considered a bug in Spark.
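A quick way to see what this comment describes (the exact plan text varies by Spark version, so treat this as a sketch):

```scala
import spark.implicits._

val df = Seq(0.0, -0.0, Double.NaN).toDF("a")

// A hash aggregate keyed on a double column shows the NaN/zero normalization
// expression (normalizenanandzero) in its grouping keys...
df.groupBy($"a").count().explain()

// ...while the COUNT(DISTINCT ...) reduction does not get the same treatment.
df.selectExpr("count(distinct a)").explain()
```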
@sameerz do you agree that mitigating this should be a P1 in 0.3?
+1. IMHO this is P1 for 0.3
I will start working on a PR to mitigate this in our plugin (have it off by default with a config to enable it again). I will also take a look at doing something in Spark to avoid the issue in the future.
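From the user's point of view the mitigation would look something like this (the config key below is hypothetical, used only to illustrate the off-by-default approach):

```scala
// Re-enable the GPU distinct count on floats after reading the documented caveats.
// NOTE: hypothetical key, not the plugin's actual config name.
spark.conf.set("spark.rapids.sql.floatDistinctCount.enabled", "true")
```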
I did some more testing and Spark is self-consistent here. My proposal right now is to document this and then file a follow-on issue, or possibly co-opt #294, to have us support the logical cast operation we need in cuDF and then use it when necessary to get our implementation to match Spark. We probably could even do it for sort, but it might be more difficult to get right.
This sounds fine to me.
I have filed rapidsai/cudf#6834 so we can work around some of the issues ourselves. If we treat float values as just an int or long we should be able to do a distinct count and get the same answer as Spark. We should also be able to do some special bitwise operations and work around some of the issues in #294 (sorting and comparisons).
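To illustrate the idea (not the plugin's actual implementation, which would need the cuDF support requested above): reinterpreting each double as its raw 64-bit pattern makes -0.0 differ from 0.0 and keeps NaNs with different payloads distinct, which is what the CPU reduction effectively sees.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Reinterpret the bits of a double as a long (a "logical cast", no value conversion).
val rawBits = udf((d: Double) => java.lang.Double.doubleToRawLongBits(d))

val df = Seq(0.0, -0.0, Double.NaN, 1.0).toDF("a")

// Counting distinct bit patterns instead of distinct double values keeps
// -0.0 and 0.0 (and differently-encoded NaNs) separate.
df.select(countDistinct(rawBits($"a")).as("distinct_bits")).show()
```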
Spark changed behavior in 3.1.0 |
Describe the bug
If I do a COUNT(DISTINCT a) where a is a double or floating point value, it will produce an incorrect result if -0.0 and/or NaN values are included in it.

Steps/Code to reproduce bug
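A minimal reproduction sketch (values and session setup are illustrative; run once on the CPU and once with the RAPIDS plugin enabled and compare the counts):

```scala
import spark.implicits._

val df = Seq(0.0, -0.0, Double.NaN, Double.NaN, 1.0).toDF("a")
df.createOrReplaceTempView("tab")

// The CPU and the GPU can return different counts when -0.0 and NaN values
// are present in the column.
spark.sql("SELECT COUNT(DISTINCT a) FROM tab").show()
```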
Expected behavior
We should get the same answer as Spark does, even if that answer treats -0.0 as different from 0.0 and all of the NaNs as different from each other.