[BUG] Distinct count of floating point values differs with regular spark #837

Closed
revans2 opened this issue Sep 23, 2020 · 10 comments · Fixed by #1412
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

revans2 (Collaborator) commented Sep 23, 2020

Describe the bug
If I do a COUNT(DISTINCT a) where a is a float or double column, it produces an incorrect result if -0.0 and/or NaN values are included in the data.

Steps/Code to reproduce bug

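# Repro written against the plugin's Python integration-test helpers:
# assert_gpu_and_cpu_are_equal_collect runs the query on both CPU and GPU and
# compares the results; binary_op_df, float_gen, double_gen and idfn come from
# the same test utilities.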
@pytest.mark.parametrize('data_gen', [float_gen, double_gen], ids=idfn)
def test_distinct_float_count_reductions(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark : binary_op_df(spark, data_gen).selectExpr(
                'count(DISTINCT a)'))

Expected behavior
We should get the same answer as Spark does, even if Spark treats -0.0 as different from 0.0 and all of the NaNs as different from each other.
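
To make the failure mode concrete, here is a minimal, hand-written sketch (separate from the repro above; it assumes a plain local SparkSession and made-up values) of the kind of query that has to return the same count on GPU and CPU:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# -0.0 and NaN are the problem values: whatever count the CPU run returns
# for this query, the GPU plugin needs to return the same thing.
df = spark.createDataFrame([(0.0,), (-0.0,), (float("nan"),), (1.5,)], ["a"])
df.selectExpr("count(DISTINCT a)").show()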

revans2 added the bug and ? - Needs Triage labels on Sep 23, 2020
revans2 (Collaborator, Author) commented Sep 23, 2020

At a minimum we need to document this and probably reuse the hasNans config that we use to restrict min/max floating point aggregations.
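
For reference, a hedged sketch of what reusing that config looks like from a user's session; the key spark.rapids.sql.hasNans is the existing config referenced above, and the exact name and default should be treated as version-dependent:

# Assumes the existing spark.rapids.sql.hasNans config; setting it to "false"
# asserts the data contains no NaNs, which lets the plugin keep the restricted
# floating point aggregations on the GPU.
spark.conf.set("spark.rapids.sql.hasNans", "false")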

kuhushukla (Collaborator) commented

Thanks for the report. I would have thought the NaN and zero normalization code could have taken care of this. Should we target this for 0.3?

revans2 (Collaborator, Author) commented Sep 23, 2020

When I look at the plan there is no normalization on a reduction. As such I would argue that it is a bug in Spark and we might want to file something against them. I don't think we can fully fix this without support from cudf. So for now, I would like to see it documented and disabled by default in 0.3, but we need to keep this or another issue open to figure out what the final solution is if it is not considered a bug in Spark.

revans2 added the P0 label and removed ? - Needs Triage on Sep 23, 2020
revans2 (Collaborator, Author) commented Sep 23, 2020

@sameerz do you agree that mitigating this should be a P1 in 0.3?

kuhushukla (Collaborator) commented

> As such I would argue that it is a bug in Spark and we might want to file something against them. I don't think we can fully fix this without support from cudf. So for now, I would like to see it documented and disabled by default in 0.3, but we need to keep this or another issue open to figure out what the final solution is if it is not considered a bug in Spark.

+1. IMHO this is P1 for 0.3

revans2 (Collaborator, Author) commented Nov 20, 2020

I will start working on a PR to mitigate this in our plugin (have it off by default with a config to enable it again). I will also take a look at doing something in Spark to avoid the issue in the future.

revans2 (Collaborator, Author) commented Nov 23, 2020

I did some more testing and Spark is self-consistent with COUNT DISTINCT. It is still a hot mess, but -0.0 and 0.0 are always different, as are all of the different types of NaN values. However, I am beginning to doubt that we need to do more than just document this right now. We already have issues with comparison operators when it comes to -0.0 (see #294). I think we could fix them, and this too, with a modified cudf logical_cast implementation that allows treating a float as an INT, or with our own bit_cast. That is a little beside the point, because #294 is currently a P3, and I am having a really hard time justifying why we would disable this operation for floating point when we don't do it for the comparison operators. Part of me says that we are more likely to hit this situation because we are comparing more data, so the probability is higher, but an incorrect result is an incorrect result. So I would like some feedback from @sameerz, @jlowe, and @tgravescs.

My proposal right now is to document this and then file a follow-on issue, or possibly co-opt #294, to have us support the logical cast operation we need in CUDF and then use it when necessary to get our implementation to match Spark. We could probably even do it for sort, but it might be more difficult to get right.

jlowe (Member) commented Nov 23, 2020

> My proposal right now is to document this and then file a follow-on issue, or possibly co-opt #294, to have us support the logical cast operation we need in CUDF and then use it when necessary to get our implementation to match Spark.

This sounds fine to me.

revans2 (Collaborator, Author) commented Nov 23, 2020

I have filed rapidsai/cudf#6834 so we can work around some of the issues ourselves. If we treat float values as just an int or long, we should be able to do a distinct count and get the same answer as Spark. We should also be able to do some special bitwise operations to work around some of the issues in #294 (sorting and comparisons).
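
To illustrate the idea, here is a hedged sketch only; numpy stands in for the bit-level view we would want from cudf, and the helper name is made up:

import numpy as np

def distinct_count_by_bits(values):
    # Hypothetical helper: reinterpret each float64 as its raw int64 bit
    # pattern, so -0.0 vs 0.0 (and NaNs with different payloads) remain
    # distinct, i.e. the count is taken over bit patterns, not float values.
    bits = np.asarray(values, dtype=np.float64).view(np.int64)
    return len(np.unique(bits))

# 0.0, -0.0 and NaN have three different bit patterns, so this prints 3.
print(distinct_count_by_bits([0.0, -0.0, float("nan")]))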

revans2 (Collaborator, Author) commented Dec 17, 2020

Spark changed behavior in 3.1.0

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023