[BUG] Distinct count of floating point values differs with regular spark #837

Closed
revans2 opened this issue Sep 23, 2020 · 10 comments · Fixed by #1412
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

revans2 (Collaborator) commented Sep 23, 2020

Describe the bug
If I do a COUNT(DISTINCT a) where a is a float or double column, it produces an incorrect result if -0.0 and/or NaN values are included in the data.

Steps/Code to reproduce bug

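# Repro written against the plugin's Python integration-test helpers:
# assert_gpu_and_cpu_are_equal_collect runs the query on both CPU and GPU and
# compares the results; binary_op_df, float_gen, double_gen and idfn come from
# the same test utilities.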
@pytest.mark.parametrize('data_gen', [float_gen, double_gen], ids=idfn)
def test_distinct_float_count_reductions(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark : binary_op_df(spark, data_gen).selectExpr(
                'count(DISTINCT a)'))

Expected behavior
We should get the same answer as Spark does, even if Spark treats -0.0 as different from 0.0 and all of the NaNs as different from each other.
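
To make the failure mode concrete, here is a minimal, hand-written sketch (separate from the repro above; it assumes a plain local SparkSession and made-up values) of the kind of query that has to return the same count on GPU and CPU:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# -0.0 and NaN are the problem values: whatever count the CPU run returns
# for this query, the GPU plugin needs to return the same thing.
df = spark.createDataFrame([(0.0,), (-0.0,), (float("nan"),), (1.5,)], ["a"])
df.selectExpr("count(DISTINCT a)").show()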

revans2 added the bug and ? - Needs Triage labels on Sep 23, 2020
revans2 (Collaborator, Author) commented Sep 23, 2020

At a minimum we need to document this and probably reuse the hasNans config that we use to restrict min/max floating point aggregations.
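
For reference, a hedged sketch of what reusing that config looks like from a user's session; the key spark.rapids.sql.hasNans is the existing config referenced above, and the exact name and default should be treated as version-dependent:

# Assumes the existing spark.rapids.sql.hasNans config; setting it to "false"
# asserts the data contains no NaNs, which lets the plugin keep the restricted
# floating point aggregations on the GPU.
spark.conf.set("spark.rapids.sql.hasNans", "false")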

kuhushukla (Collaborator) commented

Thanks for the report. I would have thought the NaN and zero normalization code could have taken care of this. Should we target this for 0.3?

revans2 (Collaborator, Author) commented Sep 23, 2020

When I look at the plan there is no normalization on a reduction. As such I would argue that it is a bug in Spark and we might want to file something against them. I don't think we can fully fix this without support from cudf. So for now, I would like to see it documented and disabled by default in 0.3, but we need to keep this or another issue open to figure out what the final solution is if it is not considered a bug in Spark.

revans2 added the P0 label and removed ? - Needs Triage on Sep 23, 2020
revans2 (Collaborator, Author) commented Sep 23, 2020

@sameerz do you agree that mitigating this should be a P1 in 0.3?

kuhushukla (Collaborator) commented

> As such I would argue that it is a bug in Spark and we might want to file something against them. I don't think we can fully fix this without support from cudf. So for now, I would like to see it documented and disabled by default in 0.3, but we need to keep this or another issue open to figure out what the final solution is if it is not considered a bug in Spark.

+1. IMHO this is P1 for 0.3

revans2 (Collaborator, Author) commented Nov 20, 2020

I will start working on a PR to mitigate this in our plugin (have it off by default with a config to enable it again). I will also take a look at doing something in Spark to avoid the issue in the future.

revans2 (Collaborator, Author) commented Nov 23, 2020

I did some more testing and Spark is self-consistent with COUNT DISTINCT. It is still a hot mess, but -0.0 and 0.0 are always different, as are all of the different types of NaN values. However, I am beginning to doubt that we need to do more than just document this right now. We already have issues with comparison operators when it comes to -0.0 (see #294). I think we could fix them, and this too, with a modified cudf logical_cast implementation that allows treating a float as an INT, or with our own bit_cast. That is a little beside the point, because #294 is currently a P3, and I am having a really hard time justifying why we would disable this operation for floating point when we don't do it for the comparison operators. Part of me says that we are more likely to hit this situation because we are comparing more data, so the probability is higher, but an incorrect result is an incorrect result. So I would like some feedback from @sameerz, @jlowe, and @tgravescs.

My proposal right now is to document this and then file a follow-on issue, or possibly co-opt #294, to have us support the logical cast operation we need in CUDF and then use it when necessary to get our implementation to match Spark. We could probably even do it for sort, but it might be more difficult to get right.

jlowe (Member) commented Nov 23, 2020

> My proposal right now is to document this and then file a follow-on issue, or possibly co-opt #294, to have us support the logical cast operation we need in CUDF and then use it when necessary to get our implementation to match Spark.

This sounds fine to me.

revans2 (Collaborator, Author) commented Nov 23, 2020

I have filed rapidsai/cudf#6834 so we can work around some of the issues ourselves. If we treat float values as just an int or long, we should be able to do a distinct count and get the same answer as Spark. We should also be able to do some special bitwise operations to work around some of the issues in #294 (sorting and comparisons).
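
To illustrate the idea, here is a hedged sketch only; numpy stands in for the bit-level view we would want from cudf, and the helper name is made up:

import numpy as np

def distinct_count_by_bits(values):
    # Hypothetical helper: reinterpret each float64 as its raw int64 bit
    # pattern, so -0.0 vs 0.0 (and NaNs with different payloads) remain
    # distinct, i.e. the count is taken over bit patterns, not float values.
    bits = np.asarray(values, dtype=np.float64).view(np.int64)
    return len(np.unique(bits))

# 0.0, -0.0 and NaN have three different bit patterns, so this prints 3.
print(distinct_count_by_bits([0.0, -0.0, float("nan")]))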

revans2 (Collaborator, Author) commented Dec 17, 2020

Spark changed behavior in 3.1.0

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023