[BUG] cudf.testing.assert_series_equal not corrrectly testing for NaNs and assigning cp.nan to Series the same. #8513

esnvidia · 2021-06-14T18:29:08Z

Describe the bug
cudf.testing.assert_series_equal gives assertion error when comparing two series that have cp.nan filled in

Steps/Code to reproduce bug

import cupy as cp
import cudf

data= cp.array([[1.5, 0], [0.6, 1], [2.5, 0]])
expect = cudf.Series([1.5, cp.nan, 2.5])

isnull = data[:, 1] > 0.5  # get null mask
data = cudf.Series(data[:,0])

# apply null mask
data.iloc[isnull] = cp.nan

data
Out:
0    1.5
1    NaN
2    2.5
dtype: float64

cudf.testing.assert_series_equal(data, expect, 'Error in Series equal') -->

AssertionError: ColumnBase are different
values are different (0.0 %)
[left]:  [1.5 nan 2.5]
[right]: [1.5 2.5]

Notice that in [right] the NaN is missing. Also notice in the section below, that the cudf has two types of NaN: <NA> and NaN. despite both assigned cp.nan

Expected behavior
No error. Feel free to try the pandas alternative:

pd.testing.assert_series_equal(data.to_pandas(),  expect.to_pandas(), 'Error in Series equal')  --> no assertionerror

data.to_pandas()
Out:
0    1.5
1    NaN
2    2.5
dtype: float64

expect.to_pandas()
Out:
0    1.5
1    NaN
2    2.5
dtype: float64

data
Out:
0    1.5
1    NaN
2    2.5
dtype: float64

expect
Out:
0     1.5
1    <NA>
2     2.5
dtype: float64

Environment overview (please complete the following information)
Rapids 0.19 install via conda

The text was updated successfully, but these errors were encountered:

quasiben · 2021-06-14T20:36:33Z

@brandon-b-miller is this something which could be resolved by #8442 ?

brandon-b-miller · 2021-06-14T20:43:05Z

Not sure - I suspect no, but am digging in now.

brandon-b-miller · 2021-06-14T21:32:18Z

slightly simpler repro

import cupy as cp
import cudf

data = cudf.Series(cp.array([0.0, 0.0, 0.0]))
expect = cudf.Series([0.0, cp.nan, 0.0])

data.iloc[1] = cp.nan

cudf.testing.assert_series_equal(data, expect, 'Error in Series equal')

brandon-b-miller · 2021-06-14T21:55:49Z

This is a combination of two things:

when we construct from a cupy array with nans in it, we coerce those to nulls. This behavior is debatable.
cudf.testing.assert_series_equal uses to_array() to show the array differences, which is possibly incorrectly filtering the nulls. What to do with nulls during to_array() is also debatable. In any event though, it can at least be changed to show the actual reason the arrays are different (one has a null and the other has a nan) rather than erroneously erroring the way it currently does.

esnvidia · 2021-06-15T00:11:46Z

Here's my (likely slower) workaround:

from typing import Optional

def assert_series_equal(df1: cudf.Series, df2: cudf.Series, msg: Optional[str]=None):
    if not msg:
        msg = ''
    if (len(df1) != len(df2)) or ((df1.index == df2.index).sum() != len(df1)):
        # check lengths equal and indices align
        raise AssertionError('df1 and df2 indices do not match', msg)
    elif (df1.isna() == df2.isna()).sum() != len(df1):
        # check all nulls equal and all non-nulls equal a non-null.
        raise AssertionError('df1 and df2 NaNs do not match', msg)
    elif (df1.loc[~df1.isna()] != df2.loc[~df2.isna()]).any():
        # check all non nulls equal
        raise AssertionError('values in df1 do not match df2', msg)

feel free to test with the following:

v = cudf.Series(cp.array([1.5, cp.nan]))
w = cudf.Series([3.14, 0, 2.5])
x = cudf.Series([1.5, 0, 2.5])
y = cudf.Series([1.5, cp.nan, 2.5])
z = cudf.Series(cp.array([1.5, 0.5, 2.5 ]))
z.iloc[1] = cp.nan

assert_series_equal(y,z) and assert_series_equal(z,y) and \
 assert_series_equal(v, z[:2]) and assert_series_equal(w[1:], x[1:])

and similar and reverse cases should be the only ones that work.

github-actions · 2021-11-15T21:03:42Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Fixes a bug where empty columns were not comparing correctly as well as a few edge cases with strings Partially addresses #8513 Authors: - https://github.com/brandon-b-miller Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Michael Wang (https://github.com/isVoid) URL: #10011

vyasr · 2022-08-01T23:20:46Z

@brandon-b-miller of your two points above, item (2) is no longer relevant after #10011.

Regarding item 1, don't a lot of invariants break if we change the Series constructor to default to nan_as_null=False? More generally, isn't the default chosen because we assume based on empirical results that users typically have NaNs in pandas objects because they're representing nulls? The above example works exactly as expected if you construct expect with

expect = cudf.Series([1.5, cp.nan, 2.5], nan_as_null=False)

I don't think we can change the default behavior of nan_as_null without breaking many more users than we would help. CC @shwina in case he has a different opinion, but otherwise I don't think there's anything to be done aside from perhaps finding a better way to document how we ingest nan data in the constructors beyond just the API docs.

brandon-b-miller · 2022-08-02T13:16:20Z

Yeah, I'd agree. Over the long term I'd rather not have nan_as_null at all since it conflates things that are distinct, but you're right, that's the way it is right now due to pandas defaults. The day that pandas nullable types are standard would probably be the right time to do away with it.

vyasr · 2022-08-02T16:24:06Z

Sounds good. We can revisit removing nan_as_null when pandas 2.0 is released (I think proper nullable type support is on the roadmap for 2.0, but if not then it'll be later), but for now we can close this issue then.

esnvidia added Needs Triage Need team to review and classify bug Something isn't working labels Jun 14, 2021

quasiben added the Python Affects Python cuDF API. label Jun 14, 2021

brandon-b-miller self-assigned this Jun 14, 2021

brandon-b-miller removed the Needs Triage Need team to review and classify label Jun 14, 2021

github-actions bot added the inactive-90d label Nov 15, 2021

brandon-b-miller mentioned this issue Jan 11, 2022

Column equality testing fixes #10011

Merged

vyasr closed this as completed Aug 2, 2022

AntiKnot mentioned this issue Aug 23, 2024

[BUG] cudf.testing.assert_*_equal raises AssertionError for equivalent DecimalDtyped objects #16635

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] cudf.testing.assert_series_equal not corrrectly testing for NaNs and assigning cp.nan to Series the same. #8513

[BUG] cudf.testing.assert_series_equal not corrrectly testing for NaNs and assigning cp.nan to Series the same. #8513

esnvidia commented Jun 14, 2021

quasiben commented Jun 14, 2021

brandon-b-miller commented Jun 14, 2021

brandon-b-miller commented Jun 14, 2021

brandon-b-miller commented Jun 14, 2021

esnvidia commented Jun 15, 2021 •

edited

Loading

github-actions bot commented Nov 15, 2021

vyasr commented Aug 1, 2022

brandon-b-miller commented Aug 2, 2022

vyasr commented Aug 2, 2022

[BUG] cudf.testing.assert_series_equal not corrrectly testing for NaNs and assigning cp.nan to Series the same. #8513

[BUG] cudf.testing.assert_series_equal not corrrectly testing for NaNs and assigning cp.nan to Series the same. #8513

Comments

esnvidia commented Jun 14, 2021

quasiben commented Jun 14, 2021

brandon-b-miller commented Jun 14, 2021

brandon-b-miller commented Jun 14, 2021

brandon-b-miller commented Jun 14, 2021

esnvidia commented Jun 15, 2021 • edited Loading

github-actions bot commented Nov 15, 2021

vyasr commented Aug 1, 2022

brandon-b-miller commented Aug 2, 2022

vyasr commented Aug 2, 2022

esnvidia commented Jun 15, 2021 •

edited

Loading