Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Null Handling in where #9328

Closed
quasiben opened this issue Sep 28, 2021 · 8 comments · Fixed by #11276
Closed

[BUG] Null Handling in where #9328

quasiben opened this issue Sep 28, 2021 · 8 comments · Fixed by #11276
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@quasiben
Copy link
Member

Null handling in where does not reflect Pandas Null handling:

In [20]: sr = pd.Series([0, 1, 2, None])

In [21]: sr.where(~(sr > 1), -1)
Out[21]:
0    0.0
1    1.0
2   -1.0
3    NaN
dtype: float64

In [22]: sr = cudf.Series([0, 1, 2, None])

In [23]: sr.where(~(sr > 1), -1)
Out[23]:
0    0
1    1
2   -1
3   -1
dtype: int64

In the above Nan is always treat as a value less than any int/float, etc.

cc @randerzander

@quasiben quasiben added bug Something isn't working Needs Triage Need team to review and classify labels Sep 28, 2021
@beckernick
Copy link
Member

beckernick commented Sep 28, 2021

This feels like a nullable dtype discrepancy or perhaps pandas/our documentation is in need of an update?

pd.Series.where

Where cond is True, keep the original value. Where False, replace with corresponding value from other.

~(sr > 1) should resolve to [True, True, False, NA] in cuDF and in pandas with nullable types. Without nullable types, we should get [True, True, False, True] and thus keep the Nan.

If sr.where only replaces where the mask is False replaces where cond is != True (including if cond is null), this may be oddly behaving as intended?

@beckernick beckernick added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Sep 28, 2021
@charlesbluca
Copy link
Member

Looks like you were correct @beckernick; the nulls are replaced after a call to cudf._lib.copying.copy_if_else, and looking at the docstring over in the header file:

* @param[in] boolean_mask column of `type_id::BOOL8` representing "left (true) / right (false)"
* boolean for each element. null element represents false.

A simple workaround here would be to simply assign None to the null values of the copy_if_else result immediately afterwards, though I'm interested in why this is the behavior in libcudf, and if it would be better to implement this fix there.

@brandon-b-miller
Copy link
Contributor

We're equivalent to pandas nullable types here:

sr = pd.Series([0, 1, 2, None], dtype='Int64')
>>> sr
0       0
1       1
2       2
3    <NA>
dtype: Int64
>>> sr.where(~(sr > 1), -1)
0     0
1     1
2    -1
3    -1
dtype: Int64

@charlesbluca
Copy link
Member

In that case, is the larger issue here that Pandas series still infer non-nullable types, while cuDF infers nullable? Based on Pandas docs, it seems like the recommendation here would be to explicitly provide dtype like in @brandon-b-miller's example:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

Although I do think it would be nice in the docs of both libraries to emphasize that where treats pd.NA as False.

@brandon-b-miller
Copy link
Contributor

Yes - I think that is what is happening. sr is different, I think where is consistent. I'm not quite sure what to do about this yet but will spend some cycles thinking about it today.

@quasiben
Copy link
Member Author

Thanks for jumping in @charlesbluca and @brandon-b-miller . A doc update might be the resolution here

@charlesbluca charlesbluca removed their assignment Sep 30, 2021
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants