-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) #16712
Changes from 2 commits
23645d8
33d140d
82cb70f
e35fca2
e3d17bc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,7 +3,6 @@ | |
|
||
from __future__ import annotations | ||
|
||
import numbers | ||
import operator | ||
import textwrap | ||
import warnings | ||
|
@@ -150,24 +149,14 @@ | |
) | ||
|
||
|
||
def _get_host_unique(array): | ||
def _get_unique_drop_labels(array): | ||
"""Return labels to be dropped for IndexFrame.drop.""" | ||
if isinstance(array, (cudf.Series, cudf.Index, ColumnBase)): | ||
return array.unique.to_pandas() | ||
elif isinstance(array, (str, numbers.Number)): | ||
return [array] | ||
yield from as_column(array).unique().values_host | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Performance question: Do we want to run the I'm okay with running it on GPU if there's any uncertainty or if there's case-by-case decisions/tradeoffs we would need to consider, just want to be sure we're not making a uniformly bad performance decision. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yeah these are just column labels to drop that happen to be GPU backed. I agree it might be worth doing this on the CPU instead. I'd assume |
||
elif is_scalar(array): | ||
yield array | ||
else: | ||
return set(array) | ||
|
||
|
||
def _drop_columns(f: Frame, columns: abc.Iterable, errors: str): | ||
for c in columns: | ||
try: | ||
f._drop_column(c) | ||
except KeyError as e: | ||
if errors == "ignore": | ||
pass | ||
else: | ||
raise e | ||
yield from set(array) | ||
|
||
|
||
def _indices_from_labels(obj, labels): | ||
|
@@ -5261,15 +5250,14 @@ def drop( | |
out = self.copy() | ||
|
||
if axis in (1, "columns"): | ||
target = _get_host_unique(target) | ||
|
||
_drop_columns(out, target, errors) | ||
for label in _get_unique_drop_labels(target): | ||
out._drop_column(label, errors=errors) | ||
elif axis in (0, "index"): | ||
dropped = _drop_rows_by_labels(out, target, level, errors) | ||
|
||
if columns is not None: | ||
columns = _get_host_unique(columns) | ||
_drop_columns(dropped, columns, errors) | ||
for label in _get_unique_drop_labels(columns): | ||
dropped._drop_column(label, errors=errors) | ||
|
||
out._mimic_inplace(dropped, inplace=True) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit of this being a generator? You could just return an iterable rather than
yield from
it if that makes sense.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably negligible in the context of
.drop
, but it was to avoid a case wherearray
was a scalar so we were convertingscalar -> iterable (_get_unique_drop_labels) -> scalar (frame._drop_column(scalar))
. I can change it back to make this_get_unique_drop_labels
return an iterable if preferred.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll leave the choice to you. Just noting that
yield from
patterns tend to be dangerous for performance in cudf because host-device data copying is often involved.