Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) #16712

mroeschke · 2024-08-30T19:25:52Z

Description

Previously, when columns= was a cudf.Series/Index, we would execute return array.unique.to_pandas(), but .unique is a method, not a property, so this would have raised an error.

Also took the time to refactor the helper methods here and push the errors= keyword down to Frame._drop_column.
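
The failure mode described above can be reproduced without a GPU. The sketch below is not cudf code: `FakeSeries` is a hypothetical stand-in whose `unique` and `to_pandas` only mimic the shape of the real methods. It illustrates why referencing `.unique` without calling it fails, and what the corrected call looks like.

```python
class FakeSeries:
    """Hypothetical stand-in for a Series-like object; not cudf code."""

    def __init__(self, values):
        self._values = list(values)

    def unique(self):
        # Order-preserving de-duplication of the stored values.
        return FakeSeries(dict.fromkeys(self._values))

    def to_pandas(self):
        # Stand-in for a device-to-host conversion; returns a plain list.
        return list(self._values)


labels = FakeSeries(["a", "b", "a"])

# Buggy form: `unique` is referenced but never called, so `.to_pandas`
# is looked up on a bound method object, which has no such attribute.
try:
    labels.unique.to_pandas()
except AttributeError as exc:
    buggy_error = exc
else:
    buggy_error = None

# Fixed form: call the method first, then convert.
fixed = labels.unique().to_pandas()
```

With the stand-in, the buggy form raises `AttributeError`, while the fixed form yields the de-duplicated labels.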

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

bdice · 2024-09-03T15:55:59Z

python/cudf/cudf/core/indexed_frame.py

-        return array.unique.to_pandas()
-    elif isinstance(array, (str, numbers.Number)):
-        return [array]
+        yield from as_column(array).unique().values_host


Performance question: Do we want to run the unique() on GPU? These are columns and not rows, right? Kernel launch latency may exceed the time to run that unique step on CPU, if we expect this to be small.

I'm okay with running it on GPU if there's any uncertainty or if there are case-by-case decisions/tradeoffs we would need to consider. I just want to be sure we're not making a uniformly bad performance decision.

> These are columns and not rows, right?

Yeah, these are just column labels to drop that happen to be GPU backed.

I agree it might be worth doing this on the CPU instead. I'd assume len(columns to drop) << len(columns), and we convert to host anyway to iterate over these labels to drop, so we might as well do the unique step there too.
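
The reasoning in this thread (the labels are few and end up on the host anyway, so de-duplicating there avoids a kernel launch) can be sketched in plain Python. `unique_labels_on_host` is a hypothetical name, and the sketch assumes the labels have already been copied to host memory as hashable Python objects. A later commit in this PR is titled "do unique via numpy instead"; the version below uses only the standard library to stay dependency-free and is not the merged implementation.

```python
def unique_labels_on_host(labels):
    """De-duplicate host-side labels, keeping first-seen order.

    Hypothetical helper for illustration; assumes every label is hashable.
    """
    # dict preserves insertion order, so duplicates collapse onto
    # their first occurrence.
    return list(dict.fromkeys(labels))


to_drop = unique_labels_on_host(["b", "a", "b", "c", "a"])
```

For a handful of column labels, this runs entirely on the CPU and involves no device work at all.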

bdice · 2024-09-03T15:59:12Z

python/cudf/cudf/core/indexed_frame.py

@@ -150,24 +149,14 @@
 )


-def _get_host_unique(array):
+def _get_unique_drop_labels(array):


What's the benefit of this being a generator? You could just return an iterable rather than yield from it if that makes sense.

Probably negligible in the context of .drop, but it was to avoid a case where array was a scalar so we were converting scalar -> iterable (_get_unique_drop_labels) -> scalar (frame._drop_column(scalar)). I can change it back to make this _get_unique_drop_labels return an iterable if preferred.

I’ll leave the choice to you. Just noting that yield from patterns tend to be dangerous for performance in cudf because host-device data copying is often involved.
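
The two shapes discussed in this thread can be compared side by side. The sketch below is an assumption-laden illustration, not the cudf helpers: `iter_drop_labels`, `list_drop_labels`, and `drop_columns` are hypothetical names, and a plain dict stands in for the frame. It also shows one way the `errors=` handling could live in the per-column step, using the `!= "ignore"` check mentioned in the commit list.

```python
import numbers


def iter_drop_labels(labels):
    """Generator variant: a scalar is yielded once, never wrapped in a list."""
    if isinstance(labels, (str, numbers.Number)):
        yield labels
    else:
        # Order-preserving de-duplication of host-side labels.
        yield from dict.fromkeys(labels)


def list_drop_labels(labels):
    """Iterable variant: always returns a list, wrapping a scalar if needed."""
    if isinstance(labels, (str, numbers.Number)):
        return [labels]
    return list(dict.fromkeys(labels))


def drop_columns(data, labels, errors="raise"):
    """Drop labels from a dict standing in for a frame's columns."""
    remaining = dict(data)
    for label in iter_drop_labels(labels):
        if label in remaining:
            del remaining[label]
        elif errors != "ignore":
            # Any value other than "ignore" is treated as "raise".
            raise KeyError(f"column '{label}' does not exist")
    return remaining
```

Both helpers produce the same labels; the difference is only whether the caller receives a lazy generator or a materialized list. In this sketch all data is already on the host, so neither variant triggers a host-device copy.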


bdice · 2024-09-25T18:35:35Z

/merge

Before when `columns=` was a `cudf.Series/Index` we would call `return array.unique.to_pandas()`, but `.unique` is a method not a property so this would have raised an error.

Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: rapidsai#16712

mroeschke added 2 commits August 30, 2024 12:20

Fix DataFrame.drop(columns=cudf.Series/Index, axis=1)

23645d8

Use != ignore to catch invalid inputs

33d140d

mroeschke added the bug, improvement, and non-breaking labels Aug 30, 2024

mroeschke requested a review from a team as a code owner August 30, 2024 19:25

mroeschke requested review from isVoid and charlesbluca August 30, 2024 19:25

github-actions bot added the Python label Aug 30, 2024

mroeschke removed the improvement label Aug 30, 2024

bdice reviewed Sep 3, 2024


mroeschke added 3 commits September 3, 2024 10:46

Merge remote-tracking branch 'upstream/branch-24.10' into bug/drop/co…

82cb70f

…lumn

do unique via numpy instead

e35fca2

Merge remote-tracking branch 'upstream/branch-24.10' into bug/drop/co…

e3d17bc

…lumn

bdice approved these changes Sep 25, 2024


rapids-bot bot merged commit b92d008 into rapidsai:branch-24.10 Sep 25, 2024
83 checks passed

mroeschke deleted the bug/drop/column branch September 25, 2024 18:36
