Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) #16712

Merged (5 commits) on Sep 25, 2024

mroeschke
Contributor

Description

Previously, when `columns=` was a `cudf.Series` or `cudf.Index`, we called `array.unique.to_pandas()`. Because `.unique` is a method, not a property, this would have raised an error.
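
To illustrate the failure mode without a GPU, here is a small stand-in in plain Python. The `Labels` class is hypothetical and only mimics the shape of the call; it is not cuDF's `Series` or `Index`. Referencing a method without calling it yields a method object, which has no conversion attribute.

```python
class Labels:
    """Hypothetical stand-in for a label container (not cuDF's API)."""

    def __init__(self, values):
        self._values = list(values)

    def unique(self):
        # Deduplicate while keeping first-seen order.
        return Labels(dict.fromkeys(self._values))

    def to_list(self):
        return list(self._values)


labels = Labels(["a", "b", "a"])

try:
    # Bug pattern: `unique` is referenced but never called.
    labels.unique.to_list()
except AttributeError as exc:
    error_message = str(exc)

# Fixed pattern: call the method, then convert.
fixed = labels.unique().to_list()
```

The same mistake applies to any object whose `unique` is a method: the attribute lookup on the uncalled method fails before any data is touched.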

Also took the time to refactor the helper methods here and push the `errors=` keyword down to `Frame._drop_column`.
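
A minimal sketch of what pushing `errors=` down into a single-column helper can look like. `drop_column`, `drop`, and the plain-dict frame are hypothetical stand-ins; the real `Frame._drop_column` signature is not shown in this thread, and the `"raise"`/`"ignore"` values are assumed to follow the pandas `errors=` convention.

```python
def drop_column(columns, name, errors="raise"):
    """Remove `name` from the `columns` dict in place (hypothetical helper)."""
    if name in columns:
        del columns[name]
    elif errors == "raise":
        raise KeyError(f"column '{name}' does not exist")
    # errors="ignore": a missing label is silently skipped.


def drop(columns, labels, errors="raise"):
    """Return a copy of `columns` without `labels`.

    Missing-label handling lives in `drop_column`, not in the caller.
    """
    result = dict(columns)
    for label in labels:
        drop_column(result, label, errors=errors)
    return result


frame = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
kept = drop(frame, ["a", "missing"], errors="ignore")
```

With the check in one place, the outer function only forwards the keyword instead of pre-validating labels itself.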

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke added the bug, improvement, and non-breaking labels on Aug 30, 2024
@mroeschke requested a review from a team as a code owner on August 30, 2024 19:25
@github-actions added the Python label on Aug 30, 2024
@mroeschke removed the improvement label on Aug 30, 2024
return array.unique.to_pandas()
elif isinstance(array, (str, numbers.Number)):
return [array]
yield from as_column(array).unique().values_host
Contributor

Performance question: Do we want to run the unique() on GPU? These are columns and not rows, right? Kernel launch latency may exceed the time to run that unique step on CPU, if we expect this to be small.

I'm okay with running it on GPU if there's any uncertainty or if there are case-by-case decisions/tradeoffs we would need to consider. I just want to be sure we're not making a uniformly bad performance decision.

Contributor Author

> These are columns and not rows, right?

Yeah these are just column labels to drop that happen to be GPU backed.

I agree it might be worth doing this on the CPU instead. I'd assume len(columns to drop) << len(columns), and we convert to host anyway to iterate over the labels to drop, so we might as well do the unique step there too.
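
The host-side approach discussed here can be sketched as follows. This is an illustration in plain Python, not the code that was merged: `host_labels` stands for labels that have already been copied to the CPU, and `unique_on_host` is a hypothetical name.

```python
def unique_on_host(host_labels):
    """Deduplicate labels already on the CPU, keeping first-seen order."""
    # dict preserves insertion order; labels must be hashable.
    return list(dict.fromkeys(host_labels))


host_labels = ["b", "a", "b", "c", "a"]  # assumed: already copied from device
to_drop = unique_on_host(host_labels)
```

For a handful of labels this involves no kernel launch at all, which is the point of the performance question above.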

@@ -150,24 +149,14 @@
)


-def _get_host_unique(array):
+def _get_unique_drop_labels(array):
Contributor

What's the benefit of this being a generator? You could just return an iterable rather than yield from it if that makes sense.

Contributor Author

Probably negligible in the context of `.drop`, but the generator avoids a round trip when `array` is a scalar: scalar -> iterable (`_get_unique_drop_labels`) -> scalar (`frame._drop_column(scalar)`). I can change `_get_unique_drop_labels` back to returning an iterable if preferred.

Contributor

I’ll leave the choice to you. Just noting that yield from patterns tend to be dangerous for performance in cudf because host-device data copying is often involved.
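
The trade-off in this thread can be sketched in plain Python. Both helpers below are hypothetical illustrations rather than the merged code: the generator yields a scalar label directly, while the list version wraps it first. A generator body does not run until it is iterated, so any expensive step inside it (such as a device-to-host copy) is deferred to the first `next()`.

```python
import numbers


def labels_generator(array):
    """Yield unique labels lazily; a scalar is yielded as-is."""
    if isinstance(array, (str, numbers.Number)):
        yield array
    else:
        yield from dict.fromkeys(array)


def labels_list(array):
    """Return unique labels eagerly; a scalar is wrapped in a list."""
    if isinstance(array, (str, numbers.Number)):
        return [array]
    return list(dict.fromkeys(array))


gen = labels_generator(["x", "y", "x"])  # nothing has executed yet
from_generator = list(gen)
from_list = labels_list(["x", "y", "x"])
scalar_case = list(labels_generator("x"))
```

Both variants produce the same labels; they differ only in when the work happens and in whether the result can be iterated more than once.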

@bdice
Contributor

bdice commented Sep 25, 2024

/merge

@rapids-bot merged commit b92d008 into rapidsai:branch-24.10 on Sep 25, 2024
83 checks passed
@mroeschke deleted the bug/drop/column branch on September 25, 2024 18:36
rjzamora pushed a commit to rjzamora/cudf that referenced this pull request Sep 25, 2024
Before when `columns=` was a `cudf.Series/Index` we would call `return array.unique.to_pandas()`, but `.unique` is a method not a property so this would have raised an error.

Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#16712
Matt711 pushed a commit to mroeschke/cudf that referenced this pull request Sep 25, 2024
Labels: bug, non-breaking, Python