Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dataframe and Index nunique #10077

Merged
merged 4 commits into from
Feb 2, 2022

Conversation

martinfalisse
Copy link
Contributor

@martinfalisse martinfalisse commented Jan 19, 2022

Add Dataframe and Index nunique. Resolves #9611

@martinfalisse martinfalisse requested a review from a team as a code owner January 19, 2022 19:09
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@github-actions github-actions bot added the Python Affects Python cuDF API. label Jan 19, 2022
@martinfalisse martinfalisse changed the title Add Dataframe and Index nunique Add Dataframe and Index nunique :improvement :non-breaking Jan 19, 2022
@martinfalisse martinfalisse changed the title Add Dataframe and Index nunique :improvement :non-breaking Add Dataframe and Index nunique Jan 19, 2022
@vyasr
Copy link
Contributor

vyasr commented Jan 19, 2022

Hi @martinfalisse thanks for the contribution! Could you retarget this PR to 22.04? We are no longer accepting new changes on 22.02 since we're in the process of finalizing a release.

@vyasr vyasr added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Jan 19, 2022
@martinfalisse martinfalisse changed the base branch from branch-22.02 to branch-22.04 January 20, 2022 07:26
@martinfalisse
Copy link
Contributor Author

martinfalisse commented Jan 20, 2022

Hi @martinfalisse thanks for the contribution! Could you retarget this PR to 22.04? We are no longer accepting new changes on 22.02 since we're in the process of finalizing a release.

Sure, sorry about that, should be fixed now.

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
Co-authored-by: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thank you @martinfalisse

@beckernick
Copy link
Member

Thanks for the contribution @martinfalisse !

May I ask how you decided to work on #9611 vs another issue?

@martinfalisse
Copy link
Contributor Author

Thanks for the contribution @martinfalisse !

May I ask how you decided to work on #9611 vs another issue?

Sure! I searched for the issues under the category 'good first issue', and then picked one that I felt I could implement. This specific issue seemed to only require putting some things together on the python side which is why I chose it.

Hope to contribute more in the future!

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost there! Thanks for the work, the changes are looking good.

python/cudf/cudf/core/series.py Show resolved Hide resolved
python/cudf/cudf/core/single_column_frame.py Show resolved Hide resolved
python/cudf/cudf/tests/test_dataframe.py Outdated Show resolved Hide resolved
@ajschmidt8
Copy link
Member

ok to test

@vyasr
Copy link
Contributor

vyasr commented Feb 2, 2022

@martinfalisse you're almost there! The code looks great now, but the style checks are failing. The easiest way for you to fix this is to use pre-commit. If you haven't used it before, you can pip install pre-commit, then run pre-commit run --all-files inside the cudf repo. It should automatically fix as many issues as it can, and you can read the other errors to see what's failing. From a quick look it does appear that fixing the outstanding issues will require some manual intervention, so feel free to ping back here if you have trouble addressing them!

As an aside, you can run pre-commit install inside the repo once you have pre-commit on your machine and it will install git hooks that will run our style checks every time you try to commit. That will ensure that every commit you make will pass our tests.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the great work.

For future reference, please don't force push changes once reviews have started. It can erase enough history to make comments vanish and cause other problems. It's easier to just add commits, and we squash all commits when a PR is merged anyway so you don't need to worry about keeping the history clean.

@codecov
Copy link

codecov bot commented Feb 2, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.04@b217d7e). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-22.04   #10077   +/-   ##
===============================================
  Coverage                ?   10.47%           
===============================================
  Files                   ?      122           
  Lines                   ?    20506           
  Branches                ?        0           
===============================================
  Hits                    ?     2148           
  Misses                  ?    18358           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b217d7e...bcd84ad. Read the comment docs.

@vyasr
Copy link
Contributor

vyasr commented Feb 2, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b7257a3 into rapidsai:branch-22.04 Feb 2, 2022
rapids-bot bot pushed a commit that referenced this pull request Feb 3, 2022
Removes unnecessary nunique function in Series, as indicated by @vyasr here #10077 (comment).

Seem to have made a mistake when correcting the style of my last commit and not have kept the original changes of that commit.

Authors:
  - https://github.com/martinfalisse

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10205
rapids-bot bot pushed a commit that referenced this pull request Mar 11, 2022
…0411)

The APIs `DataFrame.nunique`, `Series.nunique`, and `ColumnBase.distinct_count` have a `method` argument. The pandas API for `nunique` [does not have a `method` argument](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html). This `method` argument was meant to distinguish distinct counts computed by sort/hash approaches, but there are two issues: only `"sort"` is supported as an input value, and the algorithm in libcudf actually uses a _hash-based approach_. To resolve this inconsistency and align with the pandas API, I have removed the `method` parameter from `DataFrame.nunique`, `Series.nunique`, and `ColumnBase.distinct_count`.

This is a breaking change but I don't think a deprecation is needed. The `nunique` feature was added in 22.04 via #10077, so it has not yet been released. The internal changes to `distinct_count` can be made without a deprecation.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10411
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] DataFrame and Index nunique
6 participants