Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 'spearman' correlation method for dataframe.corr and series.corr #7141

Merged
merged 37 commits into from
Mar 23, 2022

Conversation

dominicshanshan
Copy link
Contributor

@dominicshanshan dominicshanshan commented Jan 14, 2021

Closes #6804

Adds 'spearman' correlation method for dataframe.corr

…rame-add-spearman

[REVIEW]Add 'spearman' correlation matrix in dataframe.py rapidsai#6804
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@isVoid
Copy link
Contributor

isVoid commented Jan 14, 2021

ok to test

Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing! Can you add unit tests for this method to match pandas output?

@isVoid isVoid changed the title [REVIEW]Add 'spearman' correlation matrix in dataframe.py #6804 Add 'spearman' correlation method for dataframe.corr Jan 14, 2021
@isVoid isVoid added Python Affects Python cuDF API. non-breaking Non-breaking change feature request New feature or request 2 - In Progress Currently a work in progress labels Jan 14, 2021
@isVoid isVoid changed the title Add 'spearman' correlation method for dataframe.corr Add 'spearman' correlation method for dataframe.corr Jan 14, 2021
Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for contributing this! In addition to what @isVoid raised, can you make it so that method is a kwarg that defaults to pearson so that we match the pandas API here? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

@isVoid isVoid added 0 - Waiting on Author Waiting for author to respond to review and removed 2 - In Progress Currently a work in progress labels Jan 15, 2021
@github-actions
Copy link

This PR has been marked stale due to no recent activity in the past 30d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
@dominicshanshan
Copy link
Contributor Author

dominicshanshan commented Feb 23, 2021

Thanks for contributing! Can you add unit tests for this method to match pandas output?

here is the unit test script for DataFrame output, will added series calculation as identical as Series.corr in pandas API in coming days + unit test with Series.

update: been discussed with @beckernick , modified a bit about unit test, re-upgrade the unit test script

unit_test_spearman_modify.zip

@isVoid
Copy link
Contributor

isVoid commented Feb 24, 2021

@dominicshanshan The unitest looks great! In cudf, we utilize pytest to organize our unit test process. For existing corr function, the coverage is defined in:

def test_df_corr():
gdf = randomdata(100, {str(x): float for x in range(50)})
pdf = gdf.to_pandas()
got = gdf.corr()
expected = pdf.corr()
assert_eq(got, expected)

It's likely you can combine the test_spearman and test_pearson into test_df_corr function by parameterizing the method argument. An example of parameterization is at:

@pytest.mark.parametrize("ddof", range(3))
def test_series_std(ddof):

where test_series_std will run 3 times with ddof parameterized as 0, 1 and 2.

It's also likely you will need

def assert_eq(left, right, **kwargs):

to compare the result. You can specify check_exact=False for float equality.

@dominicshanshan
Copy link
Contributor Author

dominicshanshan commented Feb 24, 2021

to compare the result. You can specify check_exact=False for float equality.

I tested, it seems if your test datatype is float, it will use assert_almost_equal directly, else will use assert_equal, and don't need pass check_exact=False

will follow you guide and give you pytest tomorrow

@isVoid
Copy link
Contributor

isVoid commented Feb 24, 2021

Awesome, and please change the base of this PR to branch-0.19.

@dominicshanshan
Copy link
Contributor Author

Awesome, and please change the base of this PR to branch-0.19.

please find pytest version of unit test for spearman correlation function, I will also finish this feature in Series and pytest, will change to 0.19 brench
pytest_spearman_corr.zip

@dominicshanshan dominicshanshan changed the base branch from branch-0.18 to branch-0.19 February 25, 2021 06:18
Copy link
Contributor Author

@dominicshanshan dominicshanshan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

black series.py passed

Copy link
Contributor Author

@dominicshanshan dominicshanshan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check again

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied some changes to avoid using external only APIs. The below are some new changes to reduce duplication. I think the PR is pretty close!

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
dominicshanshan and others added 3 commits March 22, 2022 13:03
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>
@codecov
Copy link

codecov bot commented Mar 22, 2022

Codecov Report

Merging #7141 (3ac1595) into branch-22.04 (4596244) will increase coverage by 0.03%.
The diff coverage is 95.93%.

❗ Current head 3ac1595 differs from pull request most recent head 29e31bf. Consider uploading reports for the commit 29e31bf to get more accurate results

@@               Coverage Diff                @@
##           branch-22.04    #7141      +/-   ##
================================================
+ Coverage         86.13%   86.17%   +0.03%     
================================================
  Files               139      141       +2     
  Lines             22438    22508      +70     
================================================
+ Hits              19328    19397      +69     
- Misses             3110     3111       +1     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/tests/test_accessor.py 98.41% <ø> (ø)
python/cudf/cudf/core/_base_index.py 85.92% <50.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/decimal.py 91.30% <73.68%> (-1.01%) ⬇️
python/cudf/cudf/core/column/categorical.py 89.63% <84.61%> (-0.29%) ⬇️
python/dask_cudf/dask_cudf/io/text.py 85.29% <85.29%> (ø)
python/cudf/cudf/core/column/string.py 88.91% <94.44%> (+0.64%) ⬆️
python/cudf/cudf/core/dataframe.py 93.54% <95.00%> (-0.04%) ⬇️
python/cudf/cudf/core/column/numerical.py 95.62% <95.83%> (+0.64%) ⬆️
python/cudf/cudf/core/frame.py 91.84% <96.42%> (+0.12%) ⬆️
python/cudf/cudf/_typing.py 94.11% <100.00%> (+0.78%) ⬆️
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 621d26f...29e31bf. Read the comment docs.

@dominicshanshan
Copy link
Contributor Author

I applied some changes to avoid using external only APIs. The below are some new changes to reduce duplication. I think the PR is pretty close!

@isVoid , appreciate your help!! why suggest to avoid use external only APIs?

@isVoid
Copy link
Contributor

isVoid commented Mar 22, 2022

@isVoid , appreciate your help!! why suggest to avoid use external only APIs?

Outside of the context of this PR, there are many use cases where we only want the column names and do some work with it. Internally cuDF stores column names as keys of the ColumnAccessor data structure. Constructing a pandas index often are just to mimic columns attribute of a pandas dataframe and is often unnecessary work. This work marks this usage in cuDF apis as an anti-pattern that we will check in tests.

In this PR, however, it's sort of an corner case that we do actually want to explicitly construct an index from the column names. Even so, we would use the to_pandas_index method to construct that so the developer's intent is more clearly expressed.

For more about ColumnAccessor, see: https://docs.rapids.ai/api/cudf/stable/basics/internals.html#columnaccessor

Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just one more comment that you can feel free to address at your own discretion.

python/cudf/cudf/core/series.py Show resolved Hide resolved
Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! great job and thanks!

@bdice
Copy link
Contributor

bdice commented Mar 23, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ce5bacb into rapidsai:branch-22.04 Mar 23, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 23, 2022
Follow-up work to fix documentation from #7141 before the 22.04 release.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - https://github.com/brandon-b-miller

URL: #10493
@dominicshanshan dominicshanshan changed the title Add 'spearman' correlation method for dataframe.corr Add 'spearman' correlation method for dataframe.corr and 'series.corr' Mar 24, 2022
@dominicshanshan dominicshanshan changed the title Add 'spearman' correlation method for dataframe.corr and 'series.corr' Add 'spearman' correlation method for dataframe.corr and series.corr Mar 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
9 participants