Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] need to support cudf.DataFrame([[‘a’,‘b’]]).corr() when set ‘method=spearman’, we only support ‘method=pearson’ so far #6804

Closed
dominicshanshan opened this issue Nov 19, 2020 · 5 comments · Fixed by #7141
Assignees
Labels
feature request New feature or request good first issue Good for newcomers Python Affects Python cuDF API.

Comments

@dominicshanshan
Copy link
Contributor

Is your feature request related to a problem? Please describe.
from time series analysis, need to support spearman correlation matrix calculation in cuDF

Describe the solution you'd like
similar like pandas.DataFrame().corr(method='spearman')

Additional context
is it possible to let me know this feature adding roadmap? big thanks !

@dominicshanshan dominicshanshan added Needs Triage Need team to review and classify feature request New feature or request labels Nov 19, 2020
@beckernick
Copy link
Member

beckernick commented Nov 19, 2020

This would be a welcome addition. This could be implemented (in cuDF Python) by combining series.rank() with the existing pearson correlation. A libcudf implementation would probably want to start with Pearson correlation.

Would you be interested in contributing this?

@beckernick beckernick added Python Affects Python cuDF API. good first issue Good for newcomers and removed Needs Triage Need team to review and classify labels Nov 19, 2020
@dominicshanshan
Copy link
Contributor Author

love to .... but need your guidance ;) @beckernick

@beckernick
Copy link
Member

beckernick commented Nov 20, 2020

Pearson correlation is currently the implementation of correlation in cuDF Python currently. Spearman correlation is defined mathematically as the pearson correlation of the ranks of the columns. Rank is implemented currently. This means we have everything needed in cuDF Python.

There are several ways to do this.

One way would be to use a branching statement inside corr to check whether the method is "spearman". If it is, call rank on both columns before continuing on to call cov. Another way might be to implement two "private" functions, _corr_pearson and _corr_spearman. _corr_pearson could be the existing implementation, and _corr_spearman would use rank and then call _corr_pearson. The corr function could then delegate appropriately depending on the method argument. These could even be sub-functions inside corr, since it's unlikely they will be used elsewhere. Either way, you could update the numericalcolumn and series level corr function to allow for spearman.

The dataframe implementation would need to be tweaked less, as we use CuPy's pearson correlation for efficiency. So you'd need to pass self.rank().values and support the method argument.

def corr(self, other):
if len(self) == 0 or len(other) == 0:
return cudf.utils.dtypes._get_nan_for_dtype(self.dtype)
cov = self.cov(other)
lhs_std, rhs_std = self.std(), other.std()
if not cov or lhs_std == 0 or rhs_std == 0:
return cudf.utils.dtypes._get_nan_for_dtype(self.dtype)
return cov / lhs_std / rhs_std

def corr(self):
"""Compute the correlation matrix of a DataFrame.
"""
corr = cupy.corrcoef(self.values, rowvar=False)
df = DataFrame(cupy.asfortranarray(corr)).set_index(self.columns)
df.columns = self.columns
return df

dominicshanshan added a commit to dominicshanshan/cudf that referenced this issue Jan 14, 2021
…rame-add-spearman

[REVIEW]Add 'spearman' correlation matrix in dataframe.py rapidsai#6804
@github-actions
Copy link

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@dominicshanshan
Copy link
Contributor Author

need more time to finish, holiday season in China ;)

rapids-bot bot pushed a commit that referenced this issue Mar 23, 2022
Closes #6804

Adds 'spearman' correlation method for `dataframe.corr`

Authors:
  - https://github.com/dominicshanshan
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Nick Becker (https://github.com/beckernick)
  - https://github.com/brandon-b-miller

URL: #7141
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request good first issue Good for newcomers Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants