Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make isinstance check pass for proxy ndarrays #16601

Merged
merged 25 commits into from
Sep 5, 2024

Conversation

Matt711
Copy link
Contributor

@Matt711 Matt711 commented Aug 19, 2024

Description

Closes #14537.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added feature request New feature or request Python Affects Python cuDF API. 5 - DO NOT MERGE Hold off on merging; see PR for details non-breaking Non-breaking change cudf.pandas Issues specific to cudf.pandas labels Aug 19, 2024
@Matt711 Matt711 self-assigned this Aug 19, 2024
@Matt711 Matt711 requested a review from a team as a code owner August 19, 2024 18:57
@Matt711 Matt711 removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Aug 22, 2024
@rapidsai rapidsai deleted a comment from copy-pr-bot bot Aug 26, 2024
@Matt711
Copy link
Contributor Author

Matt711 commented Aug 26, 2024

/ok to test

@Matt711
Copy link
Contributor Author

Matt711 commented Aug 27, 2024

This PR fixes the two issues with #16286 (ie. the previous version of this PR that we reverted). The two issues were:

  1. In the __array_finalize__(self, obj), the _fsproxy_wrapped attribute is set equal to obj, which could be None. In this PR, there's a check if (obj is None): return to prevent that. I ran the cudf-pandas-integration tests, and it seems to fix the tests that were failing from the previous version of this PR. Once [FEA] Add third-party library integration testing of cudf.pandas to cudf #16645 is merged , we'll be able to see the third-party integration tests passing in cudf.

  2. In __new__, there was an initial DtoH transfer when a cp.ndarray was passed to the proxy array class. This PR doesn't include the DtoH transfer. Instead, in __new__, we're creating a "garbage" np.ndarray with the correct shape and dtype, but with incorrect/garbage data.

The fix for 2. leads to a problem which the only way I can think of solving is via monkeypatching. The problem is that now that our proxy array passes the check isinstance(proxy, np.ndarray), there are many numpy functions (which will call NumPy C functions) that will assume that they can use the array's buffer instead instead of going through a double underscore method like __array__. This is a problem because our proxy array's buffer has garbage data, so the function will produce garbage data.

Take np.asarray for an example. It will eventually lead to some like code like this. Our proxy array passes the PyArray_Check but gives incorrect data in the PyArray_DATA call.

if (PyArray_Check(obj)) {
    void* buffer = PyArray_DATA(obj) // access the buffer directly
}

I think the only way to avoid this problem from Python is to monkeypatch every function like np.asarray. So far I've found several functions we'll need to patch: np.asarray, np.where, np.outer, np.inner, and several other functions in the np.linalg module.

There is at least one other function in a third-party library which we'll need to patch (torch.from_numpy). This function function could be patched in the same way the numpy functions are.

Numba CPU dispatched functions also access the proxy arrays buffer. The good news is that eventually numba will support compiling objects which implement __array__ (see numba/9584). This bad news is that this probably won't be ready before 24.10. So in the meantime, we would have to let the user know that using proxy arrays directly in CPU dispatched functions will return incorrect results. If there's not a way to fail loudly in this case, then it should at least be documented. FYI, we currently fail when using proxy arrays in CPU dispatched functions because we fail isinstance check. So if its possible to continue failing, that's not the worst thing to until the next Numba release.

What do you all think of the monkey patching approach?

Another idea would be pay the cost of the DtoH transfer upfront (like in the previous PR). The DtoH is only done on instance creation. And then after that, the fast-slow proxy mechanism is used. The pro here is that the buffer will be set correctly, and thus no monkeypatching. The con is obviously the DtoH transfer.

Copy link

copy-pr-bot bot commented Aug 27, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Matt711
Copy link
Contributor Author

Matt711 commented Aug 27, 2024

/ok to test

@vyasr
Copy link
Contributor

vyasr commented Aug 27, 2024

Summary of offline discussion: monkey-patching numpy to make this work is a bridge too far without much stronger motivations, and we will probably just go with the eager D2H copy instead.

@Matt711
Copy link
Contributor Author

Matt711 commented Aug 31, 2024

Ping @vyasr for a review next week

@bdice
Copy link
Contributor

bdice commented Sep 3, 2024

@Matt711 The PR description says "do not merge" but there is no "DO NOT MERGE" label. Can you make this consistent?

Also for team knowledge, the "Description" section of the PR body is used in the final commit message when the PR is merged. Temporary information like the PR state or benchmarks are better to put in comments rather than the "Description" section.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of suggestions for improvement, but assuming you don't object to applying them I don't need to review again.

python/cudf/cudf/pandas/fast_slow_proxy.py Outdated Show resolved Hide resolved
python/cudf/cudf/pandas/proxy_base.py Outdated Show resolved Hide resolved
@Matt711 Matt711 requested a review from a team as a code owner September 4, 2024 22:24
@Matt711
Copy link
Contributor Author

Matt711 commented Sep 4, 2024

/ok to test

.github/workflows/pr.yaml Outdated Show resolved Hide resolved
@galipremsagar
Copy link
Contributor

/okay to test

@galipremsagar
Copy link
Contributor

/okay to test

@Matt711
Copy link
Contributor Author

Matt711 commented Sep 5, 2024

/merge

@rapids-bot rapids-bot bot merged commit e1ab1e7 into rapidsai:branch-24.10 Sep 5, 2024
85 checks passed
rjzamora pushed a commit to rjzamora/cudf that referenced this pull request Sep 6, 2024
rapids-bot bot pushed a commit that referenced this pull request Sep 6, 2024
The torch test should no longer fail after #16601.

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #16705
rapids-bot bot pushed a commit that referenced this pull request Sep 6, 2024
Proxy numpy arrays now instances of real numpy arrays (#16601), so libraries (eg. numba, torch) which utilize NumPy's C API should now be able to use proxy arrays. This PR updates the cudf.pandas documentation to reflect this.

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #16697
res-life pushed a commit to res-life/cudf that referenced this pull request Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team cudf.pandas Issues specific to cudf.pandas feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Proxy ndarrays don't pass an instancecheck() for np.ndarray
5 participants