Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent internal usage of expensive APIs #10263

Merged
merged 8 commits into from
Feb 16, 2022

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Feb 10, 2022

This PR introduces a decorator that can be used to mark APIs that are unnecessarily expensive and should be avoided in internal usage. The decorator does nothing by default, but when the corresponding environment variable is set the decorator wraps decorated functions in a check of the current call stack. If the caller of that API is within the cudf source tree, it will raise an exception that may optionally also indicate an alternate API for developers to use instead. This implementation is completely transparent to users (including having no performance impact aside from some negligible additional import time from applying the decorator) but can be an invaluable tool to developers seeking to reduce Python overheads in cuDF Python. This decorator has been tested and shown to work (both locally and on CI) for DataFrame.columns, but those changes are not included in this PR in order to keep the changeset clean.

@vyasr vyasr added feature request New feature or request 2 - In Progress Currently a work in progress Python Affects Python cuDF API. Performance Performance related issue non-breaking Non-breaking change labels Feb 10, 2022
@vyasr vyasr added this to the CuDF Python Refactoring milestone Feb 10, 2022
@vyasr vyasr self-assigned this Feb 10, 2022
@github-actions github-actions bot added the gpuCI label Feb 10, 2022
@codecov
Copy link

codecov bot commented Feb 10, 2022

Codecov Report

Merging #10263 (cc72147) into branch-22.04 (a7d88cd) will increase coverage by 0.23%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10263      +/-   ##
================================================
+ Coverage         10.42%   10.66%   +0.23%     
================================================
  Files               119      122       +3     
  Lines             20603    20898     +295     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18670     +215     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <ø> (ø)
python/cudf/cudf/_version.py 0.00% <ø> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/methods.py 0.00% <ø> (ø)
... and 62 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f263820...cc72147. Read the comment docs.

@shwina
Copy link
Contributor

shwina commented Feb 10, 2022

This is cool!

I'm wondering about its applicability beyond .columns though -- are there a significant number of other user-facing APIs that we both (1) tend to use internally, and (2) can easily replace with a call to an internal API?

@vyasr
Copy link
Contributor Author

vyasr commented Feb 10, 2022

This is cool!

I'm wondering about its applicability beyond .columns though -- are there a significant number of other user-facing APIs that we both (1) tend to use internally, and (2) can easily replace with a call to an internal API?

That's a great question. I don't anticipate this being terribly broadly applicable, but there are enough use cases for it to be valuable I think. A couple that come to mind:

  • DataFrame.dtypes generates a pandas.Series and I created a simpler internal equivalent DataFrame._dtypes a while ago that just provides a dictionary. Here's the relative performance:
In [3]: %timeit df1.dtypes
129 µs ± 636 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: %timeit df1._dtypes
1.37 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
  • The constructors of all Frame types should almost never be used internally. I'm not sure we can forbid these entirely since there may be special cases where we need to support odd user-provided inputs, but at minimum this new decorator could be used locally to identify cases where we could instead use _from_data or something like it (which we know is much faster).
  • The user-facing join API is very expensive, but we still use it internally for index alignment all over the place. Addressing [FEA] Refactor cuDF Python merging logic for index types #9978 will help reduce the cost, but even with that I suspect that we will want to write a simpler internal-facing API.
  • Similarly, I think we will want to define a simpler internal version of concat, which is extremely complex. @galipremsagar can probably speak to how frequently it shows up in profiles. There is a lot of extra logic there (similar to joins) that could probably be excised from a simpler internal implementation.

It's possible (I would even say likely...) that there are other slow APIs that we just aren't aware of right now and might turn up upon more careful benchmarking, or APIs that seem unavoidable internally but might actually be more avoidable than we think. For instance, could we get rid of DataFrame.__getitem__ calls entirely? I doubt it, but I think we can at least replace a lot of them with calls to df._data.__getitem__, which is much faster.

@brandon-b-miller
Copy link
Contributor

This is cool!

I'm wondering about its applicability beyond .columns though -- are there a significant number of other user-facing APIs that we both (1) tend to use internally, and (2) can easily replace with a call to an internal API?

Another common one is use of DataFrame.__getitem__ when looping through a dataframe columns rather than accessing via _data.

@vyasr vyasr force-pushed the feature/avoid_internal_apis branch from 16182bf to d7cc9cb Compare February 10, 2022 17:16
@quasiben
Copy link
Member

@vyasr
Copy link
Contributor Author

vyasr commented Feb 10, 2022

Setting the env var for devs could be done with conda env.yml file: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#setting-environment-variables

Thanks for the find @quasiben. I ended up going with a pytest-based approach so that tests will always be run this way.

@vyasr vyasr force-pushed the feature/avoid_internal_apis branch from 064b1f1 to 05dc395 Compare February 15, 2022 19:58
ci/gpu/build.sh Show resolved Hide resolved
@vyasr vyasr removed the 2 - In Progress Currently a work in progress label Feb 15, 2022
@vyasr vyasr added the 3 - Ready for Review Ready for review by team label Feb 15, 2022
@vyasr vyasr marked this pull request as ready for review February 15, 2022 20:05
@vyasr vyasr requested review from a team as code owners February 15, 2022 20:05
python/cudf/cudf/tests/conftest.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/conftest.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/utils.py Outdated Show resolved Hide resolved
@vyasr vyasr requested a review from bdice February 15, 2022 22:20
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Note for the future: I prefer to review PRs like this with an example for context. I would have appreciated seeing this applied to a single function so that I could better rationalize about the design. I saw your comment about it being tested and wanting to keep the changeset clean, but as a reviewer I can't prove that this works as claimed unless it is used at least once.

As a middle ground of "clean changeset" and "design in context," you could include a code example in the PR description or in a comment instead of in the changeset. For this specific case, the decorator is fairly trivial, though, so not worth worrying about an example.

@vyasr
Copy link
Contributor Author

vyasr commented Feb 16, 2022

Looks good. Note for the future: I prefer to review PRs like this with an example for context. I would have appreciated seeing this applied to a single function so that I could better rationalize about the design. I saw your comment about it being tested and wanting to keep the changeset clean, but as a reviewer I can't prove that this works as claimed unless it is used at least once.

As a middle ground of "clean changeset" and "design in context," you could include a code example in the PR description or in a comment instead of in the changeset. For this specific case, the decorator is fairly trivial, though, so not worth worrying about an example.

Unfortunately you're just a bit late to the game. This branch originally did contain an example of applying this to DataFrame.columns and we discussed these changes at some length in our team meeting last week. I only rebased those changes out this morning because in the meeting we agreed that it would be useful to be have this changeset be completely independent in the event that we decide to undo it for some reason. I probably should have removed your request for review and requested one of the others who had a chance to review the original version of this PR, I agree that this is tricky tricky to review in a vacuum. Fortunately @brandon-b-miller (the other requested reviewer) did see and comment on the original design. If you don't feel confident in your review, please request an additional review from someone who commented above since they should have had a chance to at least skim the code with the example.

@bdice
Copy link
Contributor

bdice commented Feb 16, 2022

Haha, that's fine then! I didn't know the context, my apologies. I definitely feel confident enough in the PR to count my approval here.

Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving ops-codeowner file changes

@vyasr
Copy link
Contributor Author

vyasr commented Feb 16, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 895b007 into rapidsai:branch-22.04 Feb 16, 2022
rapids-bot bot pushed a commit that referenced this pull request Feb 24, 2022
This PR uses the feature introduced in #10263 to remove the usage of DataFrame.columns where possible.

Currently this removes a large number of internal uses but not all of them to verify that CI does indeed fail as expected since some changes were made on #10263 after the last test of CI by reviewer request. I will convert this from a draft once I see CI fail and once all the necessary changes have been made for tests to pass.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Michael Wang (https://github.com/isVoid)

URL: #10315
@vyasr vyasr deleted the feature/avoid_internal_apis branch March 9, 2022 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants