
[Experimental] Use nosync policy for Thrust calls. #11577

Closed
wants to merge 5 commits

Conversation

bdice
Contributor

@bdice bdice commented Aug 22, 2022

Description

This PR draft is an experiment using rmm::exec_policy_nosync to call all Thrust algorithms with the thrust::cuda::par_nosync execution policy. This removes many instances of stream synchronization in Thrust, except when required for correctness (e.g. if the algorithm returns a value to the host, a sync is required).
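In sketch form, the substitution made throughout 946cf5d looks like the following (the surrounding call is illustrative, not taken from the diff):

```cuda
#include <rmm/exec_policy.hpp>
#include <thrust/transform.h>

// Before: rmm::exec_policy wraps thrust::cuda::par, which may
// synchronize `stream` after the algorithm completes.
thrust::transform(rmm::exec_policy(stream),
                  in.begin(), in.end(), out.begin(), op);

// After: rmm::exec_policy_nosync wraps thrust::cuda::par_nosync, which
// skips the trailing synchronization. This is only safe while no
// host-side code reads results or releases inputs before a later sync
// on `stream`.
thrust::transform(rmm::exec_policy_nosync(stream),
                  in.begin(), in.end(), out.begin(), op);
```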

At present, two commits have been benchmarked. 946cf5d directly replaces all instances of the execution policy with exec_policy_nosync. This could lead to unsynced streams when the libcudf public API returns for some functions, but shows a clear performance benefit for small data sizes in the benchmarks: https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70

Commit 2552c4c adds a manual stream synchronization at the end of every public API. This is guaranteed to be correct but the final stream sync may not be necessary for all APIs if the detail API already synced, leading to lower performance for some APIs in the benchmarks: https://gist.github.com/bdice/4ade40a2e66d555fb8edc85f78eec0a2
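The pattern in 2552c4c amounts to the following sketch (function signature abbreviated, names illustrative):

```cuda
// Public API wrapper: the detail implementation runs entirely with
// nosync execution policies, and the public function synchronizes once
// before returning control to the caller.
std::unique_ptr<cudf::column> gather(/*...*/, rmm::cuda_stream_view stream)
{
  auto result = detail::gather(/*...*/, stream);  // Thrust calls use par_nosync
  stream.synchronize();  // single final sync; redundant if the detail
                         // API already synced (the source of the
                         // regressions noted above)
  return result;
}
```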

I don't intend for this PR to be merged (or reviewed as-is) at this point -- there are better designs for managing syncs that we could explore like RAII, I have some internal refactors I'd like to make before engaging in such a large refactor, and it should certainly be done in pieces -- this PR is currently just a way to share preliminary data and start a discussion for improved stream handling.

A few notes:

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Aug 22, 2022
@bdice bdice added improvement Improvement / enhancement to an existing function breaking Breaking change 5 - DO NOT MERGE Hold off on merging; see PR for details labels Aug 22, 2022
@kkraus14
Collaborator

Commit 2552c4c adds a manual stream synchronization at the end of every public API. This is guaranteed to be correct but the final stream sync may not be necessary for all APIs if the detail API already synced, leading to lower performance for some APIs in the benchmarks: https://gist.github.com/bdice/4ade40a2e66d555fb8edc85f78eec0a2

Historically, I believe we decided that libcudf shouldn't do any stream synchronization except what is required for correctness, or when returning data to host memory that typical CPU code could otherwise erroneously operate on without synchronizing.

In general I'm an enthusiastic +1 to the idea of this PR.

@codecov

codecov bot commented Aug 22, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@288c81f). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 2552c4c differs from pull request most recent head 048f026. Consider uploading reports for the commit 048f026 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11577   +/-   ##
===============================================
  Coverage                ?   86.40%           
===============================================
  Files                   ?      145           
  Lines                   ?    22958           
  Branches                ?        0           
===============================================
  Hits                    ?    19837           
  Misses                  ?     3121           
  Partials                ?        0           


rapids-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling (#11600)

This PR is derived from changes I made in #11577 while attempting to consolidate stream handling in public APIs. During that refactoring, I noticed three repeated problems across libcudf APIs that I have addressed in this PR. These refactors will make future work on streams much more straightforward as well as increase consistency and quality in the library.

1. Some APIs were putting too much implementation in a public method. I split these so that the public/detail balance is consistent with the rest of libcudf.
2. A number of public APIs were missing `CUDF_FUNC_RANGE`, making it difficult to recognize those functions in profiles (cc: @GregoryKimball).
3. Stream handling was not consistent, with some functions not using the `stream` they were passed and using `cudf::default_stream_value` instead.
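The three fixes above converge on one public/detail pattern, sketched here with an illustrative function name (not a specific API from the diff):

```cuda
namespace cudf {
namespace detail {
std::unique_ptr<column> reverse(column_view const& input,
                                rmm::cuda_stream_view stream,
                                rmm::mr::device_memory_resource* mr)
{
  // Implementation lives here; every kernel launch and Thrust call
  // takes `stream`, never cudf::default_stream_value.
  // ...
}
}  // namespace detail

std::unique_ptr<column> reverse(column_view const& input,
                                rmm::mr::device_memory_resource* mr)
{
  CUDF_FUNC_RANGE();  // NVTX range so this API is visible in profiles
  // Thin public wrapper: supplies the default stream and delegates.
  return detail::reverse(input, cudf::default_stream_value, mr);
}
}  // namespace cudf
```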

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11600
@jrhemstad
Contributor

Historically, I believe we decided that libcudf shouldn't do any stream synchronization except what is required for correctness, or when returning data to host memory that typical CPU code could otherwise erroneously operate on without synchronizing.

That's correct, but we've also been extraordinarily lax in our stream sync safety. I'm pretty confident that there are many places where we're copying from host memory asynchronously without synchronizing to ensure the host object is still valid before the copy completes. I think the only thing saving us are the facts that we're copying from pageable memory and that Thrust is injecting a bunch of artificial syncs for us.
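The hazard in minimal form (a deliberately buggy sketch, not code from libcudf):

```cuda
{
  std::vector<int> host_vals = make_offsets();  // hypothetical helper;
                                                // pageable host memory
  cudaMemcpyAsync(d_ptr, host_vals.data(),
                  host_vals.size() * sizeof(int),
                  cudaMemcpyHostToDevice, stream.value());
}  // host_vals is destroyed here. Without a sync on `stream` (or a
   // Thrust call that injects one) before this point, the copy may
   // still be reading freed memory. Pageable H2D copies are staged
   // through a driver buffer today, which tends to mask the bug.
```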

Eliminating all artificial syncs in favor of putting a single one at the end is a reasonable step forward, but getting to a point where every libcudf API is 100% safe for asynchrony with the absolute minimum number of syncs is going to be a lot of work.

@bdice
Contributor Author

bdice commented Sep 1, 2022

I wanted to add some notes from offline discussions, for the record -- and to provide some forward guidance on how this might be resolved.

First, a few performance highlights that motivate this work:

  • Gather/double_coalesce_o/1024/8/manual_time takes 58% less time (gather across 8 columns with 1024 rows) with no syncs, or 47% less time with a single sync at the end, because all those independent column operations can be batched without the overhead of syncing.
  • Concatenate/BM_concatenate_tables_nullable_false/256/8/2/manual_time takes 67% less time with no syncs, or 61% less time with a single sync at the end.
  • Search/ColumnContains_AllValid/1024/manual_time takes 48% less time with no syncs, or 41% less time with a single sync at the end.

Above, I noted the performance of two different modes: "no syncs" and "single final sync." To know whether a final sync (or intermediate syncs) are required for correctness, we must do manual analysis rather than the fully-automated batch refactor I took in this PR. That is why this PR is experimental and cannot be merged directly -- it was just a way to find what could be improved by reducing the number of syncs, assuming that stream safety is never an issue (which is an unsafe assumption). We have to do manual analysis to prevent the kind of potential problems that @jrhemstad mentioned:

I'm pretty confident that there are many places where we're copying from host memory asynchronously without synchronizing to ensure the host object is still valid before the copy completes. I think the only thing saving us are the facts that we're copying from pageable memory and that Thrust is injecting a bunch of artificial syncs for us.

The consensus among those I've spoken with (@jrhemstad, @davidwendt, and developers on Spark/Python teams) seems to be that the performance improvements of fewer syncs for small data sizes would be a worthwhile change. However, it will require manual analysis of stream correctness in each API, and there are no tools that can help us automate this process to make it faster than the manual task of "thinking about it." Edge cases such as ensuring host memory is async-copied before the end of its lifetime are crucial to catch for correctness.

Eliminating all artificial syncs in favor of putting a single one at the end is a reasonable step forward, but getting to a point where every libcudf API is 100% safe for asynchrony with the absolute minimum number of syncs is going to be a lot of work.

Agreed, it is a large undertaking. My goal for the "manual analysis" is to pick a few APIs that show a substantial performance increase in the benchmarks above, and manually verify which Thrust calls can safely use a nosync policy. For the short term, we are not planning to remove all unnecessary syncs. We're only going to work on cases that have a clear performance benefit, like those mentioned above.

Next steps:

  • Adopt some smaller refactors of detail/public stream handling from my experimental branch into branch-22.10 to ease the process. (This was completed in Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. #11600.)
  • Prioritize a few APIs for manually analyzing stream behavior and enabling nosync. Gather and scatter are likely to be first due to their relative simplicity and strong performance improvements from nosync.
  • Add docs about libcudf's promises about stream syncs in both detail and public APIs.

rapids-bot bot pushed a commit that referenced this pull request Sep 6, 2022
Fixes some calls that were not passing the stream variable to detail functions.
Found these while looking into improvements for #11577

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mark Harris (https://github.com/harrism)

URL: #11642
rapids-bot bot pushed a commit that referenced this pull request Sep 7, 2022

Adds calls to `cudf::column::set_null_count()` when the null count is known.
Found these while looking into improvements for #11577.
There are several ways to build a `cudf::column` object for return. Often the column is created and then filled in by calling `cudf::column::mutable_view()` and writing through the resulting `mutable_view` object. The `cudf::column::mutable_view()` function has a side effect: it invalidates the column's cached null count. This is done for efficiency, so the null count is only computed when specifically requested through `cudf::column::null_count()`; computing it there requires a kernel launch. However, in several places the null count is already known before the column is returned, and setting it means a later call to `cudf::column::null_count()` does not need to compute it.
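The pattern described above, sketched with hypothetical helpers (`fill_column`, `known_null_count` are illustrative):

```cuda
auto col = cudf::make_numeric_column(dtype, size, std::move(null_mask),
                                     cudf::UNKNOWN_NULL_COUNT, stream, mr);
auto mview = col->mutable_view();  // side effect: cached null count is
                                   // invalidated
fill_column(mview, stream);        // hypothetical kernel filling values

// The null count is known here, so record it; a later call to
// col->null_count() then returns the cached value instead of launching
// a count_unset_bits kernel.
col->set_null_count(known_null_count);
return col;
```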

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/nvdbaranec

URL: #11646
@github-actions

github-actions bot commented Oct 1, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this pull request Nov 7, 2022
This PR uses `rmm::exec_policy_nosync` in libcudf's gather and scatter functions. These changes are motivated by performance improvements seen previously in #11577.

# Checklist
- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12038
@bdice
Contributor Author

bdice commented Nov 17, 2022

I'm closing this PR. The major findings from this experiment and next steps are documented in issue #12086.

I believe most of the documentation tasks mentioned above regarding stream safety are now in the Developer Guide section: "Treat libcudf APIs as if they were asynchronous."

@bdice bdice closed this Nov 17, 2022