
[Experimental] Use nosync policy for Thrust calls. #11577

Closed
wants to merge 5 commits

Conversation

bdice
Contributor

@bdice bdice commented Aug 22, 2022

Description

This PR draft is an experiment using rmm::exec_policy_nosync to call all Thrust algorithms with the thrust::cuda::par_nosync execution policy. This removes many instances of stream synchronization in Thrust, except when required for correctness (e.g. if the algorithm returns a value to the host, a sync is required).
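In sketch form, the substitution made throughout 946cf5d looks like the following (the surrounding call is illustrative, not taken from the diff):

```cuda
#include <rmm/exec_policy.hpp>
#include <thrust/transform.h>

// Before: rmm::exec_policy wraps thrust::cuda::par, which may
// synchronize `stream` after the algorithm completes.
thrust::transform(rmm::exec_policy(stream),
                  in.begin(), in.end(), out.begin(), op);

// After: rmm::exec_policy_nosync wraps thrust::cuda::par_nosync, which
// skips the trailing synchronization. This is only safe while no
// host-side code reads results or releases inputs before a later sync
// on `stream`.
thrust::transform(rmm::exec_policy_nosync(stream),
                  in.begin(), in.end(), out.begin(), op);
```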

At present, two commits have been benchmarked. 946cf5d directly replaces all instances of the execution policy with exec_policy_nosync. This could lead to unsynced streams when the libcudf public API returns for some functions, but shows a clear performance benefit for small data sizes in the benchmarks: https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70

Commit 2552c4c adds a manual stream synchronization at the end of every public API. This is guaranteed to be correct but the final stream sync may not be necessary for all APIs if the detail API already synced, leading to lower performance for some APIs in the benchmarks: https://gist.github.com/bdice/4ade40a2e66d555fb8edc85f78eec0a2
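The pattern in 2552c4c amounts to the following sketch (function signature abbreviated, names illustrative):

```cuda
// Public API wrapper: the detail implementation runs entirely with
// nosync execution policies, and the public function synchronizes once
// before returning control to the caller.
std::unique_ptr<cudf::column> gather(/*...*/, rmm::cuda_stream_view stream)
{
  auto result = detail::gather(/*...*/, stream);  // Thrust calls use par_nosync
  stream.synchronize();  // single final sync; redundant if the detail
                         // API already synced (the source of the
                         // regressions noted above)
  return result;
}
```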

I don't intend for this PR to be merged (or reviewed as-is) at this point -- there are better designs for managing syncs that we could explore like RAII, I have some internal refactors I'd like to make before engaging in such a large refactor, and it should certainly be done in pieces -- this PR is currently just a way to share preliminary data and start a discussion for improved stream handling.

A few notes:

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Aug 22, 2022
@bdice bdice added improvement Improvement / enhancement to an existing function breaking Breaking change 5 - DO NOT MERGE Hold off on merging; see PR for details labels Aug 22, 2022
@kkraus14
Collaborator

Commit 2552c4c adds a manual stream synchronization at the end of every public API. This is guaranteed to be correct but the final stream sync may not be necessary for all APIs if the detail API already synced, leading to lower performance for some APIs in the benchmarks: https://gist.github.com/bdice/4ade40a2e66d555fb8edc85f78eec0a2

Historically, I believe we decided that libcudf shouldn't do any stream synchronization except what is required for correctness, or when returning data to host memory that typical CPU code could otherwise erroneously operate on without synchronizing.

In general I'm an enthusiastic +1 to the idea of this PR.

@codecov

codecov bot commented Aug 22, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@288c81f). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 2552c4c differs from pull request most recent head 048f026. Consider uploading reports for the commit 048f026 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11577   +/-   ##
===============================================
  Coverage                ?   86.40%           
===============================================
  Files                   ?      145           
  Lines                   ?    22958           
  Branches                ?        0           
===============================================
  Hits                    ?    19837           
  Misses                  ?     3121           
  Partials                ?        0           


rapids-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling (#11600)

This PR is derived from changes I made in #11577 while attempting to consolidate stream handling in public APIs. During that refactoring, I noticed three repeated problems across libcudf APIs that I have addressed in this PR. These refactors will make future work on streams much more straightforward as well as increase consistency and quality in the library.

1. Some APIs were putting too much implementation in a public method. I split these so that the public/detail balance is consistent with the rest of libcudf.
2. A number of public APIs were missing `CUDF_FUNC_RANGE`, making it difficult to recognize those functions in profiles (cc: @GregoryKimball).
3. Stream handling was not consistent, with some functions not using the `stream` they were passed and using `cudf::default_stream_value` instead.
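The three fixes above converge on one public/detail pattern, sketched here with an illustrative function name (not a specific API from the diff):

```cuda
namespace cudf {
namespace detail {
std::unique_ptr<column> reverse(column_view const& input,
                                rmm::cuda_stream_view stream,
                                rmm::mr::device_memory_resource* mr)
{
  // Implementation lives here; every kernel launch and Thrust call
  // takes `stream`, never cudf::default_stream_value.
  // ...
}
}  // namespace detail

std::unique_ptr<column> reverse(column_view const& input,
                                rmm::mr::device_memory_resource* mr)
{
  CUDF_FUNC_RANGE();  // NVTX range so this API is visible in profiles
  // Thin public wrapper: supplies the default stream and delegates.
  return detail::reverse(input, cudf::default_stream_value, mr);
}
}  // namespace cudf
```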

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11600
@jrhemstad
Contributor

Historically, I believe we decided that libcudf shouldn't do any stream synchronization except what is required for correctness, or when returning data to host memory that typical CPU code could otherwise erroneously operate on without synchronizing.

That's correct, but we've also been extraordinarily lax in our stream sync safety. I'm pretty confident that there are many places where we're copying from host memory asynchronously without synchronizing to ensure the host object is still valid before the copy completes. I think the only thing saving us are the facts that we're copying from pageable memory and that Thrust is injecting a bunch of artificial syncs for us.
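The hazard in minimal form (a deliberately buggy sketch, not code from libcudf):

```cuda
{
  std::vector<int> host_vals = make_offsets();  // hypothetical helper;
                                                // pageable host memory
  cudaMemcpyAsync(d_ptr, host_vals.data(),
                  host_vals.size() * sizeof(int),
                  cudaMemcpyHostToDevice, stream.value());
}  // host_vals is destroyed here. Without a sync on `stream` (or a
   // Thrust call that injects one) before this point, the copy may
   // still be reading freed memory. Pageable H2D copies are staged
   // through a driver buffer today, which tends to mask the bug.
```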

Eliminating all artificial syncs in favor of putting a single one at the end is a reasonable step forward, but getting to a point where every libcudf API is 100% safe for asynchrony with the absolute minimum number of syncs is going to be a lot of work.

@bdice
Contributor Author

bdice commented Sep 1, 2022

I wanted to add some notes from offline discussions, for the record -- and to provide some forward guidance on how this might be resolved.

First, a few performance highlights that motivate this work:

  • Gather/double_coalesce_o/1024/8/manual_time takes 58% less time (gather across 8 columns with 1024 rows) with no syncs, or 47% less time with a single sync at the end, because all those independent column operations can be batched without the overhead of syncing.
  • Concatenate/BM_concatenate_tables_nullable_false/256/8/2/manual_time takes 67% less time with no syncs, or 61% less time with a single sync at the end.
  • Search/ColumnContains_AllValid/1024/manual_time takes 48% less time with no syncs, or 41% less time with a single sync at the end.

Above, I noted the performance of two different modes: "no syncs" and "single final sync." To know whether a final sync (or intermediate syncs) are required for correctness, we must do manual analysis rather than the fully-automated batch refactor I took in this PR. That is why this PR is experimental and cannot be merged directly -- it was just a way to find what could be improved by reducing the number of syncs, assuming that stream safety is never an issue (which is an unsafe assumption). We have to do manual analysis to prevent the kind of potential problems that @jrhemstad mentioned:

I'm pretty confident that there are many places where we're copying from host memory asynchronously without synchronizing to ensure the host object is still valid before the copy completes. I think the only thing saving us are the facts that we're copying from pageable memory and that Thrust is injecting a bunch of artificial syncs for us.

The consensus among those I've spoken with (@jrhemstad, @davidwendt, and developers on Spark/Python teams) seems to be that the performance improvements of fewer syncs for small data sizes would be a worthwhile change. However, it will require manual analysis of stream correctness in each API, and there are no tools that can help us automate this process to make it faster than the manual task of "thinking about it." Edge cases such as ensuring host memory is async-copied before the end of its lifetime are crucial to catch for correctness.

Eliminating all artificial syncs in favor of putting a single one at the end is a reasonable step forward, but getting to a point where every libcudf API is 100% safe for asynchrony with the absolute minimum number of syncs is going to be a lot of work.

Agreed, it is a large undertaking. My goal for the "manual analysis" is to pick a few APIs that show a substantial performance increase in the benchmarks above, and manually verify which Thrust calls can safely use a nosync policy. For the short term, we are not planning to remove all unnecessary syncs. We're only going to work on cases that have a clear performance benefit, like those mentioned above.

Next steps:

  • Adopt some smaller refactors of detail/public stream handling from my experimental branch into branch-22.10 to ease the process. (This was completed in Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. #11600.)
  • Prioritize a few APIs for manually analyzing stream behavior and enabling nosync. Gather and scatter are likely to be first due to their relative simplicity and strong performance improvements from nosync.
  • Add docs about libcudf's promises about stream syncs in both detail and public APIs.

rapids-bot bot pushed a commit that referenced this pull request Sep 6, 2022
Fixes some calls that were not passing the stream variable to detail functions.
Found these while looking into improvements for #11577

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mark Harris (https://github.com/harrism)

URL: #11642
rapids-bot bot pushed a commit that referenced this pull request Sep 7, 2022

Adds calls to `cudf::column::set_null_count()` when the null count is known.
Found these while looking into improvements for #11577.
There are several ways to build a `cudf::column` object for return. Often the column is created and then filled in by calling `cudf::column::mutable_view()` and writing through the resulting `mutable_view` object. The `cudf::column::mutable_view()` function has a side effect: it invalidates the column's cached null count. This is done for efficiency, so the null count is only computed when specifically requested through `cudf::column::null_count()`; computing it there requires a kernel launch. However, in several places the null count is already known before the column is returned, and setting it means a later call to `cudf::column::null_count()` does not need to compute it.
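The pattern described above, sketched with hypothetical helpers (`fill_column`, `known_null_count` are illustrative):

```cuda
auto col = cudf::make_numeric_column(dtype, size, std::move(null_mask),
                                     cudf::UNKNOWN_NULL_COUNT, stream, mr);
auto mview = col->mutable_view();  // side effect: cached null count is
                                   // invalidated
fill_column(mview, stream);        // hypothetical kernel filling values

// The null count is known here, so record it; a later call to
// col->null_count() then returns the cached value instead of launching
// a count_unset_bits kernel.
col->set_null_count(known_null_count);
return col;
```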

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/nvdbaranec

URL: #11646
@github-actions

github-actions bot commented Oct 1, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this pull request Nov 7, 2022
This PR uses `rmm::exec_policy_nosync` in libcudf's gather and scatter functions. These changes are motivated by performance improvements seen previously in #11577.

# Checklist
- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12038
@bdice
Contributor Author

bdice commented Nov 17, 2022

I'm closing this PR. The major findings from this experiment and next steps are documented in issue #12086.

I believe most of the documentation tasks mentioned above regarding stream safety are now in the Developer Guide section: "Treat libcudf APIs as if they were asynchronous."

@bdice bdice closed this Nov 17, 2022