Disable opportunistic reuse in async mr when cuda driver < 11.5 (#993)
While investigating NVIDIA/spark-rapids#4710, we found issues with the async pool that may cause memory errors with older drivers; the CUDA team confirmed this. For driver versions < 11.5, we disable `cudaMemPoolReuseAllowOpportunistic`.

@abellina

Authors:
  - Rong Ou (https://github.com/rongou)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Mark Harris (https://github.com/harrism)
  - Leo Fang (https://github.com/leofang)

URL: #993
rongou authored Mar 16, 2022
1 parent 3992c3c commit 438d312
Showing 1 changed file with 12 additions and 0 deletions.
12 changes: 12 additions & 0 deletions include/rmm/mr/device/cuda_async_memory_resource.hpp
@@ -73,6 +73,18 @@ class cuda_async_memory_resource final : public device_memory_resource {
pool_props.location.id = rmm::detail::current_device().value();
RMM_CUDA_TRY(cudaMemPoolCreate(&cuda_pool_handle_, &pool_props));

// CUDA drivers before 11.5 have known incompatibilities with the async allocator.
// We'll disable `cudaMemPoolReuseAllowOpportunistic` if cuda driver < 11.5.
// See https://github.com/NVIDIA/spark-rapids/issues/4710.
int driver_version{};
RMM_CUDA_TRY(cudaDriverGetVersion(&driver_version));
constexpr auto min_async_version{11050};
if (driver_version < min_async_version) {
  int disabled{0};
  RMM_CUDA_TRY(
    cudaMemPoolSetAttribute(cuda_pool_handle_, cudaMemPoolReuseAllowOpportunistic, &disabled));
}

auto const [free, total] = rmm::detail::available_device_memory();

// Need an l-value to take address to pass to cudaMemPoolSetAttribute
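
The threshold `11050` in the diff follows CUDA's version encoding, in which `cudaDriverGetVersion` reports `1000 * major + 10 * minor`, so `11050` corresponds to 11.5. As a minimal sketch of that comparison, runnable without a GPU, the snippet below decodes the integer and applies the same check. The names `cuda_version`, `decode_cuda_version`, and `should_disable_opportunistic_reuse` are hypothetical helpers for illustration and are not part of RMM or the CUDA runtime.

```cpp
#include <cassert>

// Hypothetical helper type: a decoded CUDA version.
struct cuda_version {
  int major_number;
  int minor_number;
};

// CUDA encodes versions as 1000 * major + 10 * minor (e.g. 11050 is 11.5).
constexpr cuda_version decode_cuda_version(int encoded)
{
  return cuda_version{encoded / 1000, (encoded % 1000) / 10};
}

// Hypothetical predicate mirroring the check in the diff: opportunistic
// reuse is disabled only when the driver is older than 11.5.
constexpr bool should_disable_opportunistic_reuse(int driver_version)
{
  constexpr int min_async_version{11050};
  return driver_version < min_async_version;
}

// Compile-time checks of the boundary cases.
static_assert(decode_cuda_version(11050).major_number == 11, "major of 11050");
static_assert(decode_cuda_version(11050).minor_number == 5, "minor of 11050");
static_assert(should_disable_opportunistic_reuse(11040), "11.4 is below the threshold");
static_assert(!should_disable_opportunistic_reuse(11050), "11.5 keeps the default");
```

The comparison is strict, so a driver reporting exactly 11.5 leaves the pool's reuse policy untouched; only older drivers go through the `cudaMemPoolSetAttribute` call.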