
Add rmm::prefetch() and DeviceBuffer.prefetch() #1573

Merged: 32 commits into rapidsai:branch-24.08 on Jun 12, 2024

Conversation

@harrism (Member) commented May 30, 2024

Description

This adds two `rmm::prefetch()` functions in C++.

  1. `rmm::prefetch(void *ptr, size_t bytes, device, stream)`
  2. `rmm::prefetch<T>(cuda::std::span<T> data, device, stream)`

Item 2 enables prefetching the containers that RMM provides (`device_uvector`, `device_scalar`) that support conversion to `cuda::std::span`. In order to enable that, `device_scalar::size()` is added.

Note that `device_buffer`s must be prefetched using item 1 because you can't create a `span<void>`.

In Python, this adds `DeviceBuffer.prefetch()` because that's really the only RMM Python data type to prefetch. There is *one* Cython use of `device_uvector` in cuDF `join` that we might need to add prefetch support for later.

`prefetch` is a no-op on non-managed memory. Rather than querying the type of memory, it just catches `cudaErrorInvalidValue` from `cudaMemPrefetchAsync`.
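
A minimal usage sketch of the Python side (assuming managed memory is enabled via `rmm.reinitialize` and that `prefetch` accepts the optional `device` argument documented later in this thread; the `cudaCpuDeviceId` constant comes from cuda-python):

import numpy as np
import rmm
from cuda import cudart  # only needed for the cudaCpuDeviceId constant

# prefetch is only meaningful for managed memory; on other memory it is a no-op
rmm.reinitialize(managed_memory=True)

db = rmm.DeviceBuffer.to_device(np.zeros(256, dtype="u1"))
db.prefetch()                               # prefetch to the current CUDA device
db.prefetch(device=cudart.cudaCpuDeviceId)  # prefetch back to the CPU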

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

github-actions bot added labels: Python (Related to RMM Python API), cpp (Pertains to C++ code) on May 30, 2024
@harrism added labels: feature request (New feature or request), non-breaking (Non-breaking change) on May 30, 2024
@jrhemstad (Contributor)

I don't think this should be a member function. The fact that the member function is only useful for specific upstream resource types is a big code smell to me.

Instead, I think this should be a freestanding function like:

template <typename T>
auto prefetch(T const* ptr, size_t n, device, stream);

@harrism (Member, Author) commented Jun 4, 2024

> I don't think this should be a member function. The fact that the member function is only useful for specific upstream resource types is a big code smell to me. Instead, I think this should be a freestanding function

I actually had the same thought over the weekend. But there are a few other factors to consider.

  1. In RMM Python, the only thing one has to prefetch is a `DeviceBuffer`. And pointers are non-Pythonic. So I think it makes sense to associate prefetch with `DeviceBuffer` in the Python API, either as a method or a free function that expects a `DeviceBuffer`.
  2. I do think we want to put unconditional calls to prefetch in algorithm code (e.g. in cuDF). These should be functional no-ops when the memory is not managed. (A migratable or similar property for `cuda::mr` would help with that...)
  3. What do you think about versions for containers and iterators? I think only providing a pointer interface is error-prone and makes it harder than necessary to prefetch a range from a container, or a whole container:
template <typename Iterator>
auto prefetch(Iterator begin, Iterator end, device, stream)
{
  // Assumes a contiguous range, so (address of first element, byte count)
  // describes a single span of memory
  constexpr auto elt_size = sizeof(typename std::iterator_traits<Iterator>::value_type);
  return prefetch(std::addressof(*begin), std::distance(begin, end) * elt_size, device, stream);
}

template <typename Container>
auto prefetch(Container const& c, device, stream)
{
  return prefetch(c.begin(), c.end(), device, stream);
}

github-actions bot added the CMake label on Jun 4, 2024
@harrism changed the title from "Add prefetch() method to device_buffer" to "Add rmm::prefetch() and DeviceBuffer.prefetch()" on Jun 4, 2024
@harrism marked this pull request as ready for review June 5, 2024 02:20
@harrism requested review from a team as code owners June 5, 2024 02:20
@harrism requested review from rongou and bdice June 5, 2024 02:20
* @param stream The stream to use for the prefetch
*/
template <typename T>
void prefetch(cuda::std::span<T> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
Contributor:

This function shouldn't be mutating any of the data.

Suggested change:
- void prefetch(cuda::std::span<T> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
+ void prefetch(cuda::std::span<T const> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)

Comment on lines 41 to 47
* @param ptr The pointer to the memory to prefetch
* @param size The number of bytes to prefetch
* @param device The device to prefetch to
* @param stream The stream to use for the prefetch
*/
template <typename T>
void prefetch(T* ptr, std::size_t size, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
Contributor:

If `ptr` is typed, then `size` shouldn't be bytes, it should be elements.

Contributor:

Agreed. If this takes `T* ptr`, it should use `size * sizeof(T)` to compute the bytes.

Or, if this is really designed for the case of device buffers, it could just use `void* ptr` and accept size in bytes.

Member Author:

I'm switching it back to `void const*` because then we can use `span::size_bytes()` in the span function. Someone suggested the `T*` version during discussions, but I can't remember who or why. If there is a good reason, I'm all ears.


def test_rmm_device_buffer_prefetch(pool, managed):
    rmm.reinitialize(pool_allocator=pool, managed_memory=managed)
    db = rmm.DeviceBuffer.to_device(np.zeros(256, dtype="u1"))
    db.prefetch()  # just test that it doesn't throw
Contributor:

You might be able to test that a prefetch call was issued with `cudaMemRangeGetAttribute(void* data, size_t dataSize, cudaMemRangeAttribute attribute, const void* devPtr, size_t count)`, with attribute `cudaMemRangeAttributeLastPrefetchLocation`. It ought to be possible to call this via cuda-python (API docs).

Also:

> Note that this simply returns the last location that the application requested to prefetch the memory range to. It gives no indication as to whether the prefetch operation to that location has completed or even begun.

You wouldn't know if the prefetch completed or not, but you could verify that the prefetch request was issued.
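
For instance, a sketch of such a check with cuda-python (the helper name and the 4-byte attribute size are assumptions for illustration, not code from this PR):

from cuda import cudart

def last_prefetch_location(db):
    # Ask the runtime for the last location this range was *requested* to be
    # prefetched to; this says nothing about whether the prefetch completed.
    err, location = cudart.cudaMemRangeGetAttribute(
        4,  # dataSize: this attribute is returned as a 4-byte int
        cudart.cudaMemRangeAttribute.cudaMemRangeAttributeLastPrefetchLocation,
        db.ptr,
        db.size,
    )
    assert err == cudart.cudaError_t.cudaSuccess
    return location  # device ordinal, or cudaCpuDeviceId for the host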

Contributor:

...of course, as soon as I scrolled down, I see you did this exact thing in the C++ tests, at Vyas's request. It would be nice to have a corresponding Python API test, since it should be quick to write with cuda-python.

Member Author:

OK...

Member Author:

CUDA Python results in very ugly code due to its non-Pythonic error handling. But I've done what you asked...

@harrism requested review from bdice, jrhemstad and vyasr June 7, 2024 02:36
@wence- (Contributor) commented Jun 11, 2024

The devcontainer build failures are due to (I think) not having NVIDIA/cccl#1836.

@harrism (Member, Author) commented Jun 12, 2024

> The devcontainer build failures are due to (I think) not having NVIDIA/cccl#1836.

I don't understand. This was working previously. I think this broke with the update to CCCL 2.5; is this a regression in 2.5?

@miscco any idea why this was working fine before? I tried to go back to your godbolt example, but godbolt only supports up to CCCL 2.2.

Note that this only affects `device_scalar`; `device_uvector` converts to a span no problem. I confirmed that adding `begin`/`end` to `device_scalar` fixes this, but I'm not sure we want to do that...

@leofang (Member) commented Jun 12, 2024

It seems the doc-build pipeline failed but there's no log. Is it possible to retrigger it?

@harrism (Member, Author) commented Jun 12, 2024

/merge

rapids-bot merged commit a709394 into rapidsai:branch-24.08 on Jun 12, 2024
58 checks passed
device : optional
The CUDA device to which to prefetch the memory for this buffer.
Defaults to the current CUDA device. To prefetch to the CPU, pass
`~cuda.cudart.cudaCpuDeviceId` as the device.
@leofang (Member) commented Jun 13, 2024

It seems the doc build did not render it right; see:
https://downloads.rapids.ai/ci/rmm/pull-request/1573/5165889/docs/rmm/html/
Possibly because the default role in RMM is "cpp" instead of "py". If so, the fix would be:

:py:`~cuda.cudart.cudaCpuDeviceId`

Member:

I am certain that the correct reference is there. It can be checked as follows:

$ python -m sphinx.ext.intersphinx https://nvidia.github.io/cuda-python/objects.inv | grep "cuda.cudart.cudaCpuDeviceId"
    cuda.cudart.cudaCpuDeviceId                                                      : module/cudart.html#cuda.cudart.cudaCpuDeviceId

Member:

Did this get fixed or should this be raised in a new issue?

Contributor:

I don’t think this was seen after merging. Issue filed: #1635 (comment)


Member:

Thanks Bradley! 🙏

copy-pr-bot bot pushed a commit that referenced this pull request Jun 17, 2024

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Rong Ou (https://github.com/rongou)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Michael Schellenberger Costa (https://github.com/miscco)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1573
Labels
CMake · cpp (Pertains to C++ code) · feature request (New feature or request) · non-breaking (Non-breaking change) · Python (Related to RMM Python API)
Projects: Status: Done
10 participants