Add initial support for whole-array reduction on NVIDIA GPUs #23689

Merged: 42 commits merged into chapel-lang:main, Nov 6, 2023

Conversation

e-kayrakli (Contributor) commented on Oct 23, 2023:

Addresses Step 1 in #23324
Resolves https://github.com/Cray/chapel-private/issues/5513

This PR adds several gpu*Reduce functions to perform whole-array reductions for arrays allocated in GPU memory. The added functions cover Chapel's default reductions and are named:

  • gpuSumReduce
  • gpuMinReduce
  • gpuMaxReduce
  • gpuMinLocReduce
  • gpuMaxLocReduce

NVIDIA implementation

This is done by wrapping CUB: https://nvlabs.github.io/cub/. CUB is a C++ header-only template library; we wrap it with some macro magic in the runtime. This currently increases the runtime build time by quite a bit. We might consider wrapping the library's functions in non-inline helpers to reduce that cost.
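
For illustration, here is a minimal sketch of what that macro-based wrapping can look like. The macro and function names below are stand-ins rather than the actual runtime symbols, and error handling is omitted; the point is CUB's two-phase pattern (one call to size the temporary storage, a second call to run the reduction), which is what makes a thin macro-generated wrapper per reduction kind and element type feasible:

```cpp
// Hypothetical sketch of macro-generated CUB wrappers (illustrative names,
// not the actual runtime code).
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdint>

#define GPU_CUB_REDUCE(chpl_kind, cub_kind, data_type)                      \
  extern "C" void gpu_impl_##chpl_kind##_reduce_##data_type(                \
      const data_type* data, int num_items, data_type* result,              \
      cudaStream_t stream) {                                                \
    data_type* d_out = nullptr;                                             \
    cudaMalloc((void**)&d_out, sizeof(data_type));                          \
    void* d_temp = nullptr;                                                 \
    size_t temp_bytes = 0;                                                  \
    /* first call only computes the temporary-storage size */               \
    cub::DeviceReduce::cub_kind(d_temp, temp_bytes, data, d_out,            \
                                num_items, stream);                         \
    cudaMalloc(&d_temp, temp_bytes);                                        \
    /* second call performs the reduction on the given stream */            \
    cub::DeviceReduce::cub_kind(d_temp, temp_bytes, data, d_out,            \
                                num_items, stream);                         \
    cudaMemcpyAsync(result, d_out, sizeof(data_type),                       \
                    cudaMemcpyDeviceToHost, stream);                        \
    cudaStreamSynchronize(stream);                                          \
    cudaFree(d_temp);                                                       \
    cudaFree(d_out);                                                        \
  }

// One instantiation per (reduction kind, element type) pair, e.g.:
GPU_CUB_REDUCE(sum, Sum, int32_t)
GPU_CUB_REDUCE(min, Min, int32_t)
GPU_CUB_REDUCE(max, Max, int32_t)
```

Because the templates get instantiated inside the runtime, this is also where the extra build time comes from.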

AMD implementation

AMD has hipCUB: https://rocm.docs.amd.com/projects/hipCUB/en/latest/ and a slightly lower-level, AMD-only rocPRIM: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/. I couldn't get either to work: I can't run a simple HIP reproducer based on one of their tests. I might be doing something wrong in compilation, but what I am getting is a segfault in the launched kernel (or hipLaunchKernel). I filed ROCm/hipCUB#304, but haven't received a response quickly enough to address the problem in this PR.

This is really unfortunate, but we may have better/native reduction support soon, and we can cover AMD there, too.

Implementation details:

  • chpl_gpu_X_reduce_Y functions are added to the main runtime interface via macros, where X is the reduction kind and Y is the data type. These functions print debugging output, find the stream to run the reduction on, and call their impl cousins.
  • chpl_gpu_impl_X_reduce_Y functions are added to the implementation layer in a similar fashion.
  • These functions live in gpu/Z/gpu-Z-reduce.cc in the runtime, where Z is either nvidia or amd.
    • The AMD versions are mostly "implemented" the way I think they should work, but because of the segfaults I was getting, they are in the else branch of an #if 1 at the moment.
  • The module code has a private doGpuReduce that calls the appropriate runtime function for any reduction type. This function has some similarities to how atomics are implemented; unfortunately, the interfaces are different enough that I can't come up with a good way to refactor some of the helpers. All the reduction helpers are nested in doGpuReduce to avoid confusion.
  • To work around a CUB limitation that prevents reducing arrays whose size is close to max(int(32)), the implementation runs the underlying CUB implementation on at most 2B elements at a time and stitches the partial results together on the host when multiple calls are needed (see the sketch after this list). The underlying issue is captured in:
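
To illustrate the chunk-and-stitch approach, here is a minimal host-side sketch. The function gpu_impl_sum_reduce_int64 is a hypothetical stand-in for the per-chunk, CUB-backed call; the exact chunk size, types, and placement of this logic are illustrative only:

```cpp
// Hypothetical sketch of the chunk-and-stitch workaround (illustrative
// names; not the actual module/runtime code).
#include <algorithm>
#include <cstdint>

// Stand-in for the per-chunk, CUB-backed runtime call:
extern "C" void gpu_impl_sum_reduce_int64(const int64_t* data, int num_items,
                                          int64_t* result);

// CUB takes a 32-bit element count, so cap each call at ~2B elements and
// combine the partial results on the host.
int64_t chunkedSumReduce(const int64_t* devData, int64_t n) {
  const int64_t maxChunk = 2000000000;
  int64_t total = 0;
  for (int64_t offset = 0; offset < n; offset += maxChunk) {
    const int numItems = (int)std::min(maxChunk, n - offset);
    int64_t partial = 0;
    gpu_impl_sum_reduce_int64(devData + offset, numItems, &partial);
    total += partial;  // stitch the partial results together on the host
  }
  return total;
}
```

For min/max and the loc variants, the stitching step would instead take the min/max of the partial results (offsetting the returned indices by each chunk's starting offset) rather than summing them.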

Future work:

  • Keep an eye on the AMD bug report
  • If the bug remains unresolved, implement a fallback to an in-house reduction once we're ready to run one.

Test Status

  • nvidia
  • amd
  • flat make check

e-kayrakli (Contributor, Author) commented:
With the new implementation to handle very large arrays, I wanted to see how the performance scales with larger data. The current implementation splits the input array into chunks of at most 2B elements. Here's sum-reduce throughput with increasing data size on an A100:

| N   | GiB/s   |
|-----|---------|
| 2B  | 427.555 |
| 4B  | 687.735 |
| 8B  | 976.821 |
| 10B | 1073.51 |

Array elements are of type int. I compute the throughput as "bytes processed per second". I am happy to see that there's no performance drop as the data size grows.

e-kayrakli marked this pull request as ready for review on October 26, 2023.
stonea (Contributor) left a comment:
Definitely great to see this. Being able to do reductions efficiently on the GPU is something important that we were missing.

Do you have any thoughts about wrapping other "cub" functions (scans, sorts, etc.)?

More generally, do you have any thoughts about whether we should do a push to wrap more functions like this (in cub, in the CUDA runtime library, and in other common CUDA libraries like cuBLAS, cuFFT, etc.)? I'm inclined to say we should, because:

  • ultimately, even if our goal is to abstract things behind nicer, Chapeltastic syntax, we'll probably need these APIs for our own internal use anyway, so the effort we'd put into wrapping things wouldn't be wasted.
  • I imagine a lot of folks might be turned off from using GPU support just because we're missing feature X; if it's relatively minimal effort to add feature X just by wrapping some existing third-party library function, why shouldn't we try to prioritize that?

Anyway, if you think there's a chance we might want to do more "wrappings" in the near-term future, then I get a little more worried that this PR might be setting the precedent for how to go about doing that. And if that's the case, then I think it might be worth putting more (sub)team-wide effort into reviewing this PR. Really, I'm left wondering:

  • if there's anything we could do to make wrapping things like this for GPU support easier, and/or
  • if there's anything we could do to reduce the amount of duplicate effort we'll have when doing these future wrappings

Anyway, I'm inclined to say maybe we should bring this up as a brainstorming topic to the rest of the group: show everyone what we've done for this PR (and, say, the atomics one too), and see if:

  • anyone has any ideas on how to make this less painful, and even if not
  • just let people know what the current "best practice" is so folks don't reinvent it if they end up wrapping something else.

I wouldn't hold off submitting the PR in the meantime. But I would like to hear your thoughts on it.

Review threads (resolved): test/gpu/native/reduction/SKIPIF, modules/standard/GPU.chpl, runtime/src/chpl-gpu.c, runtime/src/gpu/nvidia/gpu-nvidia-reduce.cc
e-kayrakli (Contributor, Author) replied:
> Do you have any thoughts about wrapping other "cub" functions (scans, sorts, etc.)?

https://github.com/Cray/chapel-private/issues/5330 captures that to some extent. @ShreyasKhandekar will be working on adding some native implementations, where the ultimate goal is to wrap CUB support for those functions.

> Anyway, if you think there's a chance we might want to do more "wrappings" in the near-term future, then I get a little more worried that this PR might be setting the precedent for how to go about doing that. And if that's the case, then I think it might be worth putting more (sub)team-wide effort into reviewing this PR. Really, I'm left wondering:

Everything you say about this resonates with me. The ideal goal would be to do all this wrapping in Chapel without the runtime, and that connects to C++ interop in general. I think @lydia-duncan is planning to start looking into what we need for partial C++ interop, but that is not going to happen very soon.

It is perplexing that your atomics effort and this reduction effort are independent. Admittedly, I haven't spent enough time in the runtime to understand whether we can do some generalization there. But I was a bit frustrated to see that there's no clear refactoring to be had in the module code.

For "less fundamental" stuff like cuBLAS and cuFFT, things might be a bit brighter: as far as I can see, they all align with historical libraries (BLAS, FFTW), so they have C APIs. Hopefully, they won't require runtime efforts (you already know that we have a cuBLAS wrapper that we should probably revive).

stonea (Contributor) left a comment:
For follow-up work:

I'm also curious if you've had a chance to look at / consider nccl, which also has reductions: https://developer.nvidia.com/nccl

And it looks like this is the AMD equivalent: https://github.com/ROCmSoftwarePlatform/rccl

It looks like it supports multi-GPU / multi-node patterns, which is definitely something we'd be interested in.

I'm not sure, but maybe it uses cub under the hood?

EDIT: Looks like we had some conversation about it here: https://github.com/Cray/chapel-private/issues/4878

e-kayrakli (Contributor, Author) replied:
> I'm also curious if you've had a chance to look at / consider nccl, which also has reductions

I haven't actually. Thanks for bringing that up.

NCCL is a collective communication library that tries to be similar to MPI in its interface. But the whole idea with it is to reduce data across multiple GPUs, for example. It looks like it can run on different parallel structures:

  • single-threaded control of all GPUs
  • multi-threaded, for example, using one thread per GPU
  • multi-process, for example, MPI

We fall into the second category.

But the need for NCCL may arise when reducing a block-distributed array across multiple GPUs, for example. Right now, reducing arrays sitting in multiple GPUs' memories is not something we can represent in the language, unfortunately.

e-kayrakli merged commit befb7fb into chapel-lang:main on Nov 6, 2023, with 7 checks passed, and deleted the gpu-cub-reduce branch on November 6, 2023.
e-kayrakli added a commit that referenced this pull request on Nov 30, 2023:
Resolves Cray/chapel-private#5609

Continuation from where I left off in #23689. In that PR, I struggled with segfaults on AMD GPUs, so I had to back out of AMD support. It turns out AMD GPUs tend to segfault at execution time if you don't use the right `--offload-arch`, and the default on the system where I tested this was not right. This PR passes that flag while compiling the reduction support code in the runtime to remove the blockage.

### Details
- The runtime now has a `chpl_gpu_can_reduce`/`chpl_gpu_impl_can_reduce`
interface that returns true/false depending on whether we can use the
cub-based reduction support
- Today, it returns false for cpu-as-device mode and for ROCm 4.x, which
doesn't have hipcub
- For cases where there's no cub-based reduction, we fall back to regular
CPU-based reductions (see the sketch after this list). On ROCm 4.x this
means we copy the array to the host and reduce on the host. Clearly, this
is less than ideal and just a portability stopgap; I hope to drop ROCm 4
support as soon as we can
- Adds a new `rocm-utils` header to be able to use `ROCM_VERSION_MAJOR`
portably, and to be able to use `ROCM_CALL` in multiple files
- Moves `test/gpu/native/noAmd/reduction` directory to `test/gpu/native`
and removes `noAmd.skipif`
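
As a rough illustration of the fallback described above, here is a minimal sketch. `chpl_gpu_can_reduce` is the query named above, but its exact signature and the helper names used here are assumptions for illustration only:

```cpp
// Hypothetical sketch of the can-reduce query plus host fallback
// (illustrative signatures; not the actual runtime code).
#include <cstdint>
#include <numeric>
#include <vector>

extern "C" bool chpl_gpu_can_reduce(void);                // assumed signature
extern "C" void gpu_impl_sum_reduce_int64(const int64_t*, // assumed cub/hipcub path
                                          int, int64_t*);
extern "C" void gpu_copy_to_host(int64_t* dst,            // assumed device-to-host copy
                                 const int64_t* devSrc, int64_t n);

int64_t sumReduce(const int64_t* devData, int n) {
  int64_t result = 0;
  if (chpl_gpu_can_reduce()) {
    // cub/hipcub-backed whole-array reduction on the device
    gpu_impl_sum_reduce_int64(devData, n, &result);
  } else {
    // Portability stopgap (e.g. ROCm 4.x): copy back and reduce on the host
    std::vector<int64_t> host(n);
    gpu_copy_to_host(host.data(), devData, n);
    result = std::accumulate(host.begin(), host.end(), int64_t(0));
  }
  return result;
}
```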

[Reviewed by @stonea]

### Test
- [x] nvidia
- [x] amd with ROCm 4.2
- [x] amd with ROCm 4.4
- [x] amd with ROCm 5.2 `gpu/native/reduction` only  
- [x] amd with ROCm 5.4 `gpu/native/reduction` only
- [x] cpu `gpu/native/reduction` only