# Add initial support for whole-array reduction on NVIDIA GPUs (#23689)
Addresses Step 1 in #23324
Resolves Cray/chapel-private#5513

This PR adds several `gpu*Reduce` functions to perform whole-array reductions for arrays that are allocated in GPU memory. The functions added cover Chapel's default reductions and are named:

- `gpuSumReduce`
- `gpuMinReduce`
- `gpuMaxReduce`
- `gpuMinLocReduce`
- `gpuMaxLocReduce`

### NVIDIA implementation

This is done by wrapping CUB: https://nvlabs.github.io/cub/. CUB is a C++ header-only template library. We wrap it with some macro magic in the runtime. This currently increases runtime build time by quite a bit. We might consider wrapping the library's functions in non-inline helpers, which could help somewhat.

### AMD implementation

AMD has hipCUB: https://rocm.docs.amd.com/projects/hipCUB/en/latest/ and a slightly lower-level, AMD-only rocPRIM: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/. I couldn't get either to work: even a simple HIP reproducer based on one of their tests fails for me. I might be doing something wrong in compilation, but what I observe is a segfault in the launched kernel (or in `hipLaunchKernel`). I filed ROCm/hipCUB#304, but haven't received a response quickly enough to address it in this PR. This is really unfortunate, but we may have better/native reduction support soon, and we can cover AMD there, too.

### Implementation details

- `chpl_gpu_X_reduce_Y` functions are added to the main runtime interface via macros, where X is the reduction kind and Y is the data type. Each one prints debugging output, finds the stream to run the reduction on, and calls its `impl` cousin.
- `chpl_gpu_impl_X_reduce_Y` functions are added to the implementation layer in a similar fashion.
- These functions are added to `gpu/Z/gpu-Z-reduce.cc` in the runtime, where Z is either `nvidia` or `amd`.
- The AMD versions are mostly "implemented" the way I think they should work, but because of the segfaults I was getting, they are in the `else` branch of an `#if 1` at the moment.
- The module code has a private `doGpuReduce` that calls the appropriate runtime function for any reduction type. This function has some similarities to how atomics are implemented, but the interfaces are different enough that I couldn't come up with a good way to refactor the shared helpers. All the reduction helpers are nested inside `doGpuReduce` to avoid confusion.
- To work around a CUB limitation that prevents reducing arrays whose size is close to `max(int(32))`, the implementation runs the underlying CUB routine with at most 2B elements at a time and stitches the partial results together on the host if it ends up calling the implementation multiple times. The underlying issue is captured in:
  - https://github.com/NVIDIA/thrust/issues/1271
  - NVIDIA/cccl#49

### Future work

- Keep an eye on the AMD bug report.
- Implement a fallback in-house reduction if the bug remains unresolved.

[Reviewed by @stonea]

### Test Status

- [x] nvidia
- [x] amd
- [x] flat `make check`
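The chunk-and-stitch workaround in the last bullet can be sketched on the host side. This is a hypothetical illustration, not the actual runtime code: `reduce_chunk_sum` stands in for the size-limited CUB call, and all names here are invented for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for the underlying CUB invocation, which can
 * only reduce a chunk whose length fits in a signed 32-bit count. */
static int64_t reduce_chunk_sum(const int64_t* data, int32_t n) {
  int64_t acc = 0;
  for (int32_t i = 0; i < n; i++) acc += data[i];
  return acc;
}

/* Reduce an array of arbitrary length by calling the size-limited
 * implementation at most INT32_MAX elements at a time and stitching
 * the partial results together on the host. For min/max(-loc)
 * reductions, the stitching step would combine partials with the
 * corresponding operator instead of `+`. */
static int64_t sum_reduce(const int64_t* data, int64_t n) {
  int64_t result = 0;
  int64_t done = 0;
  while (done < n) {
    int64_t remaining = n - done;
    int32_t chunk = remaining > INT32_MAX ? INT32_MAX : (int32_t)remaining;
    result += reduce_chunk_sum(data + done, chunk);  /* stitch on host */
    done += chunk;
  }
  return result;
}
```

The loop structure is the essential point: each pass hands the implementation a chunk whose length is representable as `int32_t`, so the CUB size limitation is never hit.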
33 changed files with 874 additions and 6 deletions.
One of the new files is a header defining the type-expansion macros (new file, 47 lines):

```c
/*
 * Copyright 2020-2023 Hewlett Packard Enterprise Development LP
 * Copyright 2004-2019 Cray Inc.
 * Other additional copyright holders may be indicated within.
 *
 * The entirety of this work is licensed under the Apache License,
 * Version 2.0 (the "License"); you may not use this file except
 * in compliance with the License.
 *
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifdef HAS_GPU_LOCALE

#define GPU_IMPL_REDUCE(MACRO, impl_kind, chpl_kind) \
  MACRO(impl_kind, chpl_kind, int8_t) \
  MACRO(impl_kind, chpl_kind, int16_t) \
  MACRO(impl_kind, chpl_kind, int32_t) \
  MACRO(impl_kind, chpl_kind, int64_t) \
  MACRO(impl_kind, chpl_kind, uint8_t) \
  MACRO(impl_kind, chpl_kind, uint16_t) \
  MACRO(impl_kind, chpl_kind, uint32_t) \
  MACRO(impl_kind, chpl_kind, uint64_t) \
  MACRO(impl_kind, chpl_kind, float) \
  MACRO(impl_kind, chpl_kind, double);

#define GPU_REDUCE(MACRO, chpl_kind) \
  MACRO(chpl_kind, int8_t) \
  MACRO(chpl_kind, int16_t) \
  MACRO(chpl_kind, int32_t) \
  MACRO(chpl_kind, int64_t) \
  MACRO(chpl_kind, uint8_t) \
  MACRO(chpl_kind, uint16_t) \
  MACRO(chpl_kind, uint32_t) \
  MACRO(chpl_kind, uint64_t) \
  MACRO(chpl_kind, float) \
  MACRO(chpl_kind, double);

#endif // HAS_GPU_LOCALE
```
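To see how such an X-macro works in practice: passing a function-defining macro to a generator like `GPU_REDUCE` stamps out one definition per element type. The following is a hypothetical, trimmed-down illustration (fewer types, CPU-only function bodies, invented names); it is not the actual runtime call site.

```c
#include <stddef.h>
#include <stdint.h>

/* A function-defining macro of the shape GPU_REDUCE expects: it pastes
 * the reduction kind and element type into one function name per
 * expansion (hypothetical example, not the runtime's real MACRO). */
#define DEF_SUM_REDUCE(chpl_kind, data_type) \
  static data_type chpl_gpu_##chpl_kind##_reduce_##data_type \
      (const data_type* data, size_t n) { \
    data_type acc = 0; \
    for (size_t i = 0; i < n; i++) acc += data[i]; \
    return acc; \
  }

/* Trimmed-down version of the header's GPU_REDUCE for this example. */
#define GPU_REDUCE(MACRO, chpl_kind) \
  MACRO(chpl_kind, int32_t) \
  MACRO(chpl_kind, int64_t) \
  MACRO(chpl_kind, double)

/* One line expands into three definitions:
 * chpl_gpu_sum_reduce_int32_t, chpl_gpu_sum_reduce_int64_t,
 * and chpl_gpu_sum_reduce_double. */
GPU_REDUCE(DEF_SUM_REDUCE, sum)
```

This is why the header only lists type names: the per-kind behavior lives entirely in the `MACRO` argument supplied at each expansion site.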