Commit: Update UCX documentation for RX_QUEUE_LEN and Docker (NVIDIA#1804)

* Add UCX RX_QUEUE_LEN configuration, add a docker run example, and remove the old README.
* Use UCX_IB_RX_QUEUE_LEN
* Move shuffle documentation to additional-functionality
* Apply review feedback
* Update docs/additional-functionality/rapids-shuffle.md

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
---
layout: page
title: RAPIDS Shuffle Manager
parent: Additional Functionality
nav_order: 5
---
# RAPIDS Shuffle Manager

---
**NOTE**

The _RAPIDS Shuffle Manager_ is a beta feature!

---

The RAPIDS Shuffle Manager is an implementation of the `ShuffleManager` interface in Apache Spark
that allows custom mechanisms to exchange shuffle data. It has two components: a spillable cache,
and a transport that can utilize Remote Direct Memory Access (RDMA) and high-bandwidth transfers
within a node that has multiple GPUs. This is possible because the plugin utilizes
[Unified Communication X (UCX)](https://www.openucx.org/) as its transport.
- **Spillable cache**: This store keeps GPU data in device memory, close to where it was produced,
  but it can spill in the following cases:
  - GPU out of memory: If an allocation on the GPU fails to acquire memory, a spill is triggered,
    moving GPU buffers to host memory so the original allocation can succeed.
  - Host spill store filled: If the host memory store reaches its maximum threshold
    (`spark.rapids.memory.host.spillStorageSize`), host buffers are spilled to disk until
    the host spill store shrinks back below that configurable threshold.

  Tasks local to the producing executor short-circuit read from the cache.
- **Transport**: Handles block transfers between executors using various means, such as NVLink,
  PCIe, Infiniband (IB), RDMA over Converged Ethernet (RoCE), or TCP, as configured in UCX,
  in these scenarios:
  - GPU-to-GPU: Shuffle blocks that were able to fit in GPU memory.
  - Host-to-GPU and Disk-to-GPU: Shuffle blocks that spilled to host (or disk) but will be
    manifested in the GPU in the downstream Spark task.
### System Setup

In order to enable the RAPIDS Shuffle Manager, UCX user-space libraries and their dependencies
must be installed on the host and within Docker containers. The host has additional requirements,
such as the MLNX_OFED driver and the `nv_peer_mem` kernel module.
#### Baremetal

1. If you have Mellanox hardware, please ensure you have the
   [MLNX_OFED driver](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) and the
   [`nv_peer_mem` kernel module](https://www.mellanox.com/products/GPUDirect-RDMA) installed. UCX
   packages are compatible with MLNX_OFED 5.0+. Please install the latest driver available.

   With `nv_peer_mem` (GPUDirectRDMA), IB/RoCE-based transfers can perform zero-copy transfers
   directly from GPU memory. Note that GPUDirectRDMA is known to show
   [performance issues and bugs](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems)
   on machines that don't connect their GPUs and NICs to PCIe switches (i.e. the devices are
   attached directly to the root complex).

   If you encounter issues or poor performance, GPUDirectRDMA can be turned off via the
   UCX environment variable `UCX_IB_GPU_DIRECT_RDMA=no`, but please
   [file a GitHub issue](https://github.com/NVIDIA/spark-rapids/issues) so we can investigate
   further.
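
   As a quick sanity check, you can verify the driver version and that the kernel module is loaded
   (a sketch; `ofed_info` ships with MLNX_OFED, and module handling may differ on your
   distribution):

   ```
   # Print the installed MLNX_OFED version
   ofed_info -s

   # Confirm the nv_peer_mem kernel module is loaded; load it if the output is empty
   lsmod | grep nv_peer_mem
   sudo modprobe nv_peer_mem
   ```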

2. (skip if you followed Step 1) For setups without RoCE/Infiniband, the UCX 1.9.0 packaging
   requires RDMA packages for installation. UCX 1.10.0+ will relax these requirements for these
   types of environments, so this step should not be needed in the future.

   If you still want to install UCX 1.9.0 on a machine without RoCE/Infiniband hardware, please
   build and install `rdma-core`. You can use the [Docker sample below](#ucx-minimal-dockerfile)
   as a reference; a bare-metal sketch of the same build follows.
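
   On an Ubuntu 18.04 host, this mirrors what the Dockerfile below does (a sketch; the package
   names assume Ubuntu 18.04):

   ```
   # Build dependencies for rdma-core (same set used in the Dockerfile below)
   sudo apt-get install -y dh-make git build-essential cmake gcc libudev-dev \
       libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind \
       python3-dev cython3 python3-docutils pandoc

   # Build the Debian packages and install them
   git clone --depth 1 --branch v33.0 https://github.com/linux-rdma/rdma-core
   cd rdma-core && debian/rules binary
   sudo dpkg -i ../*.deb
   ```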

3. Fetch and install the UCX package for your OS and CUDA version from
   [UCX 1.9.0](https://github.com/openucx/ucx/releases/tag/v1.9.0).

   UCX versions 1.9 and below require the user to install the cuda-compat package
   matching the CUDA version of the UCX package (e.g. `cuda-compat-11-1`), in addition to:
   `ibverbs-providers`, `libgomp1`, `libibverbs1`, `libnuma1`, `librdmacm1`.

   UCX versions 1.10.0+ will drop the `cuda-compat` requirement and remove the explicit
   RDMA dependencies, greatly simplifying installation in some cases. Note that these dependencies
   have already been met if you followed Steps 1 or 2 above.
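
   As an illustration, an install on Ubuntu 18.04 with a CUDA 10.1 build of UCX might look like
   this (a sketch; adjust the package file name and cuda-compat version to your setup):

   ```
   # Dependencies called out above for UCX 1.9 and below
   sudo apt-get install -y cuda-compat-10-1 ibverbs-providers libgomp1 \
       libibverbs1 libnuma1 librdmacm1

   # Install the UCX package downloaded from the release page
   wget https://github.com/openucx/ucx/releases/download/v1.9.0/ucx-v1.9.0-ubuntu18.04-mofed5.0-1.0.0.0-cuda10.1.deb
   sudo dpkg -i ucx-v1.9.0-ubuntu18.04-mofed5.0-1.0.0.0-cuda10.1.deb
   ```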

#### Docker containers

Running with UCX in containers imposes certain requirements. In a multi-GPU system, all GPUs that
want to take advantage of PCIe peer-to-peer or NVLink need to be visible within the container. For
example, if two containers are trying to communicate and each has an isolated GPU, the link between
these GPUs will not be optimal, forcing UCX to stage buffers to the host or use TCP.
Additionally, if you want to use RoCE/Infiniband, the `/dev/infiniband` device should be exposed
in the container.

If UCX will be used to communicate between containers, the IPC (`--ipc`) and
PID namespaces (`--pid`) should also be shared.

As of the writing of this document, we have successfully tested `--privileged` containers, which
essentially turn off all isolation. We also assume `--network=host` is specified, allowing the
container to share the host's network. We will revise this document to include any new
configurations as we are able to test different scenarios.
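
Putting those requirements together, a container launch for inter-container communication could
look like the following sketch (the image name is a placeholder; it matches the build example
later in this section, and `--ipc=host`/`--pid=host` are one way to share those namespaces):

```
nvidia-docker run \
    --network=host \
    --ipc=host \
    --pid=host \
    --device /dev/infiniband \
    --privileged \
    -it \
    ucx_container:latest \
    /bin/bash
```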

1. A system administrator should have performed Step 1 in [Baremetal](#baremetal) on the
   host system.

2. Within the Docker container we need to install UCX and its requirements. The following is an
   example Dockerfile that shows how to install `rdma-core` and UCX 1.9.0 with
   `cuda-10.1` support. You can use this as a base layer for containers that your executors
   will use.

<a name="ucx-minimal-dockerfile"></a>
```
ARG CUDA_VER=10.1

# Throwaway image used only to build the rdma-core packages
FROM ubuntu:18.04 as rdma_core

RUN apt update
RUN apt-get install -y dh-make git build-essential cmake gcc libudev-dev libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind python3-dev cython3 python3-docutils pandoc

RUN git clone --depth 1 --branch v33.0 https://github.com/linux-rdma/rdma-core
RUN cd rdma-core && debian/rules binary

# Now start the main container, copying in the rdma-core packages built above
FROM nvidia/cuda:${CUDA_VER}-devel-ubuntu18.04

COPY --from=rdma_core /*.deb /tmp/

RUN apt update
RUN apt-get install -y cuda-compat-10-1 wget udev dh-make libnuma1 libudev-dev libnl-3-dev libnl-route-3-dev python3-dev cython3
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v1.9.0/ucx-v1.9.0-ubuntu18.04-mofed5.0-1.0.0.0-cuda10.1.deb
RUN dpkg -i /tmp/*.deb && rm -rf /tmp/*.deb
```
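
To build an image from this Dockerfile (the `ucx_container:latest` tag matches the run examples
below, and `CUDA_VER` can be overridden at build time):

```
docker build -t ucx_container:latest --build-arg CUDA_VER=10.1 .
```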

### Validating UCX Environment

After installing UCX, you can use `ucx_info` and `ucx_perftest` to validate the installation.

In this section, we are using a Docker container built using the sample Dockerfile above.

1. Start the Docker container in `--privileged` mode. In this example, we are also adding
   `--device /dev/infiniband` to make Mellanox devices available for our test:
   ```
   nvidia-docker run \
       --network=host \
       --device /dev/infiniband \
       --privileged \
       -it \
       ucx_container:latest \
       /bin/bash
   ```
   If you are testing between different machines, please run the above command on each node.

2. Test whether UCX can link against CUDA:
   ```
   root@test-machine:/# ucx_info -d|grep cuda
   # Memory domain: cuda_cpy
   #     Component: cuda_cpy
   #     Transport: cuda_copy
   #        Device: cuda
   # Memory domain: cuda_ipc
   #     Component: cuda_ipc
   #     Transport: cuda_ipc
   #        Device: cuda
   ```

3. Check that the Mellanox device is seen by UCX, and which transports are enabled (e.g. `rc`):
   ```
   root@test-machine:/# ucx_info -d|grep mlx5_3:1 -B1
   #     Transport: rc_verbs
   #        Device: mlx5_3:1
   --
   #     Transport: rc_mlx5
   #        Device: mlx5_3:1
   --
   #     Transport: dc_mlx5
   #        Device: mlx5_3:1
   --
   #     Transport: ud_verbs
   #        Device: mlx5_3:1
   --
   #     Transport: ud_mlx5
   #        Device: mlx5_3:1
   ```

4. You should be able to execute `ucx_perftest` and get a good idea of whether things are working
   as you expect.

   Example 1: GPU <-> GPU in the same host. Without NVLink you should expect PCIe speeds. In this
   case this is PCIe gen3, so a rate of roughly 10 GB/s is expected. It should also match the
   performance seen in `p2pBandwidthLatencyTest`, which is included with the CUDA toolkit (a sketch
   for running it follows these examples).

   - On the server container:
     ```
     root@test-server:/# CUDA_VISIBLE_DEVICES=0 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda
     ```

   - On the client container:
     ```
     root@test-client:/# CUDA_VISIBLE_DEVICES=1 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda localhost
     +--------------+--------------+-----------------------------+---------------------+-----------------------+
     |              |              |       overhead (usec)       |   bandwidth (MB/s)  | message rate (msg/s)  |
     +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
     |    Stage     | # iterations | typical | average | overall | average  | overall  |  average  |  overall  |
     +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
     Final:                  1000     0.000   986.122   986.122    9670.96    9670.96       1014        1014
     ```

   Example 2: GPU <-> GPU across the network, using GPUDirectRDMA. You will notice that in this
   example we picked GPU 3. In our test machine, GPU 3 is closest (same root complex) to the NIC
   we are using for RoCE, and yields better performance than, for example, GPU 0, which sits on a
   different socket.

   - On the server container:
     ```
     root@test-server:/# CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda
     ```
   - On the client container:
     ```
     root@test-client:/# CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda test-server
     +--------------+--------------+-----------------------------+---------------------+-----------------------+
     |              |              |       overhead (usec)       |   bandwidth (MB/s)  | message rate (msg/s)  |
     +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
     |    Stage     | # iterations | typical | average | overall | average  | overall  |  average  |  overall  |
     +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
     [thread 0]               498     0.000  2016.444  2016.444    4729.49    4729.49        496        496
     [thread 0]               978     0.000  2088.412  2051.766    4566.50    4648.07        479        487
     Final:                  1000     0.000  3739.639  2088.899    2550.18    4565.44        267        479
     ```
   Example 3: GPU <-> GPU across the network, without GPUDirectRDMA. You will notice that the
   bandwidth achieved is higher than with GPUDirectRDMA on. This is expected, and a known issue in
   machines where GPUs and NICs are connected directly to the root complex.

   - On the server container:
     ```
     root@test-server:/# UCX_IB_GPU_DIRECT_RDMA=no CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda
     ```

   - On the client container:
     ```
     root@test-client:/# UCX_IB_GPU_DIRECT_RDMA=no CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda test-server
     +--------------+--------------+-----------------------------+---------------------+-----------------------+
     |              |              |       overhead (usec)       |   bandwidth (MB/s)  | message rate (msg/s)  |
     +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
     |    Stage     | # iterations | typical | average | overall | average  | overall  |  average  |  overall  |
     +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
     [thread 0]               670     0.000  1497.859  1497.859    6366.91    6366.91        668        668
     Final:                  1000     0.000  1718.843  1570.784    5548.35    6071.33        582        637
     ```
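
   For reference, `p2pBandwidthLatencyTest` (mentioned in Example 1) ships with the CUDA toolkit
   samples. A sketch for building and running it, assuming the samples are installed at their
   default location for this CUDA version:

   ```
   cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
   make
   ./p2pBandwidthLatencyTest
   ```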

### Spark App Configuration

1. Choose the version of the shuffle manager that matches your Spark version.
   Currently we support:

   | Spark Shim | spark.shuffle.manager value                               |
   | ---------- | --------------------------------------------------------- |
   | 3.0.0      | com.nvidia.spark.rapids.spark300.RapidsShuffleManager     |
   | 3.0.0 EMR  | com.nvidia.spark.rapids.spark300emr.RapidsShuffleManager  |
   | 3.0.1      | com.nvidia.spark.rapids.spark301.RapidsShuffleManager     |
   | 3.0.1 EMR  | com.nvidia.spark.rapids.spark301emr.RapidsShuffleManager  |
   | 3.0.2      | com.nvidia.spark.rapids.spark302.RapidsShuffleManager     |
   | 3.1.1      | com.nvidia.spark.rapids.spark311.RapidsShuffleManager     |
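
   For example, on Spark 3.0.1 you would set (mirroring the table above):

   ```shell
   --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager
   ```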

2. Recommended settings for UCX 1.9.0+:
   ```shell
   ...
   --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark300.RapidsShuffleManager \
   --conf spark.shuffle.service.enabled=false \
   --conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp \
   --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
   --conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \
   --conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 \
   --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
   --conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \
   --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib:/usr/lib/ucx \
   --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
   ```

   Note that `LD_LIBRARY_PATH` only needs to be set if the UCX libraries are installed in a
   non-standard location.

#### UCX Environment Variables
- `UCX_TLS`:
  - `cuda_copy` and `cuda_ipc`: enable handling of CUDA memory in UCX, both for copy-based
    transports and for peer-to-peer communication between GPUs (NVLink/PCIe).
  - `rc`: enables Infiniband- and RoCE-based transports in UCX.
  - `tcp`: allows for TCP communication in cases where UCX deems it necessary.
- `UCX_ERROR_SIGNALS=`: Disables UCX signal catching, as it can cause issues with the JVM.
- `UCX_MAX_RNDV_RAILS=1`: Set this to `1` to disable multi-rail transfers in UCX, where UCX splits
  data to utilize various channels (e.g. two NICs). A value greater than `1` can cause a
  performance drop for high-bandwidth transports between GPUs.
- `UCX_MEMTYPE_CACHE=n`: Disables a cache in UCX that can cause UCX to fail when running with CUDA
  buffers.
- `UCX_RNDV_SCHEME=put_zcopy`: By default, `UCX_RNDV_SCHEME=auto` will pick different schemes for
  the RNDV protocol (`get_zcopy` or `put_zcopy`) depending on message size and on other parameters,
  given the hardware, transports, and settings. We have found that `UCX_RNDV_SCHEME=put_zcopy`
  is more reliable than automatic detection, or `get_zcopy`, in our testing, especially in UCX
  1.9.0. The main difference between get and put is the direction of transfer: a send operation
  under `get_zcopy` is actually an `RDMA READ` issued by the receiver, whereas the same send is an
  `RDMA WRITE` issued by the sender if `put_zcopy` is utilized.
- `UCX_IB_RX_QUEUE_LEN=1024`: Length of the receive queue for the Infiniband/RoCE transports. The
  longer queue is recommended as it has shown better performance when there is memory pressure
  and message sizes are relatively large (more than a few hundred bytes).
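
To inspect how UCX resolves these settings, `ucx_info` can print the effective configuration. A
sketch (the exact variable names printed may vary by UCX version):

```
UCX_TLS=cuda_copy,cuda_ipc,rc,tcp UCX_RNDV_SCHEME=put_zcopy UCX_IB_RX_QUEUE_LEN=1024 \
    ucx_info -c | grep -E 'TLS|RNDV_SCHEME|RX_QUEUE_LEN|MEMTYPE_CACHE'
```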

### Fine Tuning
Here are some settings that can be used to fine-tune the _RAPIDS Shuffle Manager_:

#### Bounce Buffers
The following configs control the number and size of bounce buffers. Please note that for
device buffers, two pools are created (one for sending and one for receiving). Take this into
account when sizing your pools.

The GPU buffers should be smaller than the
[`PCI BAR Size`](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#bar-sizes) for your GPU.
Please verify the [defaults](../configs.md) work in your case.

- `spark.rapids.shuffle.ucx.bounceBuffers.device.count`
- `spark.rapids.shuffle.ucx.bounceBuffers.host.count`
- `spark.rapids.shuffle.ucx.bounceBuffers.size`
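
As a back-of-the-envelope example (the count and size below are illustrative values, not
recommendations), remember that device memory used is count x size x 2 pools (send and receive):
32 buffers of 4 MiB each would reserve 32 x 4 MiB x 2 = 256 MiB of GPU memory.

```shell
--conf spark.rapids.shuffle.ucx.bounceBuffers.device.count=32 \
--conf spark.rapids.shuffle.ucx.bounceBuffers.size=4m \
```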

#### Spillable Store
This setting controls the amount of host memory (RAM) that can be used to spill GPU blocks when
the GPU is out of memory, before going to disk. Please verify the [defaults](../configs.md).
- `spark.rapids.memory.host.spillStorageSize`
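
For example, to allow up to 4 GiB of host memory for spilled GPU blocks (an illustrative value;
check the defaults linked above):

```shell
--conf spark.rapids.memory.host.spillStorageSize=4g
```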