Update RapidsShuffleManager documentation for branch 0.4 (#1716)
* Update RapidsShuffleManager documentation for branch 0.4

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

* Add examples for ucx_info and ucx_perftest

* Fix typo and reword slightly

* Address review comments

* Fix typo

* Use markdown built-in anchors
abellina authored Feb 18, 2021
1 parent e6b0fc3 commit 1a1d6fa
Showing 1 changed file with 218 additions and 19 deletions: docs/get-started/getting-started-on-prem.md
…in these scenarios:
- _GPU-to-GPU_: Shuffle blocks that were able to fit in GPU memory.
- _Host-to-GPU_ and _Disk-to-GPU_: Shuffle blocks that spilled to host (or disk) but will be manifested
in the GPU in the downstream Spark task.

### System Setup

In order to enable the _RapidsShuffleManager_, the UCX user-space libraries and their dependencies
must be installed on the host and within Docker containers. The host has additional requirements,
such as the MLNX_OFED driver and the `nv_peer_mem` kernel module.

#### Baremetal

1. If you have Mellanox hardware, please ensure you have the [MLNX_OFED driver](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) and the
[`nv_peer_mem` kernel module](https://www.mellanox.com/products/GPUDirect-RDMA) installed. UCX packages
are compatible with MLNX_OFED 5.0+. Please install the latest driver available. A quick way to
verify this host setup is sketched after this list.

With `nv_peer_mem` (GPUDirectRDMA), IB/RoCE-based transfers can perform zero-copy transfers
directly from GPU memory. Note that GPUDirectRDMA is known to show
[performance issues and bugs](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems)
on machines where GPUs and NICs are not connected to PCIe switches (i.e. they are attached
directly to the root complex).

If you encounter issues or poor performance, GPUDirectRDMA can be disabled via the
UCX environment variable `UCX_IB_GPU_DIRECT_RDMA=no`, but please
[file a GitHub issue](https://github.com/NVIDIA/spark-rapids/issues) so we can investigate
further.

2. (skip if you followed Step 1) For setups without RoCE/Infiniband, UCX 1.9.0 packaging
requires RDMA packages for installation. UCX 1.10.0+ will relax these requirements for these
types of environments, so this step should not be needed in the future.

If you still want to install UCX 1.9.0 on a machine without RoCE/Infiniband hardware, please
build and install `rdma-core`. You can use the [Docker sample below](#ucx-minimal-dockerfile)
as a reference.

3. Fetch and install the UCX package for your OS and CUDA version from
[UCX 1.9.0](https://github.com/openucx/ucx/releases/tag/v1.9.0).

UCX versions 1.9 and below require the user to install the cuda-compat package
matching the CUDA version of the UCX package (e.g. `cuda-compat-11-1`), in addition to:
`ibverbs-providers`, `libgomp1`, `libibverbs1`, `libnuma1`, `librdmacm1`.

UCX 1.10.0+ will drop the `cuda-compat` requirement and remove the explicit RDMA
dependencies, greatly simplifying installation in some cases. Note that these dependencies
are already met if you followed Step 1 or 2 above.
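
Once the driver, kernel module, and UCX packages are in place, a quick sanity check can confirm
the host setup. The commands below are a minimal sketch, assuming MLNX_OFED and UCX were installed
from the vendor packages described in Steps 1 and 3:

```shell
# Print the installed MLNX_OFED version (ofed_info is provided by the MLNX_OFED install).
ofed_info -s

# Confirm the nv_peer_mem kernel module is loaded (required for GPUDirectRDMA).
lsmod | grep nv_peer_mem

# Confirm UCX is installed and print its version and build configuration.
ucx_info -v
```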

#### Docker containers

Running UCX in containers imposes certain requirements. In a multi-GPU system, all GPUs that
will take advantage of PCIe peer-to-peer or NVLink need to be visible within the container. For
example, if two containers are trying to communicate and each has an isolated GPU, the link between
these GPUs will not be optimal, forcing UCX to stage buffers in host memory or to fall back to TCP.
Additionally, if you want to use RoCE/Infiniband, the `/dev/infiniband` device should be exposed
in the container.

If UCX will be used to communicate between containers, the IPC (`--ipc`) and
PID namespaces (`--pid`) should also be shared.

As of the writing of this document we have successfully tested `--privileged` containers, which
essentially turns off all isolation. We will revise this document to include any new configurations
as we are able to test different scenarios.
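
As an illustration, a container launch that meets these requirements might look like the sketch
below. This is a hypothetical example rather than a tested recommendation: the image name
`rapids-ucx-example` is a placeholder, and with `--privileged` the explicit `--device` flag is
technically redundant, but it is shown to make the requirement visible.

```shell
# Hypothetical docker run sketch: expose all GPUs and the RDMA devices, and share
# the IPC and PID namespaces (with the host here, for simplicity) so UCX peers can
# communicate across containers.
docker run --rm -it \
  --privileged \
  --gpus all \
  --device /dev/infiniband \
  --ipc host \
  --pid host \
  rapids-ucx-example:latest /bin/bash
```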

1. A system administrator should have performed Step 1 in [Baremetal](#baremetal) on the
host system.

2. Within the Docker container we need to install UCX and its requirements. The following is an
example Dockerfile that shows how to install `rdma-core` and UCX 1.9.0 with
`cuda-10.1` support. You can use this as a base layer for the containers that your executors
will use.

<a name="ucx-minimal-dockerfile"></a>

```dockerfile
ARG CUDA_VER=10.1
# Throw away image to build rdma_core
FROM ubuntu:18.04 as rdma_core
RUN apt update
RUN apt-get install -y dh-make git build-essential cmake gcc libudev-dev libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind python3-dev cython3 python3-docutils pandoc
RUN git clone --depth 1 --branch v33.0 https://github.com/linux-rdma/rdma-core
RUN cd rdma-core && debian/rules binary
# Now start the main container
FROM nvidia/cuda:${CUDA_VER}-devel-ubuntu18.04
COPY --from=rdma_core /*.deb /tmp/
RUN apt update
RUN apt-get install -y cuda-compat-10-1 wget udev dh-make libnuma1 libudev-dev libnl-3-dev libnl-route-3-dev python3-dev cython3
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v1.9.0/ucx-v1.9.0-ubuntu18.04-mofed5.0-1.0.0.0-cuda10.1.deb
RUN dpkg -i /tmp/*.deb && rm -rf /tmp/*.deb
```
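
For example, you might build this image and use it as the base for your executor containers. The
invocation below is a sketch; the tag name is arbitrary and it assumes the Dockerfile above is
saved in the current directory:

```shell
# Build the UCX base image from the Dockerfile above; the resulting image can be
# extended into the image your Spark executors run.
docker build -t rapids-ucx-base:cuda10.1 .
```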

### Validating UCX Environment

After installing UCX you can utilize `ucx_info` and `ucx_perftest` to validate the installation.

In this section, we are using a docker container built using the sample dockerfile above.

1. Check whether UCX can link against CUDA:
```
root@test-machine:/# ucx_info -d|grep cuda
# Memory domain: cuda_cpy
# Component: cuda_cpy
# Transport: cuda_copy
# Device: cuda
# Memory domain: cuda_ipc
# Component: cuda_ipc
# Transport: cuda_ipc
# Device: cuda
```

2. Check that the Mellanox device is seen by UCX, and which transports are enabled (e.g. `rc`):
```
root@test-machine:/# ucx_info -d|grep mlx5_3:1 -B1
# Transport: rc_verbs
# Device: mlx5_3:1
--
# Transport: rc_mlx5
# Device: mlx5_3:1
--
# Transport: dc_mlx5
# Device: mlx5_3:1
--
# Transport: ud_verbs
# Device: mlx5_3:1
--
# Transport: ud_mlx5
# Device: mlx5_3:1
```

3. Run `ucx_perftest` to get a good idea of whether things are working as you expect.

Example 1: GPU <-> GPU in the same host. Without NVLink you should expect PCIe speeds. In this
case this is PCIe gen3, so bandwidth on the order of ~10 GB/s is expected. It should also match
the performance seen in `p2pBandwidthLatencyTest`, which is included with the CUDA toolkit.

- On server container:
```
root@test-server:/# CUDA_VISIBLE_DEVICES=0 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda
```

- On client container:
```
root@test-client:/# CUDA_VISIBLE_DEVICES=1 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda localhost
+--------------+--------------+-----------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | typical | average | overall | average | overall | average | overall |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
Final: 1000 0.000 986.122 986.122 9670.96 9670.96 1014 1014
```
Example 2: GPU <-> GPU across the network, using GPUDirectRDMA. You will notice that in this
example we picked GPU 3. In our test machine, GPU 3 is closest (same root complex) to the NIC
we are using for RoCE, and yields better performance than, for example, GPU 0, which sits
on a different socket.

- On server container:
```
root@test-server:/# CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda
```
- On client container:
```
root@test-client:/# CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda test-server
+--------------+--------------+-----------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | typical | average | overall | average | overall | average | overall |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[thread 0] 498 0.000 2016.444 2016.444 4729.49 4729.49 496 496
[thread 0] 978 0.000 2088.412 2051.766 4566.50 4648.07 479 487
Final: 1000 0.000 3739.639 2088.899 2550.18 4565.44 267 479
```
Example 3: GPU <-> GPU across the network, without GPUDirectRDMA. You will notice that the
bandwidth achieved is higher than with GPUDirectRDMA on. This is expected, and a known issue in
machines where GPUs and NICs are connected directly to the root complex.

- On server container:
```
root@test-server:/# UCX_IB_GPU_DIRECT_RDMA=no CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda
```

- On client container:
```
root@test-client:/# UCX_IB_GPU_DIRECT_RDMA=no CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda test-server
+--------------+--------------+-----------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | typical | average | overall | average | overall | average | overall |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[thread 0] 670 0.000 1497.859 1497.859 6366.91 6366.91 668 668
Final: 1000 0.000 1718.843 1570.784 5548.35 6071.33 582 637
```

### Spark App Configuration

1. Choose the version of the shuffle manager that matches your Spark version.
Currently we support:

| Spark Shim | spark.shuffle.manager value |
| -----------| -------------------------------------------------------- |
| 3.0.0 | com.nvidia.spark.rapids.spark300.RapidsShuffleManager |
| 3.0.0 EMR | com.nvidia.spark.rapids.spark300emr.RapidsShuffleManager |
| 3.0.1 | com.nvidia.spark.rapids.spark301.RapidsShuffleManager |
| 3.0.1 EMR | com.nvidia.spark.rapids.spark301emr.RapidsShuffleManager |
| 3.0.2 | com.nvidia.spark.rapids.spark302.RapidsShuffleManager |
| 3.1.1 | com.nvidia.spark.rapids.spark311.RapidsShuffleManager |

2. Recommended settings for UCX 1.9.0+
```shell
...
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark300.RapidsShuffleManager \
--conf spark.shuffle.service.enabled=false \
--conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp \
--conf spark.executorEnv.UCX_ERROR_SIGNALS= \
--conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 \
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib:/usr/lib/ucx \
...
```

Please note that `LD_LIBRARY_PATH` should be set if the UCX library is installed in a
non-standard location.
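
For instance, if UCX were built from source and installed under a hypothetical prefix such as
`/usr/local/ucx`, the corresponding setting might look like:

```shell
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/ucx/lib \
```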

#### UCX Environment Variables
- `UCX_TLS`:
  - `cuda_copy`, and `cuda_ipc`: enables handling of CUDA memory in UCX, both for copy-based transport
    and peer-to-peer communication between GPUs (NVLink/PCIe).
  - `rc`: enables RDMA-based transports (Infiniband, RoCE).
  - `tcp`: allows falling back to TCP when RDMA cannot be used.
- `UCX_ERROR_SIGNALS=`: Disables UCX signal catching, as it can cause issues with the JVM.
- `UCX_MAX_RNDV_RAILS=1`: Set this to `1` to disable multi-rail transfers in UCX, where UCX splits
  data to utilize various channels (e.g. two NICs). A value greater than `1` can cause a performance drop
  for high-bandwidth transports between GPUs.
- `UCX_MEMTYPE_CACHE=n`: Disables a cache in UCX that can cause UCX to fail when running with CUDA buffers.
- `UCX_RNDV_SCHEME=put_zcopy`: By default, `UCX_RNDV_SCHEME=auto` picks different schemes for
the RNDV protocol (`get_zcopy` or `put_zcopy`) depending on message size and on other parameters
given the hardware, transports, and settings. In our testing, especially with UCX 1.9.0, we have
found `UCX_RNDV_SCHEME=put_zcopy` to be more reliable than automatic detection or `get_zcopy`.
The main difference between get and put is the direction of transfer: a send operation under
`get_zcopy` is actually an `RDMA READ` issued by the receiver, whereas the same send is an
`RDMA WRITE` issued by the sender when `put_zcopy` is utilized.

### RapidsShuffleManager Fine Tuning
Here are some settings that can be used to fine-tune the _RapidsShuffleManager_: