# Algorithms

Index of algorithms provided by Gloo and their semantics.

Variables used:

  * P: Number of processes/machines
  * N: Number of buffers per process
  * S: Size of buffer

Terms used:

  * Communication steps: the number of communication steps an algorithm takes. Every communication step incurs some latency, depending on the transport. Therefore, the fewer steps an algorithm uses, the better suited it is to higher latency transports. Lower latency transports tolerate more communication steps.
  * Bytes on the wire: the total number of bytes transmitted per participating process. The higher this number, the sooner an algorithm becomes bound by network bandwidth.

## Allreduce

Compute sum of N arrays per process across P processes. This computation happens in place; all input arrays contain the resulting sum after the algorithm completes.

There are 3 phases to each implementation of this algorithm (a minimal sketch follows the list):

  1. Local reduction of N buffers
  2. Allreduce between processes
  3. Broadcast result back to N buffers
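
To make the phase structure concrete, here is a minimal single-process sketch in C++. The `allreduce` helper over N in-memory buffers is hypothetical, not Gloo's actual API, and phase 2 is left as a placeholder (see allreduce_ring below).

```cpp
#include <vector>

// Hypothetical sketch of the three phases; not Gloo's API.
void allreduce(std::vector<std::vector<float>>& bufs /* N local buffers */) {
  // Phase 1: local reduction of the N buffers into bufs[0].
  for (size_t i = 1; i < bufs.size(); i++) {
    for (size_t j = 0; j < bufs[0].size(); j++) {
      bufs[0][j] += bufs[i][j];
    }
  }
  // Phase 2: allreduce between processes, e.g. via the ring
  // algorithms described below. Omitted here.
  // Phase 3: broadcast the result back to the other N-1 buffers.
  for (size_t i = 1; i < bufs.size(); i++) {
    bufs[i] = bufs[0];
  }
}
```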

### allreduce_ring

  * Communication steps: P-1
  * Bytes on the wire: P*S

Phase 2 is implemented as follows (simulated in the sketch after this list):

  1. Transmit local result to right side neighbor
  2. Receive buffer from left side neighbor and reduce into local result
  3. Transmit incoming buffer to right side neighbor
  4. Repeat 2-3 until process has seen all data
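
The following self-contained simulation sketches these steps, assuming P in-memory "ranks" with a one-element buffer each in place of real processes and transports; it is illustrative only, not Gloo's implementation.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int P = 4; // number of simulated ranks
  // Each rank's local result after phase 1 (a one-element buffer here).
  std::vector<float> result(P);
  for (int r = 0; r < P; r++) result[r] = float(r + 1);

  std::vector<float> inflight(result); // buffer traveling around the ring

  // P-1 steps: receive from the left neighbor, reduce into the local
  // result, and forward the received buffer to the right neighbor.
  for (int step = 0; step < P - 1; step++) {
    std::vector<float> received(P);
    for (int r = 0; r < P; r++) {
      received[r] = inflight[(r - 1 + P) % P]; // from left neighbor
    }
    for (int r = 0; r < P; r++) {
      result[r] += received[r]; // reduce into local result
    }
    inflight = received; // this is what gets forwarded next step
  }

  for (int r = 0; r < P; r++) {
    std::printf("rank %d sum = %g\n", r, result[r]); // all print 10
  }
}
```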

### allreduce_ring_chunked

  * Communication steps: 4*P
  * Bytes on the wire: 2*S

Phase 2 is implemented in 2 sub-phases:

  1. First, the algorithm iterates over the local reduction, transmitting chunks of the buffer and reducing at every step. The number of chunks is equal to 2*P, allowing double buffering to be used. This means there is always one chunk in flight while reduction is done on another chunk concurrently. At the end of this phase, every process holds 1/P of the reduced result.
  2. Second, the algorithm iterates over the local reduction again, now broadcasting the local results.

With 2*P chunks and two sub-phases, we arrive at 4*P communication steps.

These sub-phases are implemented roughly as follows (both are simulated in the sketch after these lists):

First:

  1. Compute offset into local reduction buffer based on process rank
  2. Transmit chunk at offset to right side neighbor
  3. Receive chunk at offset-1 from left side neighbor and reduce into local result
  4. Subtract 1 from offset, wrapping when needed
  5. Repeat 2-4 until process has walked entire buffer

Second:

  1. Transmit chunk at offset+1 (containing the global reduction) to right side neighbor
  2. Receive chunk at offset from left side neighbor and copy into local result
  3. Subtract 1 from offset, wrapping when needed
  4. Repeat 1-3 until process has walked entire buffer
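
Below is a self-contained simulation of both sub-phases, assuming P in-memory ranks and P chunks per buffer (one float each) for readability; Gloo itself uses 2*P chunks to get double buffering, but the reduce-scatter/allgather structure is the same. This is a sketch, not Gloo's code.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int P = 4; // ranks, and chunks per buffer
  // data[r][c] is chunk c of rank r's buffer (one float per chunk here).
  std::vector<std::vector<float>> data(P, std::vector<float>(P));
  for (int r = 0; r < P; r++)
    for (int c = 0; c < P; c++) data[r][c] = float(r + 1);

  // Sub-phase 1 (reduce-scatter): each rank sends the chunk at its
  // offset to the right, receives the chunk at offset-1 from the left,
  // reduces it, and decrements its offset. Afterwards, rank r holds the
  // fully reduced chunk at position (r + 1) % P.
  for (int step = 0; step < P - 1; step++) {
    std::vector<std::vector<float>> next = data;
    for (int r = 0; r < P; r++) {
      int send = (r - step + P) % P; // offset walks down each step
      int right = (r + 1) % P;
      next[right][send] += data[r][send]; // receiver reduces incoming chunk
    }
    data = next;
  }

  // Sub-phase 2 (allgather): forward the reduced chunk around the ring;
  // receivers copy instead of reduce.
  for (int step = 0; step < P - 1; step++) {
    std::vector<std::vector<float>> next = data;
    for (int r = 0; r < P; r++) {
      int send = (r + 1 - step + P) % P; // chunk holding the global result
      int right = (r + 1) % P;
      next[right][send] = data[r][send];
    }
    data = next;
  }

  for (int c = 0; c < P; c++)
    std::printf("rank 0 chunk %d = %g\n", c, data[0][c]); // all print 10
}
```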

### cuda_allreduce_ring

CUDA-aware implementation of allreduce_ring. GPU-side buffers are copied to system memory in parallel before the local reduction runs on the CPU. After phase 2 completes, the CPU-side result is copied back to the GPU-side buffers in parallel.
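
A rough sketch of this staging, using the plain CUDA runtime API with one stream per buffer so the device-to-host copies can overlap. `stageAndReduce` is a hypothetical helper, not Gloo's code, and true overlap additionally requires pinned host memory (e.g. allocated via cudaMallocHost).

```cpp
#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

// Hypothetical helper, not Gloo's code: copy N device buffers to host in
// parallel (one stream each), then reduce them into hostBufs[0] on the CPU.
void stageAndReduce(
    const std::vector<float*>& gpuBufs,        // N device pointers
    std::vector<std::vector<float>>& hostBufs, // N host staging buffers
    size_t count) {
  std::vector<cudaStream_t> streams(gpuBufs.size());
  for (size_t i = 0; i < gpuBufs.size(); i++) {
    cudaStreamCreate(&streams[i]);
    // Copies issued on distinct streams may overlap (pinned host memory
    // is needed for real asynchrony; std::vector storage is pageable).
    cudaMemcpyAsync(hostBufs[i].data(), gpuBufs[i], count * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
  }
  for (auto& s : streams) {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
  // Phase 1: local reduction on the CPU; hostBufs[0] holds the sum.
  for (size_t i = 1; i < hostBufs.size(); i++) {
    for (size_t j = 0; j < count; j++) {
      hostBufs[0][j] += hostBufs[i][j];
    }
  }
  // Phase 2 (ring allreduce) then runs on hostBufs[0]; afterwards the
  // result is copied back to each gpuBufs[i] the same way, in parallel.
}
```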

### cuda_allreduce_ring_chunked

CUDA-aware implementation of allreduce_ring_chunked. GPU-side buffers are reduced into GPU buffer 0 (using NCCL). The result is copied to system memory asynchronously. After phase 2 completes, the CPU-side result is copied back to GPU buffer 0 and then broadcast to the other GPU buffers in parallel (using NCCL).

Both the local reduction in phase 1 and the broadcast in phase 3 are pipelined with the communication steps where this data is needed or becomes available.

## Barrier

Synchronization point between processes.

### barrier_all_to_all

  * Communication steps: 1
  * Bytes on the wire: P

Every process sends a notification to every other process. Then, it waits for a notification from every other process.
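
As an illustration, the sketch below simulates this pattern with threads standing in for processes and a per-rank atomic counter standing in for notifications; it is an assumption-laden toy, not Gloo's transport code.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int P = 4;
  // arrived[r] counts how many peers have notified rank r.
  std::vector<std::atomic<int>> arrived(P);
  for (auto& a : arrived) a.store(0);

  auto barrier = [&](int rank) {
    // Send a notification to every other process...
    for (int peer = 0; peer < P; peer++) {
      if (peer != rank) arrived[peer].fetch_add(1);
    }
    // ...then wait for a notification from every other process.
    while (arrived[rank].load() < P - 1) std::this_thread::yield();
    std::printf("rank %d passed the barrier\n", rank);
  };

  std::vector<std::thread> threads;
  for (int r = 0; r < P; r++) threads.emplace_back(barrier, r);
  for (auto& t : threads) t.join();
}
```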

### barrier_all_to_one

  * Communication steps: 2
  * Bytes on the wire: 1 for non-root, P for root

Non-root processes: send notification to root, wait for notification from root.

Root process: wait for notification from P-1 processes, send notification to P-1 processes.
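
The same kind of thread-based toy, sketching the all-to-one variant under the same assumptions (threads for processes, atomics for notifications); not Gloo's implementation.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int P = 4;
  std::atomic<int> checkedIn{0};     // notifications received by the root
  std::atomic<bool> released{false}; // the root's notification back out

  auto barrier = [&](int rank) {
    if (rank == 0) {
      // Root: wait for notifications from P-1 processes, then notify them.
      while (checkedIn.load() < P - 1) std::this_thread::yield();
      released.store(true);
    } else {
      // Non-root: send notification to root, wait for notification back.
      checkedIn.fetch_add(1);
      while (!released.load()) std::this_thread::yield();
    }
    std::printf("rank %d passed the barrier\n", rank);
  };

  std::vector<std::thread> threads;
  for (int r = 0; r < P; r++) threads.emplace_back(barrier, r);
  for (auto& t : threads) t.join();
}
```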

## Broadcast

Broadcast the contents of a buffer on one process to the other P-1 processes.

### broadcast_one_to_all

  * Communication steps: 1
  * Bytes on the wire: P*S

Non-root processes: receive buffer from root.

Root process: send buffer to P-1 processes.
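
A minimal in-memory sketch of this pattern, assuming P local buffers stand in for per-process buffers; the P-1 "sends" become plain copies. Illustrative only, not Gloo's code.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int P = 4;
  const int root = 0;
  // bufs[r] stands in for the buffer on process r.
  std::vector<std::vector<float>> bufs(P, std::vector<float>(3, 0.0f));
  bufs[root] = {1.0f, 2.0f, 3.0f}; // only the root starts with the payload

  // The root "sends" its buffer to each of the other P-1 ranks; each
  // non-root rank "receives" it.
  for (int r = 0; r < P; r++) {
    if (r != root) bufs[r] = bufs[root];
  }

  for (int r = 0; r < P; r++) {
    std::printf("rank %d: %g %g %g\n", r, bufs[r][0], bufs[r][1], bufs[r][2]);
  }
}
```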