# Algorithms

Index of algorithms provided by Gloo and their semantics.

Variables used:

  * P: Number of processes/machines
  * N: Number of buffers per process
  * S: Size of buffer

Terms used:

  * Communication steps: the number of communication steps an algorithm takes. Every communication step incurs some latency, depending on the transport. Therefore, the fewer steps an algorithm uses, the better suited it is to higher latency transports. Lower latency transports tolerate more communication steps.
  * Bytes on the wire: the total number of bytes transmitted per participating process. The higher this number, the sooner an algorithm becomes bound by network bandwidth.

## Allreduce

Compute sum of N arrays per process across P processes. This computation happens in place; all input arrays contain the resulting sum after the algorithm completes.

There are 3 phases to each implementation of this algorithm (a minimal sketch follows the list):

  1. Local reduction of N buffers
  2. Allreduce between processes
  3. Broadcast result back to N buffers
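
To make the phase structure concrete, here is a minimal single-process sketch in C++. The `allreduce` helper over N in-memory buffers is hypothetical, not Gloo's actual API, and phase 2 is left as a placeholder (see allreduce_ring below).

```cpp
#include <vector>

// Hypothetical sketch of the three phases; not Gloo's API.
void allreduce(std::vector<std::vector<float>>& bufs /* N local buffers */) {
  // Phase 1: local reduction of the N buffers into bufs[0].
  for (size_t i = 1; i < bufs.size(); i++) {
    for (size_t j = 0; j < bufs[0].size(); j++) {
      bufs[0][j] += bufs[i][j];
    }
  }
  // Phase 2: allreduce between processes, e.g. via the ring
  // algorithms described below. Omitted here.
  // Phase 3: broadcast the result back to the other N-1 buffers.
  for (size_t i = 1; i < bufs.size(); i++) {
    bufs[i] = bufs[0];
  }
}
```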

### allreduce_ring

  * Communication steps: P-1
  * Bytes on the wire: P*S

Phase 2 is implemented as follows (simulated in the sketch after this list):

  1. Transmit local result to right side neighbor
  2. Receive buffer from left side neighbor and reduce into local result
  3. Transmit incoming buffer to right side neighbor
  4. Repeat 2-3 until process has seen all data
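
The following self-contained simulation sketches these steps, assuming P in-memory "ranks" with a one-element buffer each in place of real processes and transports; it is illustrative only, not Gloo's implementation.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int P = 4; // number of simulated ranks
  // Each rank's local result after phase 1 (a one-element buffer here).
  std::vector<float> result(P);
  for (int r = 0; r < P; r++) result[r] = float(r + 1);

  std::vector<float> inflight(result); // buffer traveling around the ring

  // P-1 steps: receive from the left neighbor, reduce into the local
  // result, and forward the received buffer to the right neighbor.
  for (int step = 0; step < P - 1; step++) {
    std::vector<float> received(P);
    for (int r = 0; r < P; r++) {
      received[r] = inflight[(r - 1 + P) % P]; // from left neighbor
    }
    for (int r = 0; r < P; r++) {
      result[r] += received[r]; // reduce into local result
    }
    inflight = received; // this is what gets forwarded next step
  }

  for (int r = 0; r < P; r++) {
    std::printf("rank %d sum = %g\n", r, result[r]); // all print 10
  }
}
```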

### allreduce_ring_chunked

  * Communication steps: 4*P
  * Bytes on the wire: 2*S

Phase 2 is implemented in 2 sub-phases:

  1. First, the algorithm iterates over the local reduction, transmitting chunks of the buffer and reducing at every step. The number of chunks is equal to 2*P, allowing double buffering to be used. This means there is always one chunk in flight while reduction is done on another chunk concurrently. At the end of this phase, every process holds 1/P of the reduced result.
  2. Second, the algorithm iterates over the local reduction again, now broadcasting the local results.

With 2*P chunks and two sub-phases, we arrive at 4*P communication steps.

These sub-phases are implemented roughly as follows (both are simulated in the sketch after these lists):

First:

  1. Compute offset into local reduction buffer based on process rank
  2. Transmit chunk at offset to right side neighbor
  3. Receive chunk at offset-1 from left side neighbor and reduce into local result
  4. Subtract 1 from offset, wrapping when needed
  5. Repeat 2-4 until process has walked entire buffer

Second:

  1. Transmit chunk at offset+1 (containing the global reduction) to right side neighbor
  2. Receive chunk at offset from left side neighbor and copy into local result
  3. Subtract 1 from offset, wrapping when needed
  4. Repeat 1-3 until process has walked entire buffer
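
Below is a self-contained simulation of both sub-phases, assuming P in-memory ranks and P chunks per buffer (one float each) for readability; Gloo itself uses 2*P chunks to get double buffering, but the reduce-scatter/allgather structure is the same. This is a sketch, not Gloo's code.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int P = 4; // ranks, and chunks per buffer
  // data[r][c] is chunk c of rank r's buffer (one float per chunk here).
  std::vector<std::vector<float>> data(P, std::vector<float>(P));
  for (int r = 0; r < P; r++)
    for (int c = 0; c < P; c++) data[r][c] = float(r + 1);

  // Sub-phase 1 (reduce-scatter): each rank sends the chunk at its
  // offset to the right, receives the chunk at offset-1 from the left,
  // reduces it, and decrements its offset. Afterwards, rank r holds the
  // fully reduced chunk at position (r + 1) % P.
  for (int step = 0; step < P - 1; step++) {
    std::vector<std::vector<float>> next = data;
    for (int r = 0; r < P; r++) {
      int send = (r - step + P) % P; // offset walks down each step
      int right = (r + 1) % P;
      next[right][send] += data[r][send]; // receiver reduces incoming chunk
    }
    data = next;
  }

  // Sub-phase 2 (allgather): forward the reduced chunk around the ring;
  // receivers copy instead of reduce.
  for (int step = 0; step < P - 1; step++) {
    std::vector<std::vector<float>> next = data;
    for (int r = 0; r < P; r++) {
      int send = (r + 1 - step + P) % P; // chunk holding the global result
      int right = (r + 1) % P;
      next[right][send] = data[r][send];
    }
    data = next;
  }

  for (int c = 0; c < P; c++)
    std::printf("rank 0 chunk %d = %g\n", c, data[0][c]); // all print 10
}
```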

### cuda_allreduce_ring

CUDA-aware implementation of allreduce_ring. GPU-side buffers are copied to system memory in parallel before the local reduction runs on the CPU. After phase 2 completes, the CPU-side result is copied back to the GPU-side buffers in parallel.
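
A rough sketch of this staging, using the plain CUDA runtime API with one stream per buffer so the device-to-host copies can overlap. `stageAndReduce` is a hypothetical helper, not Gloo's code, and true overlap additionally requires pinned host memory (e.g. allocated via cudaMallocHost).

```cpp
#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

// Hypothetical helper, not Gloo's code: copy N device buffers to host in
// parallel (one stream each), then reduce them into hostBufs[0] on the CPU.
void stageAndReduce(
    const std::vector<float*>& gpuBufs,        // N device pointers
    std::vector<std::vector<float>>& hostBufs, // N host staging buffers
    size_t count) {
  std::vector<cudaStream_t> streams(gpuBufs.size());
  for (size_t i = 0; i < gpuBufs.size(); i++) {
    cudaStreamCreate(&streams[i]);
    // Copies issued on distinct streams may overlap (pinned host memory
    // is needed for real asynchrony; std::vector storage is pageable).
    cudaMemcpyAsync(hostBufs[i].data(), gpuBufs[i], count * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
  }
  for (auto& s : streams) {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
  // Phase 1: local reduction on the CPU; hostBufs[0] holds the sum.
  for (size_t i = 1; i < hostBufs.size(); i++) {
    for (size_t j = 0; j < count; j++) {
      hostBufs[0][j] += hostBufs[i][j];
    }
  }
  // Phase 2 (ring allreduce) then runs on hostBufs[0]; afterwards the
  // result is copied back to each gpuBufs[i] the same way, in parallel.
}
```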

### cuda_allreduce_ring_chunked

CUDA-aware implementation of allreduce_ring_chunked. GPU-side buffers are reduced into GPU buffer 0 (using NCCL). The result is copied to system memory asynchronously. After phase 2 completes, the CPU-side result is copied back to GPU buffer 0 and then broadcast to the other GPU buffers in parallel (using NCCL).

Both the local reduction in phase 1 and the broadcast in phase 3 are pipelined with the communication steps where this data is needed or becomes available.

## Barrier

Synchronization point between processes.

### barrier_all_to_all

  * Communication steps: 1
  * Bytes on the wire: P

Every process sends a notification to every other process. Then, it waits for a notification from every other process.
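
As an illustration, the sketch below simulates this pattern with threads standing in for processes and a per-rank atomic counter standing in for notifications; it is an assumption-laden toy, not Gloo's transport code.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int P = 4;
  // arrived[r] counts how many peers have notified rank r.
  std::vector<std::atomic<int>> arrived(P);
  for (auto& a : arrived) a.store(0);

  auto barrier = [&](int rank) {
    // Send a notification to every other process...
    for (int peer = 0; peer < P; peer++) {
      if (peer != rank) arrived[peer].fetch_add(1);
    }
    // ...then wait for a notification from every other process.
    while (arrived[rank].load() < P - 1) std::this_thread::yield();
    std::printf("rank %d passed the barrier\n", rank);
  };

  std::vector<std::thread> threads;
  for (int r = 0; r < P; r++) threads.emplace_back(barrier, r);
  for (auto& t : threads) t.join();
}
```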

### barrier_all_to_one

  * Communication steps: 2
  * Bytes on the wire: 1 for non-root, P for root

Non-root processes: send notification to root, wait for notification from root.

Root process: wait for notification from P-1 processes, send notification to P-1 processes.
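
The same kind of thread-based toy, sketching the all-to-one variant under the same assumptions (threads for processes, atomics for notifications); not Gloo's implementation.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int P = 4;
  std::atomic<int> checkedIn{0};     // notifications received by the root
  std::atomic<bool> released{false}; // the root's notification back out

  auto barrier = [&](int rank) {
    if (rank == 0) {
      // Root: wait for notifications from P-1 processes, then notify them.
      while (checkedIn.load() < P - 1) std::this_thread::yield();
      released.store(true);
    } else {
      // Non-root: send notification to root, wait for notification back.
      checkedIn.fetch_add(1);
      while (!released.load()) std::this_thread::yield();
    }
    std::printf("rank %d passed the barrier\n", rank);
  };

  std::vector<std::thread> threads;
  for (int r = 0; r < P; r++) threads.emplace_back(barrier, r);
  for (auto& t : threads) t.join();
}
```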

## Broadcast

Broadcast the contents of a buffer on one process to the other P-1 processes.

### broadcast_one_to_all

  * Communication steps: 1
  * Bytes on the wire: P*S

Non-root processes: receive buffer from root.

Root process: send buffer to P-1 processes.
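
A minimal in-memory sketch of this pattern, assuming P local buffers stand in for per-process buffers; the P-1 "sends" become plain copies. Illustrative only, not Gloo's code.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int P = 4;
  const int root = 0;
  // bufs[r] stands in for the buffer on process r.
  std::vector<std::vector<float>> bufs(P, std::vector<float>(3, 0.0f));
  bufs[root] = {1.0f, 2.0f, 3.0f}; // only the root starts with the payload

  // The root "sends" its buffer to each of the other P-1 ranks; each
  // non-root rank "receives" it.
  for (int r = 0; r < P; r++) {
    if (r != root) bufs[r] = bufs[root];
  }

  for (int r = 0; r < P; r++) {
    std::printf("rank %d: %g %g %g\n", r, bufs[r][0], bufs[r][1], bufs[r][2]);
  }
}
```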