More documentation
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4644734

fbshipit-source-id: 50f5fadd2c5cd04e06a025f5538187ed852e669a
pietern authored and facebook-github-bot committed Mar 2, 2017
1 parent 837023b commit 70fc15c
Showing 3 changed files with 153 additions and 13 deletions.
79 changes: 68 additions & 11 deletions README.md
@@ -8,20 +8,28 @@ IP can be used at all times, or InfiniBand (or RoCE) when available.

Where applicable, algorithms have an implementation that works with
system memory buffers, and one that works with NVIDIA GPU memory
buffers. In the latter case, if the InfiniBand transport is used,
GPUDirect can be used to accelerate cross machine GPU-to-GPU memory
transfers.
buffers. In the latter case, it is not necessary to copy memory between
host and device; this is taken care of by the algorithm implementations.

## Requirements
Gloo is built to run on Linux and has no hard dependencies other than libc.
Gloo is built to run on Linux and has no hard dependencies other than libstdc++.

Optional dependencies are:
* cuda -- for CUDA algorithms, tests, and benchmark
* googletest -- to build and run tests
* eigen -- for fast floating point routines
* hiredis -- for coordinating machine rendezvous through Redis
* **[CUDA][cuda] and [NCCL][nccl]** -- for CUDA aware algorithms, tests, and benchmark
* **[Google Test][gtest]** -- to build and run tests
* **[Eigen][eigen]** -- for fast floating point routines
* **[Hiredis][hiredis]** -- for coordinating machine rendezvous through Redis

## Usage
[cuda]: http://www.nvidia.com/object/cuda_home_new.html
[nccl]: https://github.com/nvidia/nccl
[gtest]: https://github.com/google/googletest
[eigen]: http://eigen.tuxfamily.org
[hiredis]: https://github.com/redis/hiredis

## Documentation
Please refer to [docs/](docs/) for detailed documentation.

## Building
You can build Gloo using CMake.

Since it is a library, it is most convenient to vendor it in your own
@@ -45,8 +53,57 @@ cmake ../ -DBUILD_TEST=1 -DBUILD_BENCHMARK=1
ls -l gloo/gloo_{test,benchmark}
```

## Documentation
Please refer to [docs/](docs/) for detailed documentation.
## Benchmarking
The benchmark tool depends on 1) Eigen for floating point math and 2)
Redis/Hiredis for rendezvous. The benchmark tool for CUDA algorithms
additionally depends on both CUDA and NCCL.

To run a benchmark:
1. Copy the benchmark tool to all participating machines.
2. Start a Redis server on any host (either a client machine or one of
the machines participating in the test).
3. Determine some unique ID for the benchmark run (e.g. the `uuid`
tool or some number).
4. On each machine, run (or pass `--help` for more options):

``` text
./benchmark \
--size <number of machines> \
--rank <index of this machine, starting at 0> \
--redis-host <Redis host> \
--redis-port <Redis port> \
--prefix <unique identifier for this run> \
--transport tcp \
--elements <number of elements; -1 for a sweep> \
--iteration-time 1s \
allreduce_ring_chunked
```
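
The per-machine invocation above differs only in `--rank`, so the command lines can be generated rather than typed by hand. The following is a minimal sketch, not part of Gloo: the helper name `benchmark_command`, the host `redis.example.com`, and port 6379 are assumptions for illustration, while the flags are the ones listed above.

```python
import shlex
import uuid
from typing import List


def benchmark_command(size: int, rank: int, redis_host: str, redis_port: int,
                      prefix: str, elements: int = -1,
                      binary: str = "./benchmark") -> List[str]:
    """Build the argument list for one machine (hypothetical helper)."""
    return [
        binary,
        "--size", str(size),
        "--rank", str(rank),
        "--redis-host", redis_host,
        "--redis-port", str(redis_port),
        "--prefix", prefix,
        "--transport", "tcp",
        "--elements", str(elements),  # -1 requests a sweep
        "--iteration-time", "1s",
        "allreduce_ring_chunked",
    ]


if __name__ == "__main__":
    # One unique prefix is shared by every machine in the same run.
    prefix = str(uuid.uuid4())
    for rank in range(4):
        # Host and port are placeholders; substitute your Redis server.
        print(shlex.join(benchmark_command(4, rank, "redis.example.com", 6379, prefix)))
```

Each printed line is meant to be run on the machine with the matching rank.
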
Example output (running on 4 machines with a 40GbE network):

``` text
elements min (us) p50 (us) p99 (us) max (us) samples
1 195 263 342 437 3921
2 195 261 346 462 4039
5 197 261 339 402 3963
10 197 263 338 398 3749
20 199 268 343 395 4146
50 200 265 344 401 3889
100 205 265 351 414 3645
200 197 264 328 387 3960
500 201 264 329 394 4274
1000 200 267 330 380 3344
2000 205 263 323 395 3682
5000 240 335 424 460 3277
10000 271 346 402 457 2721
20000 283 358 392 428 2719
50000 342 438 495 649 1654
100000 413 487 669 799 1687
200000 1113 1450 1837 2801 669
500000 1099 1294 1665 1959 560
1000000 1858 2286 2779 6100 320
2000000 3546 3993 4364 4886 252
5000000 10030 10608 11106 11628 92
```
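
To help read the columns above, here is a minimal sketch of how a min/p50/p99/max summary can be derived from a list of per-iteration latencies. It assumes the nearest-rank percentile definition; the benchmark tool may compute its percentiles differently, and the function names are hypothetical.

```python
from typing import Dict, List


def percentile(ordered: List[int], p: int) -> int:
    """Nearest-rank percentile of an ascending list (assumed definition)."""
    if not ordered:
        raise ValueError("need at least one sample")
    # Integer ceiling of p * n / 100, clamped to at least the first sample.
    k = max(1, (p * len(ordered) + 99) // 100)
    return ordered[k - 1]


def summarize(samples_us: List[int]) -> Dict[str, int]:
    """Summarize latencies (in microseconds) into the table's columns."""
    ordered = sorted(samples_us)
    return {
        "min": ordered[0],
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
        "samples": len(ordered),
    }


if __name__ == "__main__":
    print(summarize([263, 195, 342, 437, 261, 250]))
```
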

## License
Gloo is BSD-licensed. We also provide an additional patent grant.
2 changes: 0 additions & 2 deletions docs/overview.md

This file was deleted.

85 changes: 85 additions & 0 deletions docs/readme.md
@@ -0,0 +1,85 @@
# Gloo documentation

Documentation is split by domain. This file contains a general
overview of these domains and how they interact.

## Index

* [Overview](readme.md) -- this file

* [Transport details](transport.md) -- the transport API and its
implementations

* [CUDA integration](cuda.md) -- integration of CUDA aware Gloo
algorithms with existing CUDA code

* [Latency optimization](latency.md) -- series of tips and tricks to
improve algorithm latency

## Overview

Gloo algorithms are collective algorithms, meaning they can run in
parallel across two or more processes/machines. To be able to execute
across multiple machines, they first need to find each other. We call
this _rendezvous_ and it is the first thing to address when
integrating Gloo into your code base.

Once rendezvous completes, participating machines have set up
connections to one another, either in a full mesh (every machine has a
bidirectional communication channel to every other machine), or some
subset. The required connectivity between machines depends on the type
of algorithm that is used. For example, a ring algorithm only needs
communication channels to a machine's neighbors.
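
The difference in required connectivity can be made concrete with a small sketch. This is an illustration only, not Gloo code; `ring_neighbors` and `full_mesh_peers` are hypothetical names, and ranks are assumed to be 0-based.

```python
from typing import List, Tuple


def ring_neighbors(rank: int, size: int) -> Tuple[int, int]:
    """Return the (left, right) neighbor ranks of `rank` in a ring."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    # The ring wraps around: the last rank's right neighbor is rank 0.
    return (rank - 1) % size, (rank + 1) % size


def full_mesh_peers(rank: int, size: int) -> List[int]:
    """Return every other rank; a full mesh connects `rank` to all of them."""
    return [peer for peer in range(size) if peer != rank]


if __name__ == "__main__":
    size = 4
    for rank in range(size):
        print(rank, ring_neighbors(rank, size), full_mesh_peers(rank, size))
```

With many processes, each one holds two channels in the ring case versus `size - 1` in the full mesh case.
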

Every participating process knows about the number of participating
processes, and its _rank_ (or 0-based index) within the list of
participating processes. This state, as well as the state needed to
store the persistent communication channels, is stored in a
`gloo::Context` class. Gloo does not maintain global state or
thread-local state. This means that you can set up as many contexts as
needed, and introduce as much parallelism as needed by your
application.

## Rendezvous

The rendezvous process needs to happen exactly once per Gloo context.
It makes participating Gloo processes exchange details for setting up
their communication channels. For example, when the TCP transport is
used, processes exchange IP address and port number details of
listening sockets.

Rendezvous is abstracted as a key/value interface to a store that is
accessible by all participating processes. Every process is
responsible for setting a number of keys and will wait until their
peers have set their keys. The values stored against these keys hold
the information that is passed to the transport layer.

This interface is defined in [`store.h`](../gloo/rendezvous/store.h).
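
The set-then-wait pattern described above can be sketched as follows. This is a conceptual illustration using threads in a single process: `InMemoryStore`, `rendezvous`, the `rank/<n>` key layout, and the addresses are all assumptions and do not reproduce Gloo's classes or key names.

```python
import threading
from typing import Dict, List


class InMemoryStore:
    """Hypothetical key/value store shared by all participants."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._cond = threading.Condition()

    def set(self, key: str, value: bytes) -> None:
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()  # wake anyone waiting on this key

    def wait(self, keys: List[str], timeout: float = 5.0) -> None:
        with self._cond:
            ready = self._cond.wait_for(
                lambda: all(k in self._data for k in keys), timeout)
            if not ready:
                raise TimeoutError("peers did not publish their keys in time")

    def get(self, key: str) -> bytes:
        self.wait([key])
        with self._cond:
            return self._data[key]


def rendezvous(store: InMemoryStore, rank: int, size: int,
               address: str) -> Dict[int, str]:
    """Publish our address, then block until every peer has published theirs."""
    store.set(f"rank/{rank}", address.encode())
    peers = [r for r in range(size) if r != rank]
    store.wait([f"rank/{r}" for r in peers])
    return {r: store.get(f"rank/{r}").decode() for r in peers}


def run_all(size: int) -> Dict[int, Dict[int, str]]:
    """Run `size` participants as threads against one shared store."""
    store = InMemoryStore()
    results: Dict[int, Dict[int, str]] = {}

    def worker(rank: int) -> None:
        # Made-up listening addresses standing in for transport details.
        results[rank] = rendezvous(store, rank, size,
                                   f"10.0.0.{rank + 1}:{9000 + rank}")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_all(3))
```
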

### HashStore

The [HashStore](../gloo/rendezvous/hash_store.cc) is an in-process
implementation of this interface. In practice it is only useful for
integration tests.

### RedisStore

The [RedisStore](../gloo/rendezvous/redis_store.cc) implementation uses
the Hiredis library to set/get values against a Redis server. This
server needs to be accessible to all participating machines.

The keys used by the Redis implementation are visible to every
process using that server, so concurrent rendezvous runs would
otherwise collide. The
[PrefixStore](../gloo/rendezvous/prefix_store.cc) can be used to scope
rendezvous to a particular namespace.
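
The idea of scoping keys by prefix can be sketched as a thin wrapper around a shared store. The classes below are hypothetical stand-ins for illustration; they do not reproduce Gloo's PrefixStore, and the `/` separator is an assumption.

```python
from typing import Dict


class DictStore:
    """Hypothetical shared backend, standing in for a Redis server."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self.data[key] = value

    def get(self, key: str) -> bytes:
        return self.data[key]


class PrefixedStore:
    """Wrap a store so every key is placed under one namespace."""

    def __init__(self, prefix: str, store: DictStore) -> None:
        self._prefix = prefix
        self._store = store

    def _key(self, key: str) -> str:
        return f"{self._prefix}/{key}"

    def set(self, key: str, value: bytes) -> None:
        self._store.set(self._key(key), value)

    def get(self, key: str) -> bytes:
        return self._store.get(self._key(key))


if __name__ == "__main__":
    backend = DictStore()
    run_a = PrefixedStore("run-a", backend)
    run_b = PrefixedStore("run-b", backend)
    # Both runs use the same logical key without overwriting each other.
    run_a.set("rank/0", b"10.0.0.1:9000")
    run_b.set("rank/0", b"10.0.0.9:9000")
    print(run_a.get("rank/0"), run_b.get("rank/0"))
```
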

### ...

Any class that inherits from the `gloo::rendezvous::Store` abstract
base class can be used for rendezvous.

## Anything else?

If you find particular documentation is missing, please consider
[contributing](../CONTRIBUTING.md).
