More documentation

Summary: TSIA Reviewed By: andrewwdye Differential Revision: D4644734 fbshipit-source-id: 50f5fadd2c5cd04e06a025f5538187ed852e669a
lw · Mar 2, 2017 · 70fc15c · 70fc15c
1 parent 837023b
commit 70fc15c
Show file tree

Hide file tree

Showing 3 changed files with 153 additions and 13 deletions.
diff --git a/README.md b/README.md
@@ -8,20 +8,28 @@ IP can be used at all times, or InifiniBand (or RoCE) when available.
 
 Where applicable, algorithms have an implementation that works with
 system memory buffers, and one that works with NVIDIA GPU memory
-buffers. In the latter case, if the InfiniBand transport is used,
-GPUDirect can be used to accelerate cross machine GPU-to-GPU memory
-transfers.
+buffers. In the latter case, it is not necessary to copy memory between
+host and device; this is taken care of by the algorithm implementations.
 
 ## Requirements
-Gloo is built to run on Linux and has no hard dependencies other than libc.
+Gloo is built to run on Linux and has no hard dependencies other than libstdc++.
 
 Optional dependencies are:
-* cuda -- for CUDA algorithms, tests, and benchmark
-* googletest -- to build and run tests
-* eigen -- for fast floating point routines
-* hiredis -- for coordinating machine rendezvous through Redis
+* **[CUDA][cuda] and [NCCL][nccl]** -- for CUDA aware algorithms, tests, and benchmark
+* **[Google Test][gtest]** -- to build and run tests
+* **[Eigen][eigen]** -- for fast floating point routines
+* **[Hiredis][hiredis]** -- for coordinating machine rendezvous through Redis
 
-## Usage
+[cuda]: http://www.nvidia.com/object/cuda_home_new.html
+[nccl]: https://github.com/nvidia/nccl
+[gtest]: https://github.com/google/googletest
+[eigen]: http://eigen.tuxfamily.org
+[hiredis]: https://github.com/redis/hiredis
+
+## Documentation
+Please refer to [docs/](docs/) for detailed documentation.
+
+## Building
 You can build Gloo using CMake.
 
 Since it is a library, it is most convenient to vendor it in your own
@@ -45,8 +53,57 @@ cmake ../ -DBUILD_TEST=1 -DBUILD_BENCHMARK=1
 ls -l gloo/gloo_{test,benchmark}
 ```
 
-## Documentation
-Please refer to [docs/](docs/) for detailed documentation.
+## Benchmarking
+The benchmark tool depends on 1) Eigen for floating point math and 2)
+Redis/Hiredis for rendezvous. The benchmark tool for CUDA algorithms
+obviously also depends on both CUDA and NCCL.
+
+To run a benchmark:
+1. Copy the benchmark tool to all participating machines
+2. Start a Redis server on any host (either a client machine or one of
+   the machines participating in the test).
+3. Determine some unique ID for the benchmark run (e.g. the `uuid`
+   tool or some number).
+4. On each machine, run (or pass `--help` for more options):
+
+    ``` text
+    ./benchmark \
+      --size <number of machines> \
+      --rank <index of this machine, starting at 0> \
+      --redis-host <Redis host> \
+      --redis-port <Redis port> \
+      --prefix <unique identifier for this run> \
+      --transport tcp \
+      --elements <number of elements; -1 for a sweep> \
+      --iteration-time 1s \
+      allreduce_ring_chunked
+    ```
+Example output (running on 4 machines with a 40GbE network):
+
+``` text
+   elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
+          1        195        263        342        437       3921
+          2        195        261        346        462       4039
+          5        197        261        339        402       3963
+         10        197        263        338        398       3749
+         20        199        268        343        395       4146
+         50        200        265        344        401       3889
+        100        205        265        351        414       3645
+        200        197        264        328        387       3960
+        500        201        264        329        394       4274
+       1000        200        267        330        380       3344
+       2000        205        263        323        395       3682
+       5000        240        335        424        460       3277
+      10000        271        346        402        457       2721
+      20000        283        358        392        428       2719
+      50000        342        438        495        649       1654
+     100000        413        487        669        799       1687
+     200000       1113       1450       1837       2801        669
+     500000       1099       1294       1665       1959        560
+    1000000       1858       2286       2779       6100        320
+    2000000       3546       3993       4364       4886        252
+    5000000      10030      10608      11106      11628         92
+```
 
 ## License
 Gloo is BSD-licensed. We also provide an additional patent grant.
diff --git a/docs/overview.md b/docs/overview.md
diff --git a/docs/readme.md b/docs/readme.md
@@ -0,0 +1,85 @@
+# Gloo documentation
+
+Documentation is split by domain. This file contains a general
+overview of these domains and how they interact.
+
+## Index
+
+* [Overview](readme.md) -- this file
+
+* [Transport details](transport.md) -- the transport API and its
+  implementations
+
+* [CUDA integration](cuda.md) -- integration of CUDA aware Gloo
+  algorithms with existing CUDA code
+
+* [Latency optimization](latency.md) -- series of tips and tricks to
+  improve algorithm latency
+
+## Overview
+
+Gloo algorithms are collective algoritms, meaning they can run in
+parallel across two or more processes/machines. To be able to execute
+across multiple machines, they first need to find each other. We call
+this _rendezvous_ and it is the first thing to address when
+integrating Gloo into your code base.
+
+Once rendezvous completes, participating machines have setup
+connections to one another, either in a full mesh (every machine has a
+bidirectional communication channel to every other machine), or some
+subset. The required connectivity between machines depends on the type
+of algorithm that is used. For example, a ring algorithm only needs
+communication channels to a machine's neighbors.
+
+Every participating process knows about the number of participating
+processes, and its _rank_ (or 0-based index) within the list of
+participating processes. This state, as well as the state needed to
+store the persistent communication channels, is stored in a
+`gloo::Context` class. Gloo does not maintain global state or
+thread-local state. This means that you can setup as many contexts as
+needed, and introduce as much parallelism as needed by your
+application.
+
+## Rendezvous
+
+The rendezvous process needs to happen exactly once per Gloo context.
+It makes participating Gloo processes exchange details for setting up
+their communication channels. For example, when the TCP transport is
+used, processes exchange IP address and port number details of
+listening sockets.
+
+Rendezvous is abstracted as a key/value interface to a store that is
+accessible by all participating processes. Every process is
+responsible for setting a number of keys and will wait until their
+peers have set their keys. The values stored against these keys hold
+the information that is passed to the transport layer.
+
+This interface is defined in [`store.h`](../gloo/rendezvous/store.h).
+
+### HashStore
+
+The [HashStore](../gloo/rendezvous/hash_store.cc) is an in-process
+implementation of this interface. This is realistically not useful in
+any application but integration tests.
+
+### RedisStore
+
+The [RedisStore](../gloo/rendezvous/redis_store.cc) implementation uses
+the Hiredis library to set/get values against a Redis server. This
+server needs to be accessible to all participating machines.
+
+Since the keys used by the Redis implementation are accessible to any
+process using that server -- which would prevent usage for concurrent
+rendezvous executation -- the
+[PrefixStore](../gloo/rendezvous/prefix_store.cc) can be used to scope
+rendezvous to a particular namespace.
+
+### ...
+
+Any class that inherits from the `gloo::rendezvous::Store` abstract
+base class can be used for rendezvous.
+
+## Anything else?
+
+If you find particular documentation is missing, please consider
+[contributing](../CONTRIBUTING.md).