Distributed training experience

This page collects experience with distributed training in TensorFlow. It can build on distributed TensorFlow (TFDistributed.py in RETURNN, issue #296), on Horovod (see the RETURNN documentation about Horovod), or on a mixture of both. It can use either the new TF dataset pipeline (TFDataPipeline.py in RETURNN, issue #292) or the old data pipeline. Some of the existing implementations might also need to be extended; all discussions about extending the code should happen in the corresponding GitHub issues, or on Slack.
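For orientation, here is a rough sketch of what the Horovod route looks like in plain TensorFlow (TF1-style graph mode via tf.compat.v1). The toy model, loss, and hyperparameters are purely illustrative assumptions; RETURNN wires the equivalent pieces up internally.

```python
# Minimal sketch of data-parallel training with Horovod (allreduce over MPI/TCP),
# assuming one process per GPU. The model below is a toy placeholder, not
# RETURNN's actual network construction.
import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

tf.disable_eager_execution()
hvd.init()

# Pin each process to one GPU, selected by its local rank on the node.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model / loss, standing in for the real network.
x = tf.placeholder(tf.float32, shape=[None, 10])
w = tf.get_variable("w", shape=[10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Scale the learning rate with the number of workers and wrap the optimizer,
# so gradients are averaged across all processes via allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(10):
        sess.run(train_op, feed_dict={x: [[1.0] * 10]})
```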

We care about several settings:

  • single-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
  • multi-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS); see the distributed-TF sketch after this list
  • AWS settings
  • GCP settings (GPU or also TPU)
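For the multi-node multi-GPU case, the native distributed-TensorFlow route would roughly look like the following sketch using MultiWorkerMirroredStrategy. The hostnames, ports, and the small Keras model are illustrative assumptions only, not RETURNN's actual integration (that lives in TFDistributed.py, issue #296).

```python
# Minimal sketch of multi-worker data-parallel training with native
# distributed TensorFlow (collective allreduce over plain TCP).
# One such process runs per node (or per GPU, depending on the setup).
import json
import os

import tensorflow as tf

# Each worker gets the full cluster description plus its own task index.
# With consumer machines this would list the real hostnames; index is 1
# on the second node, etc.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1:12345", "node2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# The strategy sets up collective ops between all workers listed above.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here are mirrored across all GPUs on all workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")

# Every worker defines the same dataset; the strategy takes care of
# distributing/sharding the batches.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 10]), tf.random.normal([256, 1]))).batch(32)
model.fit(dataset, epochs=2)
```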