Distributed training experience

This page collects experience with distributed training in TensorFlow. It can build on distributed TensorFlow (TFDistributed.py in RETURNN, issue #296), on Horovod (see the RETURNN documentation about Horovod), or on a mixture of both. It can use either the new TF dataset pipeline (TFDataPipeline.py in RETURNN, issue #292) or the old data pipeline. Some of the existing implementations might also need to be extended; all discussions about extending the code should happen in the corresponding GitHub issues, or on Slack.
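For orientation, here is a rough sketch of what the Horovod route looks like in plain TensorFlow (TF1-style graph mode via tf.compat.v1). The toy model, loss, and hyperparameters are purely illustrative assumptions; RETURNN wires the equivalent pieces up internally.

```python
# Minimal sketch of data-parallel training with Horovod (allreduce over MPI/TCP),
# assuming one process per GPU. The model below is a toy placeholder, not
# RETURNN's actual network construction.
import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

tf.disable_eager_execution()
hvd.init()

# Pin each process to one GPU, selected by its local rank on the node.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model / loss, standing in for the real network.
x = tf.placeholder(tf.float32, shape=[None, 10])
w = tf.get_variable("w", shape=[10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Scale the learning rate with the number of workers and wrap the optimizer,
# so gradients are averaged across all processes via allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(10):
        sess.run(train_op, feed_dict={x: [[1.0] * 10]})
```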

We care about several settings:

  • single-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
  • multi-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS); see the distributed-TF sketch after this list
  • AWS settings
  • GCP settings (GPU or also TPU)
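For the multi-node multi-GPU case, the native distributed-TensorFlow route would roughly look like the following sketch using MultiWorkerMirroredStrategy. The hostnames, ports, and the small Keras model are illustrative assumptions only, not RETURNN's actual integration (that lives in TFDistributed.py, issue #296).

```python
# Minimal sketch of multi-worker data-parallel training with native
# distributed TensorFlow (collective allreduce over plain TCP).
# One such process runs per node (or per GPU, depending on the setup).
import json
import os

import tensorflow as tf

# Each worker gets the full cluster description plus its own task index.
# With consumer machines this would list the real hostnames; index is 1
# on the second node, etc.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1:12345", "node2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# The strategy sets up collective ops between all workers listed above.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here are mirrored across all GPUs on all workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")

# Every worker defines the same dataset; the strategy takes care of
# distributing/sharding the batches.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 10]), tf.random.normal([256, 1]))).batch(32)
model.fit(dataset, epochs=2)
```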