Training

This is a step-by-step guide for reproducing the training.

Datasets

There's a lot of data involved in fully training the three models. You'll want at least 500 GB of free space, and that's if you delete datasets after you've used them. With 1 TB you're fine.

Ideally, you want to keep all your datasets under the same directory. All preprocessing scripts will, by default, output the clean data to a new SV2TTS directory created in your datasets root directory. Inside it, a subdirectory will be created for each model: the encoder, the synthesizer and the vocoder.
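
For reference, that output directory will roughly look like this (a sketch; each subdirectory is filled in by the corresponding preprocessing scripts below):

<datasets_root>/SV2TTS/
    encoder/
    synthesizer/
    vocoder/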

You will need the following datasets:

For the encoder:

  • LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500)
  • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
  • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)

For the synthesizer and the vocoder:

  • LibriSpeech: train-clean-100, train-clean-360 (extract as LibriSpeech/train-clean-100 and LibriSpeech/train-clean-360)
  • LibriSpeech alignments: take the first link and merge the directory structure with the LibriSpeech datasets you have downloaded (do not include the alignments for datasets you haven't downloaded, or the scripts will think you have them)
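
Assuming the extraction paths above, your datasets root should end up looking roughly like this (a sketch showing only the datasets listed; the SV2TTS directory is created later by the preprocessing scripts):

<datasets_root>/
    LibriSpeech/
        train-clean-100/
        train-clean-360/
        train-other-500/
    VoxCeleb1/
        wav/
        vox1_meta.csv
    VoxCeleb2/
        dev/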

Feel free to adapt the code to your needs. Other interesting datasets that you could use include LibriTTS (which I started implementing but couldn't finish in time), VCTK (used in the SV2TTS paper) or M-AILABS.
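
As for merging the alignments mentioned in the list above: a minimal sketch is a recursive copy of only the subsets you downloaded, assuming the alignments archive extracts to a LibriSpeech-Alignments/LibriSpeech folder that mirrors the dataset structure (the folder name here is hypothetical; check what your archive actually contains):

# Hypothetical extraction folder name; adjust to the actual layout of the alignments archive.
# -n avoids overwriting any files already present in your LibriSpeech folders.
cp -rn LibriSpeech-Alignments/LibriSpeech/train-clean-100 <datasets_root>/LibriSpeech/
cp -rn LibriSpeech-Alignments/LibriSpeech/train-clean-360 <datasets_root>/LibriSpeech/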

Preprocessing and training

Here's the great thing about this repo: you're expected to run all Python scripts in alphabetical order. You likely started with the demo scripts; now you can run the remaining ones (pass -h to any script for argument info):

python encoder_preprocess.py <datasets_root>

For training, the encoder uses visdom. You can disable it with --no_visdom, but it's nice to have. Run "visdom" in a separate CLI/process to start your visdom server. Then run:

python encoder_train.py my_run <datasets_root>/SV2TTS/encoder
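
If you'd rather not run a visdom server, the same training command with visdom disabled is simply (using the --no_visdom flag mentioned above):

python encoder_train.py my_run <datasets_root>/SV2TTS/encoder --no_visdom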

Here's what the visdom environment looks like:

[Screenshot of the visdom training environment]

Then there are two separate scripts to generate the synthesizer's data. This split is convenient in case you retrain the encoder: you will then only have to regenerate the embeddings for the synthesizer.

Begin with the audio and the mel spectrograms:

python synthesizer_preprocess_audio.py <datasets_root>

Then the embeddings:

python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer

You can then train the synthesizer:

python synthesizer_train.py my_run <datasets_root>/SV2TTS/synthesizer

The synthesizer will output generated audio samples and spectrograms to its model directory during training.

Use the synthesizer to generate training data for the vocoder:

python vocoder_preprocess.py <datasets_root>

And finally, train the vocoder:

python vocoder_train.py my_run <datasets_root>

The vocoder also outputs ground-truth and generated audio samples to its model directory.
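
As a recap, the whole pipeline from preprocessing to a trained vocoder is just the commands above run in order (adjust run names and paths to your setup):

python encoder_preprocess.py <datasets_root>
python encoder_train.py my_run <datasets_root>/SV2TTS/encoder
python synthesizer_preprocess_audio.py <datasets_root>
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer
python synthesizer_train.py my_run <datasets_root>/SV2TTS/synthesizer
python vocoder_preprocess.py <datasets_root>
python vocoder_train.py my_run <datasets_root>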