
Lots of warnings when running prepare.sh #108

Open
gpawlowsky1979 opened this issue May 1, 2023 · 3 comments

@gpawlowsky1979

When running prepare.sh to prepare the LibriTTS dataset, I got lots of warnings like this:

2023-04-30 21:15:25,842 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-04-30 21:15:25,842 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-04-30 21:15:25,843 WARNING [words_mismatch.py:88] words count mismatch on 200.0% of the lines (2/1)
2023-04-30 21:15:25,843 WARNING [words_mismatch.py:88] words count mismatch on 200.0% of the lines (2/1)
2023-04-30 21:15:25,843 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-04-30 21:15:25,844 WARNING [words_mismatch.py:88] words count mismatch on 200.0% of the lines (2/1)
2023-04-30 21:15:25,844 WARNING [words_mismatch.py:88] words count mismatch on 300.0% of the lines (3/1)
2023-04-30 21:15:25,844 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-04-30 21:15:25,845 WARNING [words_mismatch.py:88] words count mismatch on 200.0% of the lines (2/1)
2023-04-30 21:15:25,845 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-04-30 21:15:25,845 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-04-30 21:15:25,845 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)

It looks like every single line in the dataset has this kind of problem, so I don't think it can safely be ignored. I got similar warnings later when using infer.py.
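For reference, the warnings appear to report the ratio of lines whose word count changed during phonemization. A minimal sketch of that kind of check (a hypothetical helper, not phonemizer's actual code) can help locate which utterances trip it, assuming the check compares whitespace-separated token counts before and after phonemization:

```python
def count_mismatches(pairs):
    """Return (percent, mismatched, total) over (text, phonemes) pairs
    whose whitespace-token counts differ. Hypothetical helper for
    locating problem utterances; not phonemizer's internal logic."""
    mismatched = sum(
        1 for text, phones in pairs
        if len(text.split()) != len(phones.split())
    )
    total = len(pairs)
    return 100.0 * mismatched / total, mismatched, total

# "it's fine" is 2 words but phonemizes here to 3 tokens -> mismatch
pairs = [("hello world", "həloʊ wɜːld"),
         ("it's fine", "ɪt ɪz faɪn")]
print(count_mismatches(pairs))  # → (50.0, 1, 2)
```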
Despite these errors, I was able to train the model, and after 60 epochs (20 AR + 40 NAR) it is capable of generating intelligible speech, but the output doesn't closely resemble the input voices. This might be due to underfitting, but I'm concerned it may also be related to the dataset warnings mentioned above. I also had to reduce the max-duration parameter a bit in order to run on a 16 GB GPU.
Here's my tensorboard image after 40 epochs on NAR:
[TensorBoard screenshot: NAR training curves]

Has anybody had luck getting speech generation that really resembles the input voices after training on LibriTTS?
Also, what's the difference between the vall-e and vall-f models? I haven't found much information about vall-f. Is it any better than vall-e?

@debasishaimonk

@gpawlowsky1979 Hi, what changes did you make in order to train it, i.e. which hyperparameters? And how many distinct speakers did you use to train it?

@gpawlowsky1979
Author

After 70 epochs the results are better, and the voices now more closely resemble the ones used as input. Perhaps I had unrealistic expectations about how good the generated voices would sound.
However, I'm still concerned about the warning messages during dataset preparation, even though I just used the default command:

bash prepare.sh --stage -1 --stop-stage 3

I think the results might have been better had the dataset been prepared properly, without so many word count mismatches.

I trained it on the libritts dataset. Here are the parameters I used:

## Train AR model
python3 bin/trainer.py --max-duration 50 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
      --num-buckets 6 --dtype "bfloat16" --save-every-n 10000 --valid-interval 20000 \
      --model-name valle --share-embedding true --norm-first true --add-prenet false \
      --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
      --base-lr 0.05 --warmup-steps 200 --average-period 0 \
      --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 4 \
      --exp-dir ${exp_dir} --tensorboard true

## Train NAR model
cp ${exp_dir}/best-valid-loss.pt ${exp_dir}/epoch-2.pt  # --start-epoch 3 resumes from epoch-2.pt (3 = 2 + 1)
python3 bin/trainer.py --max-duration 36 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
      --num-buckets 6 --dtype "float32" --save-every-n 10000 --valid-interval 20000 \
      --model-name valle --share-embedding true --norm-first true --add-prenet false \
      --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
      --base-lr 0.05 --warmup-steps 200 --average-period 0 \
      --num-epochs 40 --start-epoch 3 --start-batch 0 --accumulate-grad-steps 4 \
      --exp-dir ${exp_dir} --tensorboard true
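As a rough sanity check on the memory-related settings above (assuming, as in lhotse's dynamic bucketing sampler, that --max-duration caps the total seconds of audio per batch), the effective audio seen per optimizer step with gradient accumulation works out to:

```python
# Back-of-envelope arithmetic, assuming --max-duration is seconds of
# audio per batch and gradients are accumulated over
# --accumulate-grad-steps batches before each parameter update.
max_duration_ar = 50           # AR stage batch cap (seconds)
max_duration_nar = 36          # NAR stage batch cap (seconds)
accumulate_grad_steps = 4

audio_per_step_ar = max_duration_ar * accumulate_grad_steps    # 200 s
audio_per_step_nar = max_duration_nar * accumulate_grad_steps  # 144 s
```

Lowering --max-duration to fit a 16 GB GPU reduces peak memory but also shrinks the effective batch per step, which can slow convergence; that may be part of why more epochs helped.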

@salaxieb

salaxieb commented Jun 8, 2023

@gpawlowsky1979 I'm also training the model.
Could you please share your checkpoint, so I won't have to start from scratch?
