
After 100 epochs of training, the model can synthesize natural speech on LibriTTS #58

Open
dohe0342 opened this issue Mar 20, 2023 · 62 comments


@dohe0342

I trained VALL-E on LibriTTS for about 100 epochs (it took almost 4 days on 8 A100 GPUs) and obtained plausible synthesized audio.

Here is a demo.
[1]
prompt : prompt_link
synthesized audio : synt_link

[2]
prompt : prompt_link
ground truth : gt_link
synthesized audio : synt_link

[3]
prompt : prompt_link
synthesized audio : synt_link

[4]
prompt : prompt_link
ground truth : gt_link
synthesized audio : synt_link

The model I trained has worse quality than the original VALL-E because of the smaller training set. However, it shows promising quality on clean audio.
I'm not sure whether I'm allowed to share my pre-trained LibriTTS model. If I can, I would like to share it.

@hdmjdp

hdmjdp commented Mar 20, 2023

@dohe0342 Did you use prefix 0? And can you share your config?

@dohe0342
Author

dohe0342 commented Mar 20, 2023

@hdmjdp

I don't understand "prefix". What does it mean?

Here is the shell script I ran. I just changed "num-epochs", "max-duration", and "world-size".

./run.sh --stage 4 --stop-stage 4 --max-duration 50 --filter-max-duration 14 --num-decoder-layers 12 --world-size 8 --num-epochs 100

@hdmjdp

hdmjdp commented Mar 20, 2023

@dohe0342

It refers to the `--prefix-mode` option: "The mode for how to prefix VALL-E NAR Decoder. 0: no prefix, 1: 0 to random, 2: random to random."
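For intuition, here is a rough sketch of what those three modes could mean when choosing the acoustic-prompt span for the NAR decoder. This is only an illustrative reading of the quoted help text, not the repo's actual implementation:

```python
import random

def nar_prefix_range(num_frames: int, mode: int) -> tuple:
    """Pick a (start, end) frame span to use as the NAR prefix.
    Illustrative only; assumed semantics of the quoted help text."""
    if mode == 0:
        return (0, 0)                                    # no prefix
    if mode == 1:
        return (0, random.randint(1, num_frames - 1))    # 0 to random
    if mode == 2:
        start = random.randint(0, num_frames - 2)        # random to random
        return (start, random.randint(start + 1, num_frames - 1))
    raise ValueError(f"unknown prefix mode: {mode}")
```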

@hdmjdp

hdmjdp commented Mar 20, 2023

@dohe0342 can you share your tensorboard image?

@dohe0342
Author

dohe0342 commented Mar 20, 2023

@hdmjdp

I ran last week's version of VALL-E, which has no prefix option. And I found that prefix mode 0 is the same as last week's version.

Here is my tensorboard image. I actually ran 177 epochs, but the 100-epoch checkpoint was used to generate the audio samples.
[tensorboard image]

I'll upload the full tensorboard log soon. Please wait.

@dohe0342
Author

@hdmjdp

Here is my tensorboard log.
tensorboard

@thangnvkcn

thangnvkcn commented Mar 20, 2023

@dohe0342
Can you share the pre-trained LibriTTS model with me? If possible, please send it to me at thangmta30@gmail.com.

@hdmjdp

hdmjdp commented Mar 20, 2023

@hdmjdp Can you share the pre-trained LibriTTS model for me, if possible please send it to me at thangmta30@gmail.com

Not me.

@hdmjdp

hdmjdp commented Mar 20, 2023

@hdmjdp

Here is my tensorboard log. tensorboard

Thanks. Is the prompt speaker of your demo wavs in your training data?

@dohe0342
Author

@hdmjdp

The prompt speakers are from test-clean, not the training data.

@lifeiteng
Owner

@dohe0342 Thank you for sharing this.

@shanhaidexiamo

Is this based on the latest commit? Thanks

@dohe0342
Author

based on the latest commit? Thanks

It's based on last week's commit. Thank you.

@liuxun666

mark

@jieen1

jieen1 commented Mar 21, 2023

@dohe0342 Could you share this model with me? Here is my email: wangjiashejieen@gmail.com. Thanks.

@LorenzoBrugioni

LorenzoBrugioni commented Mar 21, 2023

Hey @dohe0342, great work! Do you think it would be possible to share the pre-trained model? 🙏🏻🙏🏻🙏🏻
Just in case, here's my email : lori.brugio@gmail.com

@UncleSens

Thank you for your contribution @dohe0342!
In case it's possible to share the model, would you please send it to me?
Here is my email: senqiu37@gmail.com

@Zhang-Xiaoyi

@dohe0342 Very nice results! Could you share your trained model if possible? My email is zhangxiaoyi1127@gmail.com

@lqj01

lqj01 commented Mar 23, 2023

@dohe0342 Very nice results! Could you share your trained model if possible? My email is liqianjin2018@gmail.com

@WendongGan

@dohe0342 I'm interested in your pre-trained model. Can you share it with me? Thank you! My email is: 15982350806@163.com

@yiwei0730

I'm very interested in your pre-trained model. The results are amazing. Could you share the pretrained model with me? I would really appreciate it. My email: yiwei110181@gmail.com

@hackerxiaobai

Very nice results! Could you share your trained model if possible? My email is wl_9322@163.com

@hardik7

hardik7 commented Mar 24, 2023

@dohe0342 Interesting results! Could you please try synthesising audio from a cartoon character's audio prompt, something like this: https://drive.google.com/file/d/11NDZzopniwIFJa8dr4hAKp2md8cxel4w/view?usp=sharing
Curious to know how VALL-E's output would sound with non-human-like voices.
Thanks!

@dohe0342
Author

@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai

Sorry for the late reply. This is the model I trained.
google drive link : link

Run inference with a command like this:
python bin/infer.py --output-dir ./ --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "KNOT one point one five miles per hour." --audio-prompts ./prompts/8463_294825_000043_000000.wav --text "To get up and running quickly just follow the steps below." --checkpoint exp/epoch-100.pt

@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model on LibriTTS, which consists of 550 hours of human audiobook speech, while the original VALL-E was trained on LibriLight, which has 60k hours of audio.

So my pre-trained model cannot synthesize cartoon audio, given the lack of a cartoon training set and the smaller amount of data.

@Zhang-Xiaoyi

@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai

Sorry for the late reply. This is the model I trained. google drive link : link

Run inference with a command like this: python bin/infer.py --output-dir ./ --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "KNOT one point one five miles per hour." --audio-prompts ./prompts/8463_294825_000043_000000.wav --text "To get up and running quickly just follow the steps below." --checkpoint exp/epoch-100.pt

@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model on LibriTTS, which consists of 550 hours of human audiobook speech, while the original VALL-E was trained on LibriLight, which has 60k hours of audio.

So my pre-trained model cannot synthesize cartoon audio, given the lack of a cartoon training set and the smaller amount of data.

Thanks for sharing. I have trained a model with the same config as yours. I just checked the checkpoint at epoch 30, and it produces quite good results. I will compare it with your epoch-100 checkpoint.

@hardik7

hardik7 commented Mar 28, 2023

@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai

Sorry for the late reply. This is the model I trained. google drive link : link

Run inference with a command like this: python bin/infer.py --output-dir ./ --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "KNOT one point one five miles per hour." --audio-prompts ./prompts/8463_294825_000043_000000.wav --text "To get up and running quickly just follow the steps below." --checkpoint exp/epoch-100.pt

@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model on LibriTTS, which consists of 550 hours of human audiobook speech, while the original VALL-E was trained on LibriLight, which has 60k hours of audio.

So my pre-trained model cannot synthesize cartoon audio, given the lack of a cartoon training set and the smaller amount of data.

Thank you @dohe0342. I will do some experiments with non-human voices and train my own model on a relevant dataset.

@OnceJune

OnceJune commented Mar 28, 2023

@dohe0342 Hi, have you evaluated the inference speed? What's the RTF when generating audio? And how is the correctness of the pronunciation?
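(For reference: RTF, the real-time factor, is synthesis wall-clock time divided by the duration of the audio produced; values below 1 mean faster than real time. A minimal sketch with illustrative numbers:)

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds

# illustrative numbers: 6 s of compute to generate 10 s of audio
rtf = real_time_factor(6.0, 10.0)  # 0.6, i.e. faster than real time
```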

@bprimal22

@dohe0342
Is it possible to train on top of your trained model?

@lifeiteng
Owner

lifeiteng commented Mar 31, 2023

@hdmjdp

I ran last week's version of VALL-E, which has no prefix option. And I found that prefix mode 0 is the same as last week's version.

Here is my tensorboard image. I actually ran 177 epochs, but the 100-epoch checkpoint was used to generate the audio samples. [tensorboard image]

I'll upload the full tensorboard log soon. Please wait.

@dohe0342
It should be --prefix-mode 1. Can you test --prefix-mode 1 on the stage branch? #59

@codehappy-net

I received similar errors and had to install earlier versions of Python and torch to resolve them. It's just Python ML dependency hell; breaking API changes occur from version to version for little real reason. The versions of the big dependencies installed in my working VALL-E conda environment are:

python: 3.8.0
torch: 1.13.1+cu116
numpy: 1.22.4

@no-Seaweed

I received similar errors and had to install earlier versions of Python and torch to resolve them. It's just Python ML dependency hell; breaking API changes occur from version to version for little real reason. The versions of the big dependencies installed in my working VALL-E conda environment are:

python: 3.8.0 torch: 1.13.1+cu116 numpy: 1.22.4

Do you mean you switched your dependencies to torch 1.13.1+cu116 etc., and the missing-key problem was solved?

@catalwaysright

I received similar errors and had to install earlier versions of Python and torch to resolve them. It's just Python ML dependency hell; breaking API changes occur from version to version for little real reason. The versions of the big dependencies installed in my working VALL-E conda environment are:
python: 3.8.0 torch: 1.13.1+cu116 numpy: 1.22.4

Do you mean you switched your dependencies to torch 1.13.1+cu116 etc., and the missing-key problem was solved?

I solved this error by switching the repo version to v0.1.0. Inference runs successfully, but the output is nothing but noise. Could you @dohe0342 please share the code snapshot you used to train this model? Thanks in advance.

@no-Seaweed

no-Seaweed commented Apr 23, 2023

I received similar errors and had to install earlier versions of Python and torch to resolve them. It's just Python ML dependency hell; breaking API changes occur from version to version for little real reason. The versions of the big dependencies installed in my working VALL-E conda environment are:
python: 3.8.0 torch: 1.13.1+cu116 numpy: 1.22.4

Do you mean you switched your dependencies to torch 1.13.1+cu116 etc., and the missing-key problem was solved?

I solved this error by switching the repo version to v0.1.0. Inference runs successfully, but the output is nothing but noise. Could you @dohe0342 please share the code snapshot you used to train this model? Thanks in advance.

Finally, I found the corresponding version of the code and was able to produce correct output. Please see b83653a.

@eschmidbauer

I cannot run inference with the pretrained model provided; I get the following error:

RuntimeError: Error(s) in loading state_dict for VALLE:
	Missing key(s) in state_dict: "ar_text_embedding.word_embeddings.weight", "nar_text_embedding.word_embeddings.weight", "ar_audio_embedding.word_embeddings.weight", "ar_text_position.alpha", "ar_audio_position.alpha", "ar_predict_layer.weight", "nar_audio_embeddings.0.word_embeddings.weight", "nar_audio_embeddings.1.word_embeddings.weight", "nar_audio_embeddings.2.word_embeddings.weight", "nar_audio_embeddings.3.word_embeddings.weight", "nar_audio_embeddings.4.word_embeddings.weight", "nar_audio_embeddings.5.word_embeddings.weight", "nar_audio_embeddings.6.word_embeddings.weight", "nar_audio_embeddings.7.word_embeddings.weight", "nar_text_position.alpha", "nar_audio_position.alpha", "nar_predict_layers.0.weight", "nar_predict_layers.1.weight", "nar_predict_layers.2.weight", "nar_predict_layers.3.weight", "nar_predict_layers.4.weight", "nar_predict_layers.5.weight", "nar_predict_layers.6.weight", "nar_stage_embeddings.0.word_embeddings.weight", "nar_stage_embeddings.1.word_embeddings.weight", "nar_stage_embeddings.2.word_embeddings.weight", "nar_stage_embeddings.3.word_embeddings.weight", "nar_stage_embeddings.4.word_embeddings.weight", "nar_stage_embeddings.5.word_embeddings.weight", "nar_stage_embeddings.6.word_embeddings.weight".
	Unexpected key(s) in state_dict: "text_embedding.word_embeddings.weight", "ar_embedding.word_embeddings.weight", "nar_embeddings.0.word_embeddings.weight", "nar_embeddings.1.word_embeddings.weight", "nar_embeddings.2.word_embeddings.weight", "nar_embeddings.3.word_embeddings.weight", "nar_embeddings.4.word_embeddings.weight", "nar_embeddings.5.word_embeddings.weight", "nar_embeddings.6.word_embeddings.weight", "nar_embeddings.7.word_embeddings.weight", "text_position.alpha", "audio_positions.0.alpha", "audio_positions.1.alpha", "audio_positions.2.alpha", "audio_positions.3.alpha", "audio_positions.4.alpha", "audio_positions.5.alpha", "audio_positions.6.alpha", "audio_positions.7.alpha", "stage_embeddings.0.word_embeddings.weight", "stage_embeddings.1.word_embeddings.weight", "stage_embeddings.2.word_embeddings.weight", "stage_embeddings.3.word_embeddings.weight", "stage_embeddings.4.word_embeddings.weight", "stage_embeddings.5.word_embeddings.weight", "stage_embeddings.6.word_embeddings.weight", "stage_embeddings.7.word_embeddings.weight", "predict_layers.0.weight", "predict_layers.1.weight", "predict_layers.2.weight", "predict_layers.3.weight", "predict_layers.4.weight", "predict_layers.5.weight", "predict_layers.6.weight", "predict_layers.7.weight".

@RuntimeRacer
Contributor

@eschmidbauer you'll need to checkout the commit referenced in the comment #58 (comment) right above yours.

@eschmidbauer

Thanks! I'll give that a try. I was actually able to continue training with the model, though.

@etwk

etwk commented May 3, 2023

Thanks for sharing the model.

I tried the checkpoint; it can generate a similar style of audio when the text prompt and the output text are both short. It produces the warning below when either the text prompt or the output text is longer:

WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1).

What's the proper way to generate longer audio?
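(One common workaround, not specific to this repo, is to split long text into sentences, synthesize each one with the same audio prompt, and concatenate the results. A sketch, where `synthesize` is a hypothetical stand-in for a call to bin/infer.py:)

```python
import re

def split_sentences(text: str) -> list:
    """Naive sentence splitter; a real pipeline would use a proper tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def synthesize_long(text: str, synthesize) -> list:
    """Synthesize each sentence separately and concatenate the waveforms.
    `synthesize` is a hypothetical function returning a list of samples."""
    waveform = []
    for sentence in split_sentences(text):
        waveform.extend(synthesize(sentence))
    return waveform
```

The trade-off is that sentence boundaries may sound slightly disjoint, since each chunk is synthesized independently.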

@debasishaimonk

@dohe0342 How many distinct speakers was the model you shared trained on?

@cantabile-kwok

Thanks for sharing the model.

I tried the checkpoint; it can generate a similar style of audio when the text prompt and the output text are both short. It produces the warning below when either the text prompt or the output text is longer:

WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1).

What's the proper way to generate longer audio?

Not only does this warning persist, but the generated audio also misses, babbles, or repeats many words when the text is relatively long. Is it because 100 epochs on LibriTTS is still not enough for the model to learn well?

@chenjiasheng
Collaborator

chenjiasheng commented May 23, 2023 via email

@cantabile-kwok

So, did you encounter this long-waveform issue as well, @chenjiasheng? My test sentences come from the LibriTTS test set, which should have the same length distribution as the training set. I found that, using the checkpoint released in this thread, the model can hardly generate 100% correct speech longer than around 12 seconds. That length would not usually be regarded as "very long". Does this mean that most of the data longer than 12 seconds was somehow dropped when training this model?
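(That hypothesis is at least consistent with the training command earlier in this thread, which used --filter-max-duration 14, so utterances longer than 14 s were filtered out. A back-of-envelope check, assuming EnCodec's 75 codec frames per second at 24 kHz:)

```python
FRAME_RATE = 75          # EnCodec frames per second at 24 kHz
MAX_TRAIN_SECONDS = 14   # --filter-max-duration 14 from the training command

# longest codec sequence the model would have seen during training
max_train_frames = FRAME_RATE * MAX_TRAIN_SECONDS  # 1050 frames
```

So anything much beyond ~1000 frames (about 14 s) lies outside the length distribution the model saw, which could explain the degradation past roughly 12 s.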

@lifeiteng
Owner

@cantabile-kwok more info about words count mismatch #5

@chenjiasheng
Collaborator

chenjiasheng commented May 30, 2023 via email

@temporaryharry

Could anyone send me the "unique_text_tokens.k2symbols" file? It is not in valle/egs/libritts/data/tokenized without running training. My email is grandmaskisses342@gmail.com

@sjoon2455

sjoon2455 commented Jun 20, 2023

Is that 100 epochs for each of the AR and NAR models? The code has changed since then, so I was wondering. :)
I reproduced the training, but the performance seems a bit different (and mine took about 1.5 days to train 100 epochs each, on 8 A100 GPUs!)

@JonathanColetti

@sjoon2455 can you share your tensorboard?

@KeiKinn

KeiKinn commented Jun 23, 2023

Many of us encountered the missing-keys problem when loading the pretrained model. If anyone wants to use the pretrained model provided by @dohe0342, the main trick is to check out the right commit (or any commit with the same VALL-E model definition) and then reinstall valle with
pip uninstall valle; pip install -e .
Otherwise, when initializing a new model, Python will use the valle package installed in the environment instead of the source code.

@raikarsagar

@dohe0342 Thanks for sharing the pretrained model trained for 100 epochs. When we say 100 epochs, is that 100 each for the AR and NAR models, or a combined number where we start with an AR model (probably 50 epochs)? Please clarify. I have trained a model for 100 epochs, but the quality isn't as good as what you shared at the beginning of this thread.

Thanks in advance
Sagar

@nathanodle

Have you or has anyone else done further training? Also, which LibriTTS subset (size) was it? Thanks!

@RoyandZoe

@dohe0342 I'm interested in your pre-trained model. Can you share it with me? Thank you! My email is: xlwj_sd@163.com

@Abdulk084

How does epoch-100.pt work with the inference code provided in this repo, given that separate ar.pt and nar.pt files are needed?
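(From the missing/unexpected-keys error earlier in this thread, the checkpoint appears to store both models in a single state_dict, with keys prefixed ar_ and nar_, rather than as separate ar.pt/nar.pt files. A sketch of how one might split it; `split_state_dict` is a hypothetical helper, with key names taken from that error message:)

```python
def split_state_dict(state_dict: dict) -> tuple:
    """Separate AR and NAR parameters by their key prefixes."""
    ar = {k: v for k, v in state_dict.items() if k.startswith("ar_")}
    nar = {k: v for k, v in state_dict.items() if k.startswith("nar_")}
    return ar, nar

# key names as they appear in the error message earlier in this thread
example = {
    "ar_text_embedding.word_embeddings.weight": None,
    "nar_text_position.alpha": None,
}
ar_part, nar_part = split_state_dict(example)
```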

@AI-ctrl

AI-ctrl commented Nov 3, 2023

@dohe0342 I'm interested in your pre-trained model. Can you share it with me? Thank you! My email is: ajeet9698@gmail.com

@AI-ctrl

AI-ctrl commented Nov 3, 2023

@dohe0342 I'm interested in your pre-trained model. Can you share it with me? Thank you! My email is: ajeet9698@gmail.com

And will it work if I want to train it on a specific set of voices, say 10 or 15 people's voices?

@liuyuhualilith

@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai

Sorry for the late reply. This is the model I trained. google drive link : link

Run inference with a command like this: python bin/infer.py --output-dir ./ --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "KNOT one point one five miles per hour." --audio-prompts ./prompts/8463_294825_000043_000000.wav --text "To get up and running quickly just follow the steps below." --checkpoint exp/epoch-100.pt

@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model on LibriTTS, which consists of 550 hours of human audiobook speech, while the original VALL-E was trained on LibriLight, which has 60k hours of audio.

So my pre-trained model cannot synthesize cartoon audio, given the lack of a cartoon training set and the smaller amount of data.

Hello! I am interested in your pre-trained model. The pre-trained weights you posted seem to be no longer available. Can you share your pre-trained model with me? Thank you!

@RafaelJCruz

@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai

Sorry for the late reply. This is the model I trained. google drive link : link

Run inference with a command like this: python bin/infer.py --output-dir ./ --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "KNOT one point one five miles per hour." --audio-prompts ./prompts/8463_294825_000043_000000.wav --text "To get up and running quickly just follow the steps below." --checkpoint exp/epoch-100.pt

@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model on LibriTTS, which consists of 550 hours of human audiobook speech, while the original VALL-E was trained on LibriLight, which has 60k hours of audio.

So my pre-trained model cannot synthesize cartoon audio, given the lack of a cartoon training set and the smaller amount of data.

Thanks for sharing; however, this Google Drive link has already expired. Could you upload a new version? Thanks a lot!

@cad-audio

@dohe0342,
Could you please share the pre-trained model for VALL-E? The Google Drive link has expired.
If possible, please also share the training script you used.

Thanks
