After 100 epochs training, the model can synthesize natural speech on LibriTTS #58
@dohe0342 prefix 0? and can you share your config? |
I can't understand prefix. What does it mean? Here is the shell script I ran. I just changed "num-epochs", "max-duration" and "world-size": ./run.sh --stage 4 --stop-stage 4 --max-duration 50 --filter-max-duration 14 --num-decoder-layers 12 --world-size 8 --num-epochs 100 |
It is the mode for how to prefix the VALL-E NAR Decoder: "0: no prefix, 1: 0 to random, 2: random to random." |
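For readers puzzled by the three modes, here is a rough Python sketch of how such a prefix selector might behave (purely illustrative; `pick_prefix_range` is a hypothetical name, not the repo's actual function):

```python
import random

def pick_prefix_range(num_frames, mode):
    """Pick a [start, end) acoustic-prompt prefix for NAR training.

    Hypothetical sketch of the three prefix modes discussed above;
    the real implementation lives in the VALL-E model code.
      mode 0: no prefix at all
      mode 1: prefix runs from frame 0 to a random endpoint
      mode 2: prefix is a random window [start, end) inside the utterance
    """
    if mode == 0:
        return 0, 0
    if mode == 1:
        end = random.randint(1, num_frames)
        return 0, end
    if mode == 2:
        start = random.randint(0, num_frames - 1)
        end = random.randint(start + 1, num_frames)
        return start, end
    raise ValueError(f"unknown prefix mode: {mode}")
```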
@dohe0342 can you share your tensorboard image? |
I ran the vall-e version from last week, which has no prefix option, and I found that prefix 0 is the same as that version. I actually ran 177 epochs, but the 100-epoch checkpoint was used to generate the audios. I'll upload my tensorboard image soon. Please wait. |
Here is my tensorboard log. |
@dohe0342 |
Not me |
Thanks. Is the prompt speaker of your demo wav in your training data? |
The prompt speakers are in test-clean, not the training data. |
@dohe0342 Thank you for sharing this. |
Is this based on the latest commit? Thanks |
mark |
@dohe0342 can you share this model for me? wangjiashejieen@gmail.com here is my email. Thanks. |
Hey @dohe0342 , great work! Would you think it could be possible to share the pre-trained model? 🙏🏻🙏🏻🙏🏻 |
Thank you for your contribution @dohe0342! |
@dohe0342 Very nice results! Can you share your trained model if it is possible? my email is zhangxiaoyi1127@gmail.com |
@dohe0342 Very nice results! Can you share your trained model if it is possible? my email is liqianjin2018@gmail.com |
@dohe0342 I'm interested in your pre-training model. Can you share your pre-training model with me? Thank you! my email is : 15982350806@163.com |
I'm so interested in your pre-training model. The result was amazing, can you share the pretrained model with me ? I would be very appreciated. my email : yiwei110181@gmail.com |
Very nice results! Can you share your trained model if it is possible? my email is wl_9322@163.com |
@dohe0342 Interesting results! Could you please try synthesising audio from a cartoon character's audio prompt, something like this: https://drive.google.com/file/d/11NDZzopniwIFJa8dr4hAKp2md8cxel4w/view?usp=sharing |
@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai Sorry for the late reply. This is the model that I trained. Infer with a command like this:
@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model on LibriTTS, which is composed of 550 hours of human audiobook speech, while the original VALL-E was trained on Libri-Light, which has 60k hours of audio. So my pre-trained model has little ability to synthesize cartoon audio, given the lack of cartoon data in the train set and the smaller overall dataset. |
Thanks for sharing. I have trained a model using the same config as yours. I just checked the ckpt at 30 epochs and it produces quite good results. I will compare it with your ckpt at 100 epochs. |
Thank you @dohe0342. Will do some experiments with non-human voices and will train my own model with the relevant dataset. |
@dohe0342 Hi, have you evaluated the inference speed? What's the RTF when generating audio? And how is the correctness of the pronunciation? |
@dohe0342 |
@dohe0342 |
I received similar errors and had to install earlier versions of Python and torch to resolve them. It's just Python ML dependency hell; breaking API changes occur from version to version for little real reason. The versions of the big dependencies installed in my working VALL-E conda environment are: python: 3.8.0 |
Do you mean you switched your previous dependencies to torch 1.13.1+cu116 etc., and the missing-keys problem was solved? |
I resolved this error by switching the repo version to v0.1.0. It can then infer successfully, but the output is nothing but noise. Could you @dohe0342 please share the code zip you used to train this model? Thanks in advance. |
Finally, I found the corresponding version of the code and was able to produce a correct output. Please see b83653a. |
I cannot run inference with the pretrained model provided; I get the following error:
|
@eschmidbauer you'll need to check out the commit referenced in the comment #58 (comment) right above yours. |
Thanks! I'll give that a try. I was actually able to continue training with the model, though.
Thanks for sharing the model. Tried the epoch-100 checkpoint; it can generate a similar style of audio when the text prompt and output text are both short. It produces the warning below when either the text prompt or the output text is longer:
WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1).
What's the proper way to generate longer audio? |
@dohe0342 On how many distinct speakers did you train the model that you shared? |
Not only does this warning persist, but the generated audio also misses, babbles, or repeats a lot of words if the text is relatively long. Is it because 100 epochs on LibriTTS are still not enough to learn well? |
I think the warning comes from the 3rd-party tokenizer, which is safe to ignore. @lifeiteng can you confirm this?
About the lacking performance on long text, I guess it is because most of the utterances in the train dataset are short, so the model has never seen a long text.
To alleviate it, I suggest you finetune the model on longer utterances, like Libri-Light. Looking forward to your experiment results if you have time to do it.
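The short-utterance hypothesis above is easy to check against the training manifest. A minimal sketch (hypothetical helper names, assuming you already have the utterance durations in seconds; the 14 s threshold matches the --filter-max-duration 14 used in the training run earlier in this thread):

```python
def duration_stats(durations_sec, threshold=14.0):
    """Return the fraction of utterances at or above `threshold` seconds.

    A quick sanity check of the claim that most training utterances
    are short; a value near 0 means the model rarely, if ever, saw
    text long enough to produce a long waveform.
    """
    n_long = sum(1 for d in durations_sec if d >= threshold)
    return n_long / len(durations_sec)
```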
So, did you encounter this long-waveform issue as well, @chenjiasheng? Actually my test sentences come from the LibriTTS test set, which is supposed to have the same length distribution as the train set. I found that, using the checkpoint released in this thread, the model can hardly generate 100% correct speech if it is longer than roughly 12 s. Technically that length would not usually be regarded as "very long". Does it mean that when training this model, most of the data longer than 12 s was dropped in some way? |
@cantabile-kwok more info about |
So sorry that I missed your reply; hope it is not too late.
By default, audios longer than 14 seconds are filtered out, because long audios are very RAM-inefficient.
You can try changing the argument named something like filter-max.
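A minimal sketch of that filtering behaviour (illustrative only; the repo actually works with lhotse CutSets internally, and the flag in the training command earlier in this thread is `--filter-max-duration`):

```python
def filter_cuts_by_duration(cuts, max_duration=14.0):
    """Drop utterances longer than `max_duration` seconds.

    Mirrors the default behaviour described above: long cuts blow up
    GPU/RAM usage during training, so they are removed from the train
    set rather than batched. `cuts` is assumed to be a list of dicts
    with a "duration" key (a stand-in for a real cut manifest).
    """
    return [c for c in cuts if c["duration"] <= max_duration]
```

This is also consistent with the observation above that the released checkpoint struggles past roughly 12 s: utterances beyond the threshold simply never reach the model.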
Could anyone send me the "unique_text_tokens.k2symbols" file? It is not present in valle\egs\libritts\data\tokenized without training. My email is grandmaskisses342@gmail.com |
Is it 100 epochs for the AR and NAR models each? The code has changed since then, so I was wondering :) |
@sjoon2455 can you share your tensorboard? |
Many of us encountered the missing-keys problem when loading the pretrained model. If anyone wants to use the pretrained model provided by @dohe0342, the main trick is that you should check out the right commit, or any commit with the same valle model definition, and then reinstall valle by |
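Before loading, you can check whether a checkpoint matches the current model definition by comparing the two sets of parameter names directly; a small sketch (hypothetical helper, not repo code):

```python
def report_key_mismatch(ckpt_keys, model_keys):
    """Compare checkpoint parameter names against the current model's.

    The "missing keys" errors in this thread come from loading the
    shared epoch-100.pt into a newer, incompatible model definition;
    checking out the matching commit makes the two key sets agree.
    In PyTorch terms, ckpt_keys would be state_dict().keys() of the
    loaded checkpoint and model_keys those of the freshly built model.
    """
    ckpt, model = set(ckpt_keys), set(model_keys)
    return {
        "missing": sorted(model - ckpt),     # in model, absent from ckpt
        "unexpected": sorted(ckpt - model),  # in ckpt, unknown to model
    }
```

If both lists come back empty, `load_state_dict` should succeed without complaint.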
@dohe0342 Thanks for sharing the pretrained model trained for 100 epochs. When we say 100 epochs, is it 100 each for the AR and NAR models, or a combined number where we start with an AR model (perhaps 50 epochs)? Please clarify. I have trained a model for 100 epochs, but the quality isn't as good as what you shared here at the beginning. Thanks in advance |
Have you or has anyone else done further training? Also, which LibriTTS subset (size) was it? Thanks! |
@dohe0342 I'm interested in your pre-training model. Can you share your pre-training model with me? Thank you! my email is : xlwj_sd@163.com |
How does epoch-100.pt work with the inference code provided in this repo? |
@dohe0342 I'm interested in your pre-training model. Can you share your pre-training model with me? Thank you! my email is : ajeet9698@gmail.com |
And will it work if I want to train it on a specific set of voices, say from 10 or 15 speakers? |
Hello! I am interested in your pre-trained model. The pre-trained weights you posted seem to be invalid. Can you share your pre-trained model with me? Thank you! |
Thanks for sharing; however, this Google link has already expired. Could you upload a new version? Thanks a lot! |
@dohe0342 , Thanks |
I trained vall-e on LibriTTS for about 100 epochs (it took almost 4 days on 8 A100 GPUs) and obtained plausible synthesized audio.
Here is a demo.
[1]
prompt : prompt_link
synthesized audio : synt_link
[2]
prompt : prompt_link
ground truth : gt_link
synthesized audio : synt_link
[3]
prompt : prompt_link
synthesized audio : synt_link
[4]
prompt : prompt_link
ground truth : gt_link
synthesized audio : synt_link
The model I trained has worse quality than the original VALL-E because of the dataset size. However, it has promising quality on clean audio.
I'm not sure whether I can share my pre-trained LibriTTS model. If I can, I would like to share it.