
Trained model does not generate any speech at all #3

Open
Many0therFunctions opened this issue Jun 1, 2024 · 22 comments

Comments

@Many0therFunctions

All trained HiFi-GAN models come out sounding like this. The output is just straight mel-spectrogram bands.

[image: spectral view of the generated audio]

https://vocaroo.com/18YQzfRyOJMV

@tuanh123789
Owner

The output of HiFi-GAN is a waveform.

@Many0therFunctions
Author

I know. I was analyzing the wav files in a spectral view to check. Something is very wrong there...

@Abhinay1997

Same issue here.

Training run spectrograms and samples from trained weights attached:
[attached images: speech_comparison, fake]

https://voca.ro/19j9UzebhNzw

@tuanh123789
Owner

Please describe your training dataset

@Abhinay1997

Abhinay1997 commented Jun 14, 2024

I'm using ~1200 Hindi (language_code = "hi") speech samples from the google/fleurs dataset. The language is already part of the original XTTSv2 checkpoint.

I generated latents using generate_latents.py, and the samples in /synthesis are decent; I can understand them. Training loss went down, came back up within about 20 epochs, and stayed there for the rest of the training.

The same static, noisy audio results from the 20th epoch onwards.
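
For reference, a minimal sketch of how the samples can be pulled with the HuggingFace datasets library (the "hi_in" config name is my assumption from the FLEURS naming scheme; the split and field names follow the dataset card):

import torch  # not required for loading; just the stack used elsewhere in this repo
from datasets import load_dataset

# Hindi split of FLEURS; FLEURS audio is 16 kHz mono
fleurs_hi = load_dataset("google/fleurs", "hi_in", split="train")
sample = fleurs_hi[0]
waveform = sample["audio"]["array"]   # float waveform array
text = sample["transcription"]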

@tuanh123789
Owner

tuanh123789 commented Jun 14, 2024

In my experiments, if it's not English, the amount of data needed to fine-tune is quite large. I think 1200 samples is a bit small. With Vietnamese, I had to use nearly 100 hours to get good results.

@Abhinay1997

I see, that's interesting. I assumed it wouldn't require much data because the HiFi-GAN decoder from the original checkpoint was already trained on it. I'll experiment and let you know! Thank you :)

@Many0therFunctions
Author

Many0therFunctions commented Jun 14, 2024

...If I didn't know better, I'd almost think it would be a better use of time and resources to train a model that converts the GPT latents to their corresponding EnCodec tokens, like Bark does, and then feed those into an EnCodec-based vocoder...

(Mumbles about just wanting a simple fix so XTTSv2 can handle screams and yells, but it seems we have to do the roundabout thing, as with Bark, of redefining which token maps to which sound...)

@Abhinay1997

That's interesting! I added multiple new languages to XTTS with decent speech output, using a couple of hundred hours of speech for each language, at the cost of some performance on the original languages.

I didn't think it'd be that complex to add screams/yells and other custom sounds.

The reason I'm looking to train HiFi-GAN is to get human-quality audio, and I'm not sure where to go from here if this fails. Audio super-resolution techniques have all failed for me.

@Many0therFunctions
Author

I'm really disappointed because in Bark it was child's play to get such things. It's just that Bark was 1. WAY too slow, and 2. too unpredictable, which wouldn't be such an issue if it weren't so painfully SLOW.

@Abhinay1997

Yes, the Bark samples on their demo page seem to be cherry-picked. The inconsistency in audio generation doesn't work for my use case. Agreed, it's very, very slow.

@Abhinay1997

Just an FYI: @tuanh123789 is likely using his own trained XTTS checkpoint, so the model state-dict keys are prefixed with xtts, like xtts.hifigan_decoder.waveform_decoder. However, the checkpoint released by Coqui on Hugging Face has no xtts in the key names. I had to make a few more changes to load the original checkpoint correctly.
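
A minimal sketch of the remapping I mean (the nesting of the weights under a "model" key is an assumption about the checkpoint layout; the path is illustrative):

import torch

ckpt = torch.load("model.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # fall back if the weights are stored flat
# add the "xtts." prefix that the repo's training code expects
remapped = {f"xtts.{k}": v for k, v in state_dict.items()}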

However, after all this, the same issue happened. Training loss starts around 110, drops to approximately 70, climbs back up to 78-81, and stabilises there. Eval audio degrades from XTTS-level speech at iteration 0 to static noise, then recovers to barely legible speech with loud static noise.

Using 300 hours of English samples.

@hscspring

@Abhinay1997 Hi, I got exactly the same issue as you.
I'm training two languages (English and my own language) together, and the spectrograms look the same as yours.
But the result is not good, especially for my language, which seems to have failed; English seems fine.
200+ hours of data.

@Abhinay1997

@hscspring, did you modify train.py to load the Hugging Face XTTS checkpoint, or are you using your own trained checkpoint? I'm trying to see what the issue could be.

@hscspring

@Abhinay1997 Actually, you can use either the original or your fine-tuned checkpoint, because the HiFi-GAN weights are the same.
Maybe you can also fine-tune the speaker encoder together with it.

By the way, I have the same issue you've met, and I still don't know why.

@Abhinay1997

Abhinay1997 commented Jun 28, 2024

@hscspring, true, but while the state_dict has the same values, the keys are different. You can test this by trying to load the Hugging Face checkpoint with strict=True here (a sketch is below). So if you use the original checkpoint unmodified, you will actually be training the HiFi-GAN from scratch.
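
Something like this, as a runnable sketch (the model and state dict here are placeholders, not the real decoder or weights):

import torch

model = torch.nn.Linear(2, 2)                     # placeholder for the decoder
hf_state_dict = {"wrong_key": torch.zeros(2, 2)}  # placeholder for the coqui weights
try:
    model.load_state_dict(hf_state_dict, strict=True)
except RuntimeError as e:
    print(e)  # PyTorch lists every missing/unexpected key, exposing the prefix mismatch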

As for the other issue, I'm still checking. I'll do a training run over the weekend.

@hscspring

@Abhinay1997 I modified the code. strict=True is always a good habit.
I just found another issue (it's my own problem, I modified the architecture of XTTS).
Now waiting for the new result~

@Many0therFunctions
Author

Many0therFunctions commented Jun 28, 2024

I have a strong suspicion why: this really may be impossible without the official discriminator network, and there's no way to regenerate the discriminator from only the generator weights... I don't know.

Good catch there, though. I definitely overlooked that.

(My training dataset here IS English, so fine-tuning should have been trivial, and yet it behaves more like training completely from scratch, which I definitely don't have the compute resources for. I really hope this training isn't supposed to be some Monte Carlo statistical reverse-engineering of a discriminator network, because that WILL require VAST amounts of compute and storage to be robust, especially with some of the more effective optimizers.)

@hscspring

hscspring commented Jun 29, 2024

modify this in hifigan_decoder.py:

-        resblock_type,
-        resblock_dilation_sizes,
-        resblock_kernel_sizes,
-        upsample_kernel_sizes,
-        upsample_initial_channel,
-        upsample_factors,
-        inference_padding=5,
-        cond_channels=0,
-        conv_pre_weight_norm=True,
-        conv_post_weight_norm=True,
-        conv_post_bias=True,
-        cond_in_each_up_layer=False,
+        resblock_type, # "1"
+        resblock_dilation_sizes, # [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+        resblock_kernel_sizes, # [3, 7, 11]
+        upsample_kernel_sizes, # [16, 16, 4, 4]
+        upsample_initial_channel, # 512
+        upsample_factors, # [8, 8, 2, 2]
+        inference_padding=0,
+        cond_channels=512,
+        conv_pre_weight_norm=False,
+        conv_post_weight_norm=False,
+        conv_post_bias=False,
+        cond_in_each_up_layer=True,

and unsqueeze z in gpt_gan.py:

             z = batch["speaker_embedding"]
+        z = z.unsqueeze(-1)
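
If I read the decoder right, the unsqueeze is needed because the speaker embedding comes out of the batch as [B, 512], while the Conv1d conditioning layer (cond_channels=512) consumes [B, 512, 1]. A quick shape check:

import torch

z = torch.randn(4, 512)   # speaker embedding from the batch: [B, 512]
z = z.unsqueeze(-1)       # -> [B, 512, 1]
print(z.shape)            # torch.Size([4, 512, 1])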


@Abhinay1997

@hscspring Thank you for confirming these! I made the same changes in hifigan_decoder.py but wasn't sure if I had messed something up. I'll have to compare the change in gpt_gan.py, as I remember using a transpose to pass the batch.

@Abhinay1997

@Many0therFunctions that's a very valid point. It's probably also why it requires so much data to train in the first place.

@hscspring

hscspring commented Jun 29, 2024

Maybe keep

+        conv_pre_weight_norm=True,
+        conv_post_weight_norm=True,
+        conv_post_bias=True,

when training (i.e., do not remove_parametrizations).
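
A sketch of that intent, assuming the weight norm is applied through torch parametrizations: keep the parametrizations while training and strip them only for inference/export (strip_weight_norm is a hypothetical helper, not from this repo):

import torch
from torch.nn.utils import parametrize

def strip_weight_norm(model: torch.nn.Module) -> None:
    # hypothetical helper: remove weight-norm parametrizations before export
    for module in model.modules():
        if parametrize.is_parametrized(module, "weight"):
            parametrize.remove_parametrizations(module, "weight")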
