
Trained model does not generate any speech at all #3

Open
Many0therFunctions opened this issue Jun 1, 2024 · 22 comments

Comments

@Many0therFunctions

All trained HiFi-GAN models come out sounding like this. The output is just straight mel-spectrogram bands.

[image: spectral view of the generated audio]

https://vocaroo.com/18YQzfRyOJMV

@tuanh123789
Owner

The output of HiFi-GAN is a waveform.

@Many0therFunctions
Author

I know. I was analyzing the wav files in a spectral view to check. Something is very wrong there...

@Abhinay1997

Same issue here.

Training run spectrograms and samples from trained weights attached:
[attached images: speech_comparison, fake]

https://voca.ro/19j9UzebhNzw

@tuanh123789
Owner

Please describe your training dataset

@Abhinay1997

Abhinay1997 commented Jun 14, 2024

I'm using ~1200 Hindi (language_code = "hi") speech samples from the google/fleurs dataset. The language is already part of the original XTTSv2 checkpoint.

I generated latents using generate_latents.py, and the samples in /synthesis are decent; I can understand them. Training loss went down, came back up within about 20 epochs, and stayed there for the rest of the training.

The same static, noisy audio results from the 20th epoch onwards.
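
For reference, a minimal sketch of how the samples can be pulled with the HuggingFace datasets library (the "hi_in" config name is my assumption from the FLEURS naming scheme; the split and field names follow the dataset card):

import torch  # not required for loading; just the stack used elsewhere in this repo
from datasets import load_dataset

# Hindi split of FLEURS; FLEURS audio is 16 kHz mono
fleurs_hi = load_dataset("google/fleurs", "hi_in", split="train")
sample = fleurs_hi[0]
waveform = sample["audio"]["array"]   # float waveform array
text = sample["transcription"]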

@tuanh123789
Owner

tuanh123789 commented Jun 14, 2024

In my experiments, if it's not English, the amount of data needed to fine-tune is quite large. I think 1200 samples is a bit small. With Vietnamese, I had to use nearly 100 hours to get good results.

@Abhinay1997

I see, that's interesting. I assumed it wouldn't require much data because the HiFi-GAN decoder from the original checkpoint was already trained on it. I'll experiment and let you know! Thank you :)

@Many0therFunctions
Author

Many0therFunctions commented Jun 14, 2024

...If I didn't know better, I'd almost think it would be a better use of time and resources to train a model that converts the GPT latents to their corresponding EnCodec tokens, like Bark does, and then feed those into an EnCodec-based vocoder...

(Mumbles about just wanting a simple fix so XTTSv2 can handle screams and yells, but it seems we have to do the roundabout thing, as with Bark, of redefining which token maps to which sound...)

@Abhinay1997

That's interesting! I added multiple new languages to XTTS with decent speech output, using a couple of hundred hours of speech for each language, at the cost of some performance on the original languages.

I didn't think it'd be that complex to add screams/yells and other custom sounds.

The reason I'm looking to train HiFi-GAN is to get human-quality audio, and I'm not sure where to go from here if this fails. Audio super-resolution techniques have all failed for me.

@Many0therFunctions
Author

I'm really disappointed because in Bark it was child's play to get such things. It's just that Bark was 1. WAY too slow, and 2. too unpredictable, which wouldn't be such an issue if it weren't so painfully SLOW.

@Abhinay1997

Yes, the Bark samples on their demo page seem to be cherry-picked. The inconsistency in audio generation doesn't work for my use case. Agreed, it's very, very slow.

@Abhinay1997

Just an FYI: @tuanh123789 is likely using his own trained XTTS checkpoint, so the model state-dict keys are prefixed with xtts, like xtts.hifigan_decoder.waveform_decoder. However, the checkpoint released by Coqui on Hugging Face has no xtts in the key names. I had to make a few more changes to load the original checkpoint correctly.
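
A minimal sketch of the remapping I mean (the nesting of the weights under a "model" key is an assumption about the checkpoint layout; the path is illustrative):

import torch

ckpt = torch.load("model.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # fall back if the weights are stored flat
# add the "xtts." prefix that the repo's training code expects
remapped = {f"xtts.{k}": v for k, v in state_dict.items()}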

However, after all this, the same issue happened. Training loss starts around 110, drops to approximately 70, climbs back up to 78-81, and stabilises there. Eval audio degrades from XTTS-level speech at iteration 0 to static noise, then recovers to barely legible speech with loud static noise.

Using 300 hours of English samples.

@hscspring

@Abhinay1997 Hi, I got exactly the same issue as you.
I'm training two languages (English and my own language) together, and the spectrograms look the same as yours.
But the result is not good, especially for my language, which seems to have failed; English seems fine.
200+ hours of data.

@Abhinay1997

@hscspring, did you modify train.py to load the Hugging Face XTTS checkpoint, or are you using your own trained checkpoint? I'm trying to see what the issue could be.

@hscspring

@Abhinay1997 Actually, you can use either the original or your fine-tuned checkpoint, because the HiFi-GAN weights are the same.
Maybe you can also fine-tune the speaker encoder together with it.

By the way, I have the same issue you've met, and I still don't know why.

@Abhinay1997

Abhinay1997 commented Jun 28, 2024

@hscspring, true, but while the state_dict has the same values, the keys are different. You can test this by trying to load the Hugging Face checkpoint with strict=True here (a sketch is below). So if you use the original checkpoint unmodified, you will actually be training the HiFi-GAN from scratch.
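
Something like this, as a runnable sketch (the model and state dict here are placeholders, not the real decoder or weights):

import torch

model = torch.nn.Linear(2, 2)                     # placeholder for the decoder
hf_state_dict = {"wrong_key": torch.zeros(2, 2)}  # placeholder for the coqui weights
try:
    model.load_state_dict(hf_state_dict, strict=True)
except RuntimeError as e:
    print(e)  # PyTorch lists every missing/unexpected key, exposing the prefix mismatch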

As for the other issue, I'm still checking. I'll do a training run over the weekend.

@hscspring

@Abhinay1997 I modified the code. strict=True is always a good habit.
I just found another issue (it's my own problem, I modified the architecture of XTTS).
Now waiting for the new result~

@Many0therFunctions
Author

Many0therFunctions commented Jun 28, 2024

I have a strong suspicion why: this really may be impossible without the official discriminator network, and there's no way to regenerate the discriminator from only the generator weights... I don't know.

Good catch there, though. I definitely overlooked that.

(My training dataset here IS English, so fine-tuning should have been trivial, and yet it behaves more like training completely from scratch, which I definitely don't have the compute resources for. I really hope this training isn't supposed to be some Monte Carlo statistical reverse-engineering of a discriminator network, because that WILL require VAST amounts of compute and storage to be robust, especially with some of the more effective optimizers.)

@hscspring

hscspring commented Jun 29, 2024

modify this in hifigan_decoder.py:

-        resblock_type,
-        resblock_dilation_sizes,
-        resblock_kernel_sizes,
-        upsample_kernel_sizes,
-        upsample_initial_channel,
-        upsample_factors,
-        inference_padding=5,
-        cond_channels=0,
-        conv_pre_weight_norm=True,
-        conv_post_weight_norm=True,
-        conv_post_bias=True,
-        cond_in_each_up_layer=False,
+        resblock_type, # "1"
+        resblock_dilation_sizes, # [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+        resblock_kernel_sizes, # [3, 7, 11]
+        upsample_kernel_sizes, # [16, 16, 4, 4]
+        upsample_initial_channel, # 512
+        upsample_factors, # [8, 8, 2, 2]
+        inference_padding=0,
+        cond_channels=512,
+        conv_pre_weight_norm=False,
+        conv_post_weight_norm=False,
+        conv_post_bias=False,
+        cond_in_each_up_layer=True,

and unsqueeze z in gpt_gan.py:

             z = batch["speaker_embedding"]
+        z = z.unsqueeze(-1)
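
If I read the decoder right, the unsqueeze is needed because the speaker embedding comes out of the batch as [B, 512], while the Conv1d conditioning layer (cond_channels=512) consumes [B, 512, 1]. A quick shape check:

import torch

z = torch.randn(4, 512)   # speaker embedding from the batch: [B, 512]
z = z.unsqueeze(-1)       # -> [B, 512, 1]
print(z.shape)            # torch.Size([4, 512, 1])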


@Abhinay1997

@hscspring Thank you for confirming these! I made the same changes in hifigan_decoder.py but wasn't sure if I had messed something up. I'll have to compare the change in gpt_gan.py, as I remember using a transpose to pass the batch.

@Abhinay1997

@Many0therFunctions that's a very valid point. It's probably also why it requires so much data to train in the first place.

@hscspring

hscspring commented Jun 29, 2024

Maybe keep

+        conv_pre_weight_norm=True,
+        conv_post_weight_norm=True,
+        conv_post_bias=True,

when training (i.e., do not remove_parametrizations).
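
A sketch of that intent, assuming the weight norm is applied through torch parametrizations: keep the parametrizations while training and strip them only for inference/export (strip_weight_norm is a hypothetical helper, not from this repo):

import torch
from torch.nn.utils import parametrize

def strip_weight_norm(model: torch.nn.Module) -> None:
    # hypothetical helper: remove weight-norm parametrizations before export
    for module in model.modules():
        if parametrize.is_parametrized(module, "weight"):
            parametrize.remove_parametrizations(module, "weight")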
