
train loss of custom data #133

Open · Wangzhen-kris opened this issue Jun 12, 2023 · 2 comments

@Wangzhen-kris

Hi,

I tried to train on my dataset, but I seem to have an abnormal loss curve. Do you have any suggestions?
Thanks.

The loss of AR: https://drive.google.com/file/d/1-gZJX-mwYZ-2vkKTl0dTwBcp1A8MHrmV/view?usp=drive_link
The loss of NAR: https://drive.google.com/file/d/1-9L_AQZyyAgDRqKPpx06w6M99ZPSUIhe/view?usp=drive_link

@RuntimeRacer (Contributor) commented Jun 21, 2023

Hi @Wangzhen-kris, what kind of data does your dataset consist of? Does it by any chance contain very diverse speakers, or even multiple languages? Also, is it organized into separate cut sets that were combined for training?

While trying to train on Mozilla Common Voice I ran into similar graphs. I found out that using the Lhotse dynamic samplers leads to a static CutSet order, which means language C always gets trained after B, which in turn is trained after A.
This also leads to the model biasing heavily towards the CutSet it was trained on last. For example, all my inference tests at the end of one epoch had a French dialect.
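
To illustrate what I mean (a minimal sketch of my own, not code from this repo; the manifest names are hypothetical): concatenating per-language CutSets preserves file order, so a sampler that streams the combined manifest sees the languages in fixed blocks:

```python
# Hypothetical manifest names; combining per-language CutSets
# concatenates them, so the combined manifest keeps block order.
from lhotse import CutSet, combine

cuts_en = CutSet.from_file("cuts_en.jsonl.gz")
cuts_de = CutSet.from_file("cuts_de.jsonl.gz")
cuts_fr = CutSet.from_file("cuts_fr.jsonl.gz")

# The result iterates en -> de -> fr. A dynamic sampler's shuffle buffer
# only reorders cuts locally, so the language blocks survive largely
# intact and the last block dominates the end of every epoch.
combined = combine(cuts_en, cuts_de, cuts_fr)
```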

I worked out a solution for this by randomizing the CutSet contents before training. It is quite memory-intensive on a large dataset (~60 GB needed for almost the complete Common Voice 13) and also quite slow, since it's a single-threaded process; it takes about 10 minutes on my AI server.
I still want to improve this a bit, for example by having it reshuffle after each epoch (currently it runs once at training start, and only if there is no randomized file already). But you could have a look at my branch; maybe it's helpful for you:

main...RuntimeRacer:vall-e:cuts_randomizer
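
For reference, here is a minimal sketch of the idea (my own simplification, not the code in that branch; the file names are placeholders). It loads the combined manifest eagerly, shuffles it globally once, and writes it back out before training, which is exactly what makes it memory-hungry on a large corpus:

```python
import random

from lhotse import CutSet

SHUFFLED_PATH = "cuts_train_shuffled.jsonl.gz"  # placeholder output path

# to_eager() pulls the whole manifest into RAM, which is what costs
# tens of GB on a Common Voice-sized corpus.
cuts = CutSet.from_file("cuts_train.jsonl.gz").to_eager()

# shuffle() does a global, in-memory shuffle; a fixed seed makes the
# resulting order reproducible across runs.
cuts = cuts.shuffle(rng=random.Random(42))
cuts.to_file(SHUFFLED_PATH)

# Point the trainer at SHUFFLED_PATH instead of the original manifest.
```

Re-running this with a different seed before each epoch would approximate the per-epoch reshuffle mentioned above.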

I also attached a screenshot showing how this stabilized my training; the arrows point to where this was applied, after 2 epochs without this pre-processing:

[screenshot: training loss curves stabilizing after CutSet randomization was applied]

@lifeiteng pinned this issue Sep 14, 2023
@MajoRoth

I'm facing the same issue and trying to debug. What causes the Lhotse dynamic samplers to load in order?
I'm shuffling the files in the tokenization step and using shuffle=True,
but I'm still getting weird loss graphs that indicate something is wrong:

[screenshots: two loss graphs showing the recurring pattern]

This pattern occurs every epoch...
Any clues?
