[Bug] Saving checkpoints on a steps basis instead of on an epoch basis creates an infinite loop #30
Comments
Hi @pjox, I can't reproduce this with e.g. `zeldarose-transformer --cache-dir local/cache --pretrained-model "lgrobol/roberta-minuscule" --step-save-period 10000 "lgrobol/openminuscule:text:train" --out-dir local/model`. Can you provide a MWE?
Let me construct a working example; in the meantime, here is a list of all the packages in my environment after a fresh install (branch
Ping @pjox, can you retry on the main branch? I've just pushed a ton of modifications and I'm curious whether this still happens.
I'll try next week and let you know, thanks a lot! 😄
OK, I can actually reproduce the issue (the context that triggers it is unclear). I think the issue is upstream, with lightning failing to set
When using a corpus that is more than 20 GB, it can make more sense to save checkpoints on a step basis instead of on an epoch basis. zeldarose already provides an option to do this. However, using the option `--step-save-period 10000` produces what seems to be an infinite loop on the 0th step, where a checkpoint gets continually saved.

Thanks in advance for the help 😄
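For illustration, here is a minimal, self-contained sketch of the failure mode being described (this is hypothetical code, not zeldarose's or Lightning's actual implementation; `should_save` is a made-up helper). A step-period checkpoint is typically gated on a modulo check against the trainer's global step; if that counter stays stuck at 0, the check succeeds on every pass and a checkpoint is written endlessly:

```python
def should_save(global_step: int, save_period: int) -> bool:
    """Return True when a checkpoint should be written at this step."""
    return global_step % save_period == 0

# Normal training: the step counter advances, so a checkpoint is
# written once every `save_period` steps.
save_steps = [s for s in range(1, 20001) if should_save(s, 10000)]
# save_steps == [10000, 20000]

# Failure mode: if the global step is stuck at 0, every check
# satisfies the condition (0 % 10000 == 0), so a checkpoint is
# saved on each pass: an endless save loop at step 0.
stuck_checks = [should_save(0, 10000) for _ in range(3)]
# stuck_checks == [True, True, True]
```

This matches the symptom above: the loop sits at the 0th step and keeps re-saving the same checkpoint instead of advancing.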