[Bug] Saving checkpoints on a steps basis instead of on an epoch basis creates an infinite loop #30

Closed
pjox opened this issue Jul 19, 2022 · 5 comments
Labels
bug Something isn't working needs info

Comments

@pjox

pjox commented Jul 19, 2022

When training on a corpus larger than 20 GB, it can make more sense to save checkpoints every N steps rather than every epoch, and zeldarose already provides an option for this. However, running with --step-save-period 10000 produces what seems to be an infinite loop at step 0, where the same checkpoint is saved over and over:

[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:14 INFO:  Saving intermediate model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:14 INFO:  Saving model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:15 INFO:  Saving tokenizer to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:15 INFO:  Saving intermediate model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:15 INFO:  Saving model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:16 INFO:  Saving tokenizer to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:16 INFO:  Saving intermediate model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:16 INFO:  Saving model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:17 INFO:  Saving tokenizer to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:17 INFO:  Saving intermediate model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:17 INFO:  Saving model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:17 INFO:  Saving tokenizer to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:18 INFO:  Saving intermediate model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:18 INFO:  Saving model to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0
[zeldarose (0 [0@r9i3n6])] 2022-07-06T17:28:18 INFO:  Saving tokenizer to /gpfsssd/scratch/rech/rcy/uok84lv/fr_clean_dedup/949622/partway_models/step_0

Thanks in advance for the help 😄

@pjox pjox changed the title Saving checkpoints on a steps basis instead of on an epoch basis creates an infinite loop [Bug] Saving checkpoints on a steps basis instead of on an epoch basis creates an infinite loop Jul 19, 2022
@LoicGrobol LoicGrobol self-assigned this Jul 26, 2022
@LoicGrobol LoicGrobol added the bug Something isn't working label Jul 26, 2022
@LoicGrobol
Owner

Hi @pjox, I can't reproduce this with e.g.

zeldarose-transformer --cache-dir local/cache --pretrained-model "lgrobol/roberta-minuscule" --step-save-period 10000 "lgrobol/openminuscule:text:train" --out-dir local/model

Can you provide a MWE?

@pjox
Author

pjox commented Jul 27, 2022

Let me put together a working example. In the meantime, here is the list of all packages in my environment after a fresh install (branch datasets-2.4), in case it helps:

# packages in environment at /gpfswork/rech/project/user/miniconda3/envs/zeldarose-test:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
absl-py                   1.2.0                    pypi_0    pypi
aiohttp                   3.8.1                    pypi_0    pypi
aiosignal                 1.2.0                    pypi_0    pypi
async-timeout             4.0.2                    pypi_0    pypi
attrs                     21.4.0                   pypi_0    pypi
bzip2                     1.0.8                h7b6447c_0
ca-certificates           2022.07.19           h06a4308_0
cachetools                5.2.0                    pypi_0    pypi
certifi                   2022.6.15       py310h06a4308_0
charset-normalizer        2.1.0                    pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
commonmark                0.9.1                    pypi_0    pypi
datasets                  2.4.0                    pypi_0    pypi
dill                      0.3.5.1                  pypi_0    pypi
filelock                  3.7.1                    pypi_0    pypi
frozenlist                1.3.0                    pypi_0    pypi
fsspec                    2022.5.0                 pypi_0    pypi
google-auth               2.9.1                    pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
grpcio                    1.47.0                   pypi_0    pypi
huggingface-hub           0.8.1                    pypi_0    pypi
idna                      3.3                      pypi_0    pypi
joblib                    1.1.0                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1
libffi                    3.3                  he6710b0_2
libgcc-ng                 11.2.0               h1234567_1
libgomp                   11.2.0               h1234567_1
libstdcxx-ng              11.2.0               h1234567_1
libuuid                   1.0.3                h7f8727e_2
loguru                    0.6.0                    pypi_0    pypi
markdown                  3.4.1                    pypi_0    pypi
markupsafe                2.1.1                    pypi_0    pypi
multidict                 6.0.2                    pypi_0    pypi
multiprocess              0.70.13                  pypi_0    pypi
ncurses                   6.3                  h5eee18b_3
numpy                     1.23.1                   pypi_0    pypi
oauthlib                  3.2.0                    pypi_0    pypi
openssl                   1.1.1q               h7f8727e_0
packaging                 21.3                     pypi_0    pypi
pandas                    1.4.3                    pypi_0    pypi
pip                       22.1.2          py310h06a4308_0
protobuf                  3.19.4                   pypi_0    pypi
pyarrow                   8.0.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pydantic                  1.9.1                    pypi_0    pypi
pydeprecate               0.3.2                    pypi_0    pypi
pygments                  2.12.0                   pypi_0    pypi
pyparsing                 3.0.9                    pypi_0    pypi
python                    3.10.4               h12debd9_0
python-dateutil           2.8.2                    pypi_0    pypi
pytorch-lightning         1.6.5                    pypi_0    pypi
pytz                      2022.1                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
readline                  8.1.2                h7f8727e_1
regex                     2022.7.25                pypi_0    pypi
requests                  2.28.1                   pypi_0    pypi
requests-oauthlib         1.3.1                    pypi_0    pypi
responses                 0.18.0                   pypi_0    pypi
rich                      12.5.1                   pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
sacremoses                0.0.53                   pypi_0    pypi
setuptools                61.2.0          py310h06a4308_0
six                       1.16.0                   pypi_0    pypi
sqlite                    3.38.5               hc218d9a_0
tensorboard               2.9.1                    pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0
tokenizers                0.12.1                   pypi_0    pypi
toml                      0.10.2                   pypi_0    pypi
torch                     1.12.0                   pypi_0    pypi
torchmetrics              0.9.3                    pypi_0    pypi
tqdm                      4.64.0                   pypi_0    pypi
transformers              4.20.1                   pypi_0    pypi
typing-extensions         4.3.0                    pypi_0    pypi
tzdata                    2022a                hda174b7_0
urllib3                   1.26.11                  pypi_0    pypi
werkzeug                  2.2.0                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0
xxhash                    3.0.0                    pypi_0    pypi
xz                        5.2.5                h7f8727e_1
yarl                      1.7.2                    pypi_0    pypi
zeldarose                 0.5.0                    pypi_0    pypi
zlib                      1.2.12               h7f8727e_2

@LoicGrobol
Owner

Ping @pjox, can you retry on the main branch? I've just pushed a ton of modifications and I'm curious whether this still happens.

@pjox
Author

pjox commented Feb 24, 2023

I'll try next week and let you know, thanks a lot! 😄

@LoicGrobol
Owner

LoicGrobol commented Oct 5, 2023

OK, I can actually reproduce the issue, although the context that triggers it is still unclear. I think the problem is upstream: Lightning fails to update Trainer.global_step correctly (it stays at 0).
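
For illustration, here is a minimal sketch of how a step counter stuck at 0 could produce the repeated step_0 saves seen in the log. It assumes the saver checks `global_step % period == 0` after every batch; that check, and the names `FakeTrainer`, `StepSaver`, `GuardedStepSaver` and `run`, are hypothetical stand-ins and not zeldarose's or Lightning's actual code. Since 0 % 10000 == 0, a counter that never advances would pass the check on every batch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeTrainer:
    """Hypothetical stand-in for a trainer; only exposes a step counter."""
    global_step: int = 0


@dataclass
class StepSaver:
    """Assumed save logic: save whenever global_step is a multiple of period."""
    period: int
    saved: List[str] = field(default_factory=list)

    def on_batch_end(self, trainer: FakeTrainer) -> None:
        if trainer.global_step % self.period == 0:
            self.saved.append(f"step_{trainer.global_step}")


@dataclass
class GuardedStepSaver(StepSaver):
    """Same check, but refuses to save the same step twice in a row."""
    last_saved_step: Optional[int] = None

    def on_batch_end(self, trainer: FakeTrainer) -> None:
        if trainer.global_step == self.last_saved_step:
            return  # already saved this step, skip
        if trainer.global_step % self.period == 0:
            self.saved.append(f"step_{trainer.global_step}")
            self.last_saved_step = trainer.global_step


def run(saver: StepSaver, n_batches: int, step_advances: bool) -> List[str]:
    """Feed n_batches to the saver; optionally leave global_step stuck at 0."""
    trainer = FakeTrainer()
    for _ in range(n_batches):
        if step_advances:
            trainer.global_step += 1
        saver.on_batch_end(trainer)
    return saver.saved


if __name__ == "__main__":
    print(run(StepSaver(period=2), 5, step_advances=True))
    print(run(StepSaver(period=2), 5, step_advances=False))
    print(run(GuardedStepSaver(period=2), 5, step_advances=False))
```

The guarded variant only hides the symptom: if the counter really stays at 0, no later checkpoint would ever be written, so the counter itself still needs fixing upstream.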
