FSDP - TypeError: load_state_dict() got an unexpected keyword argument 'strict' #18511

shrinath-suresh · 2022-08-07T08:47:33Z

System Info

- `transformers` version: 4.22.0.dev0
- Platform: Linux-5.4.0-1072-aws-x86_64-with-debian-buster-sid
- Python version: 3.7.10
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@sgugger

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Steps to reproduce the behaviour:

1. Clone transformers: git clone https://github.com/huggingface/transformers.git
2. Move to the transformers folder: cd transformers
3. Install from source: pip install .
4. Move to the image-classification example: cd examples/pytorch/image-classification
5. Train the model using FSDP:

torchrun --nproc_per_node=4 run_image_classification.py \
    --dataset_name beans \
    --output_dir ./beans_outputs/ \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --learning_rate 2e-5 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --logging_strategy steps \
    --logging_steps 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --save_total_limit 3 \
    --seed 1337 \
    --fsdp "full_shard auto_wrap"

Expected behavior

The model should be fine-tuned and saved successfully.

However, the following error is produced:

[INFO|trainer.py:1949] 2022-08-07 08:35:00,771 >> Loading best model from ./beans_outputs/checkpoint-165 (score: 0.19044387340545654).
Traceback (most recent call last):
  File "run_image_classification.py", line 384, in <module>
    main()
  File "run_image_classification.py", line 358, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
    self._load_best_model()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
    load_result = model.load_state_dict(state_dict, strict=False)
TypeError: load_state_dict() got an unexpected keyword argument 'strict'
(The other three ranks print the same traceback, interleaved in the log, each ending in the same TypeError.)

Full example log -
fsdp_error.txt

Torch environment details:

PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.4
Libc version: glibc-2.10

Python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-1072-aws-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mlflow-torchserve==0.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.6
[pip3] numpydoc==1.1.0
[pip3] pytorch-kfp-components==0.1.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.12.0
[pip3] torch-model-archiver==0.6.0
[pip3] torch-optimizer==0.1.0
[pip3] torch-workflow-archiver==0.2.4b20220511
[pip3] torchdata==0.4.0
[pip3] torchmetrics==0.7.3
[pip3] torchserve==0.6.0
[pip3] torchtext==0.13.0
[pip3] torchvision==0.13.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py37he8ac12f_0  
[conda] mkl_fft                   1.2.1            py37h54f3939_0  
[conda] mkl_random                1.1.1            py37h0573a6f_0  
[conda] mlflow-torchserve         0.2.0                    pypi_0    pypi
[conda] numpy                     1.21.6                   pypi_0    pypi
[conda] numpydoc                  1.1.0              pyhd3eb1b0_1  
[conda] pytorch-kfp-components    0.1.0                    pypi_0    pypi
[conda] pytorch-lightning         1.6.5                    pypi_0    pypi
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch                     1.12.0                   pypi_0    pypi
[conda] torch-model-archiver      0.6.0                    pypi_0    pypi
[conda] torch-optimizer           0.1.0                    pypi_0    pypi
[conda] torch-workflow-archiver   0.2.4b20220511           pypi_0    pypi
[conda] torchdata                 0.4.0                    pypi_0    pypi
[conda] torchmetrics              0.7.3                    pypi_0    pypi
[conda] torchserve                0.6.0                    pypi_0    pypi
[conda] torchtext                 0.13.0                   pypi_0    pypi
[conda] torchvision               0.13.0                   pypi_0    pypi

The issue seems to have appeared after this commit.


pacman100 · 2022-08-08T12:50:13Z

Hello @shrinath-suresh, this issue has to be fixed on the PyTorch side. The issue raised with PyTorch has been linked above.

pacman100 · 2022-08-08T13:20:44Z

Also, when using auto_wrap, please specify either --fsdp_transformer_layer_cls_to_wrap <value> or --fsdp_min_num_params <number> as part of the command-line arguments. This is what enables sharding of parameters, gradients and optimizer state across GPUs, so that peak memory usage drops substantially and you get the most out of FSDP. For more details, please refer to https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html and https://pytorch.org/docs/1.12/fsdp.html?highlight=fsdp#module-torch.distributed.fsdp.

The 🤗 Trainer FSDP integration doc is being updated to reflect the recent changes in PR #18521. Please refer to it for more details.

rohan-varma · 2022-08-10T18:20:50Z

Thanks for raising this issue! I responded on the PyTorch side: pytorch/pytorch#82963. I'm not sure whether HF uses nightlies/latest PyTorch or a stable version, though. If we can't get PyTorch updated in HF to include the fix, could we work around this by changing

model.load_state_dict(state_dict, strict=False)

to

model.load_state_dict(state_dict, False)
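
To illustrate why the positional form would sidestep the error, here is a minimal pure-Python sketch. It assumes the failing wrapper defines load_state_dict(self, state_dict, *args) with no keyword passthrough, which matches the error message in the report. The classes Inner and PositionalOnlyWrapper are hypothetical stand-ins, not the actual PyTorch or FSDP code.

```python
class Inner:
    """Hypothetical stand-in for a plain module: load_state_dict(state_dict, strict=True)."""

    def __init__(self):
        self.state = {}

    def load_state_dict(self, state_dict, strict=True):
        self.state = dict(state_dict)
        return strict  # returned only so the caller can see which value arrived


class PositionalOnlyWrapper:
    """Hypothetical wrapper that forwards extra positional arguments but no keywords."""

    def __init__(self, module):
        self.module = module

    def load_state_dict(self, state_dict, *args):
        return self.module.load_state_dict(state_dict, *args)


wrapped = PositionalOnlyWrapper(Inner())

# Keyword form: rejected by the wrapper's signature before reaching the inner module.
try:
    wrapped.load_state_dict({"w": 1}, strict=False)
    keyword_call_failed = False
    keyword_error = ""
except TypeError as err:
    keyword_call_failed = True
    keyword_error = str(err)

# Positional form: False travels through *args and lands in the inner `strict` parameter.
positional_result = wrapped.load_state_dict({"w": 1}, False)
```

Under that assumption, the keyword call raises the same kind of TypeError as in the report, while the positional call reaches the inner module with strict set to False.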

shrinath-suresh · 2022-08-11T03:19:39Z

@rohan-varma Thank you very much. I applied the fix as given in the screenshot and compiled from source. The model is now being saved in FSDP mode.

Image and logs attached:

vit_fsdp_with_fix.txt

github-actions · 2022-09-06T15:01:43Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

rohan-varma · 2022-10-05T22:31:28Z

This should be fixed in PyTorch nightly now: pytorch/pytorch#83309

shrinath-suresh added the bug label Aug 7, 2022

sgugger assigned pacman100 Aug 8, 2022

pacman100 mentioned this issue Aug 8, 2022

[FSDP] TypeError: load_state_dict() got an unexpected keyword argument 'strict' pytorch/pytorch#82963

Closed

patrickvonplaten mentioned this issue Aug 19, 2022

Add Google's Trillson Audio Classification Model #17387

Open

5 tasks

github-actions bot closed this as completed Sep 14, 2022
