FSDP - TypeError: load_state_dict() got an unexpected keyword argument 'strict' #18511

@shrinath-suresh

System Info

- `transformers` version: 4.22.0.dev0
- Platform: Linux-5.4.0-1072-aws-x86_64-with-debian-buster-sid
- Python version: 3.7.10
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behaviour:

  1. Clone transformers: git clone https://github.com/huggingface/transformers.git
  2. Move to the transformers folder: cd transformers
  3. Install from source: pip install .
  4. Move to the image-classification example: cd examples/pytorch/image-classification
  5. Train the model using FSDP:

     torchrun --nproc_per_node=4 run_image_classification.py \
       --dataset_name beans \
       --output_dir ./beans_outputs/ \
       --remove_unused_columns False \
       --do_train \
       --do_eval \
       --learning_rate 2e-5 \
       --num_train_epochs 5 \
       --per_device_train_batch_size 8 \
       --per_device_eval_batch_size 8 \
       --logging_strategy steps \
       --logging_steps 10 \
       --evaluation_strategy epoch \
       --save_strategy epoch \
       --load_best_model_at_end True \
       --save_total_limit 3 \
       --seed 1337 \
       --fsdp "full_shard auto_wrap"

Expected behavior

Model should get finetuned and saved successfully.

However, training fails with the following error when the best model is loaded at the end:

[INFO|trainer.py:1949] 2022-08-07 08:35:00,771 >> Loading best model from ./beans_outputs/checkpoint-165 (score: 0.19044387340545654).
Traceback (most recent call last):
  File "run_image_classification.py", line 384, in <module>
    main()
  File "run_image_classification.py", line 358, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
    self._load_best_model()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
    load_result = model.load_state_dict(state_dict, strict=False)
TypeError: load_state_dict() got an unexpected keyword argument 'strict'
(The other three ranks print the same traceback; their output is interleaved in the log.)

Full example log -
fsdp_error.txt
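
The message is consistent with a wrapper whose load_state_dict accepts extra arguments only positionally (via *args), so that strict=False passed by keyword is rejected. The sketch below illustrates that failure mode and one possible guard in plain Python. PlainModule, WrappedModule and load_non_strict are hypothetical stand-ins, not the real FSDP or Trainer code, and the actual FSDP signature in PyTorch 1.12 should be checked against its source.

```python
import inspect


class PlainModule:
    """Hypothetical stand-in for a module whose loader accepts `strict` by keyword."""

    def load_state_dict(self, state_dict, strict=True):
        return ("plain", strict)


class WrappedModule:
    """Hypothetical stand-in for a wrapper that forwards extras via *args only."""

    def __init__(self, inner):
        self.inner = inner

    def load_state_dict(self, state_dict, *args):
        # There is no **kwargs here, so `strict=False` as a keyword is rejected.
        return self.inner.load_state_dict(state_dict, *args)


def load_non_strict(model, state_dict):
    """Pass strict=False by keyword if the signature names it, else positionally."""
    params = inspect.signature(model.load_state_dict).parameters
    if "strict" in params:
        return model.load_state_dict(state_dict, strict=False)
    return model.load_state_dict(state_dict, False)


wrapped = WrappedModule(PlainModule())

# Calling the wrapper with the keyword reproduces the same kind of TypeError.
try:
    wrapped.load_state_dict({}, strict=False)
    keyword_error = None
except TypeError as exc:
    keyword_error = str(exc)

print(keyword_error)
print(load_non_strict(wrapped, {}))
print(load_non_strict(PlainModule(), {}))
```

If the assumption holds, passing the flag positionally in _load_best_model would avoid the error, at the cost of relying on argument order.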

Torch environment details:

PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.4
Libc version: glibc-2.10

Python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-1072-aws-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mlflow-torchserve==0.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.6
[pip3] numpydoc==1.1.0
[pip3] pytorch-kfp-components==0.1.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.12.0
[pip3] torch-model-archiver==0.6.0
[pip3] torch-optimizer==0.1.0
[pip3] torch-workflow-archiver==0.2.4b20220511
[pip3] torchdata==0.4.0
[pip3] torchmetrics==0.7.3
[pip3] torchserve==0.6.0
[pip3] torchtext==0.13.0
[pip3] torchvision==0.13.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py37he8ac12f_0  
[conda] mkl_fft                   1.2.1            py37h54f3939_0  
[conda] mkl_random                1.1.1            py37h0573a6f_0  
[conda] mlflow-torchserve         0.2.0                    pypi_0    pypi
[conda] numpy                     1.21.6                   pypi_0    pypi
[conda] numpydoc                  1.1.0              pyhd3eb1b0_1  
[conda] pytorch-kfp-components    0.1.0                    pypi_0    pypi
[conda] pytorch-lightning         1.6.5                    pypi_0    pypi
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch                     1.12.0                   pypi_0    pypi
[conda] torch-model-archiver      0.6.0                    pypi_0    pypi
[conda] torch-optimizer           0.1.0                    pypi_0    pypi
[conda] torch-workflow-archiver   0.2.4b20220511           pypi_0    pypi
[conda] torchdata                 0.4.0                    pypi_0    pypi
[conda] torchmetrics              0.7.3                    pypi_0    pypi
[conda] torchserve                0.6.0                    pypi_0    pypi
[conda] torchtext                 0.13.0                   pypi_0    pypi
[conda] torchvision               0.13.0                   pypi_0    pypi

The issue seems to have appeared after this commit.
