-
Notifications
You must be signed in to change notification settings - Fork 31.3k
Closed
Closed
Copy link
Labels
Description
System Info
- `transformers` version: 4.22.0.dev0
- Platform: Linux-5.4.0-1072-aws-x86_64-with-debian-buster-sid
- Python version: 3.7.10
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the
examplesfolder (such as GLUE/SQuAD, ...) - My own task or dataset (give details below)
Reproduction
Steps to reproduce the behaviour:
- Clone transformers -
git clone https://github.com/huggingface/transformers.git - move to transformers folder -
cd transformers - Install from source -
pip install . - Move to image-classification example -
cd examples/pytorch/image-classification - Train the model using fsdp
torchrun --nproc_per_node=4 run_image_classification.py --dataset_name beans --output_dir ./beans_outputs/ --remove_unused_columns False --do_train --do_eval --learning_rate 2e-5 --num_train_epochs 5 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --logging_strategy steps --logging_steps 10 --evaluation_strategy epoch --save_strategy epoch --load_best_model_at_end True --save_total_limit 3 --seed 1337 --fsdp "full_shard auto_wrap"
Expected behavior
Model should get finetuned and saved successfully.
However, the following error is produced
[INFO|trainer.py:1949] 2022-08-07 08:35:00,771 >> Loading best model from ./beans_outputs/checkpoint-165 (score: 0.19044387340545654).
Traceback (most recent call last):
File "run_image_classification.py", line 384, in <module>
main()
File "run_image_classification.py", line 358, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
self._load_best_model()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
load_result = model.load_state_dict(state_dict, strict=False)
TypeError: load_state_dict() got an unexpected keyword argument 'strict'
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "run_image_classification.py", line 384, in <module>
File "run_image_classification.py", line 384, in <module>
File "run_image_classification.py", line 384, in <module>
main()main()
File "run_image_classification.py", line 358, in main
File "run_image_classification.py", line 358, in main
main()
File "run_image_classification.py", line 358, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
ignore_keys_for_eval=ignore_keys_for_eval,ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
self._load_best_model()self._load_best_model()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
self._load_best_model()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
load_result = model.load_state_dict(state_dict, strict=False)load_result = model.load_state_dict(state_dict, strict=False)
TypeErrorTypeError: : load_state_dict() got an unexpected keyword argument 'strict'load_state_dict() got an unexpected keyword argument 'strict'
load_result = model.load_state_dict(state_dict, strict=False)
TypeError: load_state_dict() got an unexpected keyword argument 'strict'
Full example log -
fsdp_error.txt
Torch environment details:
PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.4
Libc version: glibc-2.10
Python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-1072-aws-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mlflow-torchserve==0.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.6
[pip3] numpydoc==1.1.0
[pip3] pytorch-kfp-components==0.1.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.12.0
[pip3] torch-model-archiver==0.6.0
[pip3] torch-optimizer==0.1.0
[pip3] torch-workflow-archiver==0.2.4b20220511
[pip3] torchdata==0.4.0
[pip3] torchmetrics==0.7.3
[pip3] torchserve==0.6.0
[pip3] torchtext==0.13.0
[pip3] torchvision==0.13.0
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py37he8ac12f_0
[conda] mkl_fft 1.2.1 py37h54f3939_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] mlflow-torchserve 0.2.0 pypi_0 pypi
[conda] numpy 1.21.6 pypi_0 pypi
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch-kfp-components 0.1.0 pypi_0 pypi
[conda] pytorch-lightning 1.6.5 pypi_0 pypi
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch 1.12.0 pypi_0 pypi
[conda] torch-model-archiver 0.6.0 pypi_0 pypi
[conda] torch-optimizer 0.1.0 pypi_0 pypi
[conda] torch-workflow-archiver 0.2.4b20220511 pypi_0 pypi
[conda] torchdata 0.4.0 pypi_0 pypi
[conda] torchmetrics 0.7.3 pypi_0 pypi
[conda] torchserve 0.6.0 pypi_0 pypi
[conda] torchtext 0.13.0 pypi_0 pypi
[conda] torchvision 0.13.0 pypi_0 pypi
the issue seems to be appearing after this commit .