
[BUG] Finetune T5 11B: the process is killed and exits with return code = -9 #2946

@zhilizju

Description


Describe the bug
Hi, I want to finetune the T5 model (11B), but the process is killed and exits with return code = -9:

[2023-03-05 06:40:25,173] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698707
[2023-03-05 06:40:27,249] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698708
[2023-03-05 06:40:27,250] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698709
[2023-03-05 06:40:28,626] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698710
[2023-03-05 06:40:30,045] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698711
[2023-03-05 06:40:31,498] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698712
[2023-03-05 06:40:32,912] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698713
[2023-03-05 06:40:34,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698714
[2023-03-05 06:40:35,705] [ERROR] [launch.py:324:sigkill_handler] ['/home/lizhi/anaconda3/envs/tk-instruct/bin/python', '-u', 'src/run_s2s.py', '--local_rank=7', '--do_train', '--do_predict', '--predict_with_generate', '--model_name_or_path', '/home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt', '--max_source_length', '1024', '--max_target_length', '128', '--generation_max_length', '128', '--max_num_instances_per_task', '1', '--max_num_instances_per_eval_task', '1', '--add_task_name', 'False', '--add_task_definition', 'True', '--num_pos_examples', '2', '--num_neg_examples', '0', '--add_explanation', 'False', '--tk_instruct', 'False', '--data_dir', 'data/splits/default', '--task_dir', 'data/tasks', '--output_dir', 'output/', '--overwrite_output_dir', '--cache_dir', './cache/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '5e-05', '--num_train_epochs', '1', '--lr_scheduler_type', 'constant', '--warmup_steps', '0', '--logging_strategy', 'steps', '--logging_steps', '500', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2500', '--deepspeed', 'ds_configs/11b_stage3_offload.config', '--bf16', '--run_name', 't5-experiment'] exits with return code = -9
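
For context, return code -9 means the workers were terminated with SIGKILL, which on Linux is most often the kernel OOM killer reclaiming host memory; that would fit this setup, since the ZeRO-3 config below offloads both optimizer states and parameters to CPU. A quick way to confirm on the node (a sketch; the exact commands depend on your distro):

# Look for OOM-killer entries in the kernel log around the crash time
dmesg -T | grep -i -E 'out of memory|killed process' | tail -n 20
# On systemd-based systems:
journalctl -k --since '1 hour ago' | grep -i oom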

To Reproduce
Steps to reproduce the behavior:

The code is based on the project https://github.com/yizhongw/Tk-Instruct.
Just add a new config (name it 11b_stage3_offload.config) under the folder ds_configs.
The content of the new config is below (a memory-estimate sketch follows it):

{
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": false
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": false
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
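
To check up front whether this config can even fit an 11B model on 8x 48 GB GPUs plus host RAM, DeepSpeed ships a ZeRO-3 memory estimator. A minimal sketch (the import path is what the DeepSpeed/HF docs show and may differ across versions; google/t5-xxl-lm-adapt stands in for the local checkpoint path):

python -c "
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live
# Load on CPU only to count parameters; point this at your local t5-xxl-lm-adapt copy.
# Note: loading the 11B checkpoint itself already needs tens of GB of host RAM.
model = AutoModelForSeq2SeqLM.from_pretrained('google/t5-xxl-lm-adapt')
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
"

The printed table lists the per-GPU and per-node CPU memory needed for params, grads, and optimizer states under the different offload options, which you can compare against the 48 GB cards and the machine's RAM.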

Then modify two parameters of the script https://github.com/yizhongw/Tk-Instruct/blob/main/scripts/train_tk_instruct.sh: replace

--model_name_or_path google/t5-xl-lm-adapt
--deepspeed ds_configs/11b_stage3_offload.config

with

--model_name_or_path google/t5-xxl-lm-adapt
--deepspeed ds_configs/11b_stage3_offload.config

and add --bf16.

Expected behavior
I hope I can finetune this model.

ds_report output
(screenshot of ds_report output attached)

Screenshots
(screenshot attached)

System info (please complete the following information):

  • OS: [e.g. Ubuntu 20.04]
  • GPU count and types: one machine with 8x RTX 6000, 48 GB each
  • Python version 3.8.16

Launcher context
#!/bin/bash
set -x

export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export TRANSFORMERS_CACHE=/home/lizhi/.cache/huggingface

port=$(shuf -i25000-30000 -n1)

deepspeed --master_port $port src/run_s2s.py \
    --do_train \
    --do_predict \
    --predict_with_generate \
    --model_name_or_path google/t5-xxl-lm-adapt \
    --max_source_length 1024 \
    --max_target_length 128 \
    --generation_max_length 128 \
    --max_num_instances_per_task 1 \
    --max_num_instances_per_eval_task 1 \
    --add_task_name False \
    --add_task_definition True \
    --num_pos_examples 2 \
    --num_neg_examples 0 \
    --add_explanation False \
    --tk_instruct False \
    --data_dir data/splits/default \
    --task_dir data/tasks \
    --output_dir output/ \
    --overwrite_output_dir \
    --cache_dir ./cache/ \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-05 \
    --num_train_epochs 1 \
    --lr_scheduler_type constant \
    --warmup_steps 0 \
    --logging_strategy steps \
    --logging_steps 500 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2500 \
    --deepspeed ds_configs/11b_stage3_offload.config \
    --bf16 \
    --run_name t5-experiment
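
If host RAM is the bottleneck, you can usually see it fill up while the 8 ranks materialize the model and offload it; with stage-3 CPU offload each rank keeps its shard of the fp32 optimizer states (and the offloaded parameters) in host memory, so an 11B model can need well over 100 GB of CPU RAM in total. A simple way to watch this from a second shell on the same node while the job starts:

# Host-wide memory usage every 2 seconds
watch -n 2 free -h
# Resident memory of the training ranks, largest first
watch -n 2 "ps -C python -o pid,rss,cmd --sort=-rss | head"

If free memory drops to near zero right before the crash, the OOM killer is almost certainly what is sending the -9.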

Labels: bug, training