RoBERTa large 8x run failed #15
Closed
Description
With the following command, roberta-large fails on an 8x (8-device) run:
python ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path roberta-large --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --per_device_eval_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./roberta_large_8x_bf16_lazy --use_habana --use_lazy_mode
To make the issue easier to reproduce, add the following argument to the command:
--save_steps 5
The failure is related to the save step. We still need to find out which save triggers it: the configuration, the checkpoint, the tokenizer config, or the special tokens.
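One way to narrow this down is to run each save on its own and record which one raises. The sketch below is only an illustration of that idea, not code from the repository: `probe_save_steps` and the step names are hypothetical, and the stand-in callables in the demo would need to be swapped for the real save calls (for example the `save_pretrained` methods of the model, its config, and the tokenizer in `transformers`). It runs in a single process, so it may not reproduce a failure that only appears in the 8x MPI run.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def probe_save_steps(steps: Dict[str, Callable[[Path], None]]) -> Dict[str, str]:
    """Run each save step in its own temporary directory and record the outcome.

    `steps` maps a label (hypothetical, chosen by the caller) to a callable
    that takes a target directory and performs one save.
    """
    results: Dict[str, str] = {}
    for name, save in steps.items():
        with tempfile.TemporaryDirectory() as tmp:
            try:
                save(Path(tmp))
                results[name] = "ok"
            except Exception as exc:  # report the failure instead of aborting
                results[name] = f"failed: {type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Stand-in steps for illustration only. In a real check these would be
    # assumed replacements such as:
    #   "config":     lambda d: model.config.save_pretrained(d)
    #   "checkpoint": lambda d: model.save_pretrained(d)
    #   "tokenizer":  lambda d: tokenizer.save_pretrained(d)
    def write_dummy_config(target: Path) -> None:
        (target / "config.json").write_text("{}")

    def failing_checkpoint(target: Path) -> None:
        raise RuntimeError("simulated checkpoint failure")

    report = probe_save_steps(
        {"config": write_dummy_config, "checkpoint": failing_checkpoint}
    )
    for step, outcome in report.items():
        print(f"{step}: {outcome}")
```

Each step gets a fresh directory so that a partial write from one save cannot mask or cause a failure in the next.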