RoBERTa large 8x run failed #15

@libinta

With the following command, roberta-large fails when run on 8 devices (8x):

python ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path roberta-large --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --per_device_eval_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./roberta_large_8x_bf16_lazy --use_habana --use_lazy_mode

To make the issue easier to reproduce, add the following argument to the command:
--save_steps 5

The failure is related to the save step. We still need to find out which save is responsible: the configuration, the checkpoint, the tokenizer config, or the special tokens.
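One possible way to narrow this down is to run each save on its own and record which one raises. The sketch below is only an illustration and is not taken from run_qa.py or the trainer: the helper name `isolate_failing_save` and the step names are hypothetical, and it assumes each save can be expressed as a callable that takes a target directory.

```python
import tempfile
import traceback
from pathlib import Path
from typing import Callable, Dict


def isolate_failing_save(
    steps: Dict[str, Callable[[Path], None]], root: Path
) -> Dict[str, str]:
    """Run each save step in its own subdirectory and record the outcome.

    `steps` maps a label (hypothetical names such as "config" or
    "tokenizer") to a callable that writes into the directory it is given.
    """
    results: Dict[str, str] = {}
    for name, save in steps.items():
        target = root / name
        target.mkdir(parents=True, exist_ok=True)
        try:
            save(target)
            results[name] = "ok"
        except Exception:
            # Keep only the last traceback line (exception type and message)
            # so the failing step is easy to spot in multi-process logs.
            last_line = traceback.format_exc().strip().splitlines()[-1]
            results[name] = "failed: " + last_line
    return results


if __name__ == "__main__":
    # Stand-in steps for demonstration only; replace them with the real saves.
    def write_ok(path: Path) -> None:
        (path / "dummy.json").write_text("{}")

    def write_bad(path: Path) -> None:
        raise RuntimeError("simulated save failure")

    with tempfile.TemporaryDirectory() as tmp:
        report = isolate_failing_save(
            {"config": write_ok, "tokenizer": write_bad}, Path(tmp)
        )
        for step, status in report.items():
            print(f"{step}: {status}")
```

In an actual reproduction, the stand-in callables could be replaced with the `save_pretrained` methods of the config, tokenizer, and model objects, assuming those objects are available at the point where the checkpoint is written. Running this on a single process first, and then under the 8x launch, may show whether the failure is tied to one save or to several processes writing at once.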
