Automatically resume from the latest checkpoint #7996

@zhichenggeng

Description

Checklist

  • I have searched existing issues, and this is a new feature request.

Feature Request Description

Could we support the following behavior: when resume_from_checkpoint is set to True, the args class tries to find the latest checkpoint in output_dir. This is an important feature because in some production environments the submitted job may be interrupted and rerun automatically. In that case, we currently have to find the latest checkpoint manually, put its path in the command, and resubmit the job. With this feature, the job could always resume from the latest checkpoint.

Proposed behavior

  • If resume_from_checkpoint=True
  • And no checkpoint path is explicitly specified
  • Then automatically find the latest checkpoint in output_dir and resume from it
  • If no checkpoint is found, fall back to training from scratch (or raise a clear warning)
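The behavior above could be sketched roughly as follows. This is a minimal, hypothetical helper (not existing API), assuming checkpoint directories follow the "checkpoint-&lt;step&gt;" naming convention that transformers uses:

```python
import os
import re

# Assumption: checkpoint folders are named "checkpoint-<global_step>",
# as in transformers' Trainer output directories.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_latest_checkpoint(output_dir):
    """Return the path of the checkpoint with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        # Caller would fall back to training from scratch (with a warning).
        return None
    _, latest = max(candidates)
    return os.path.join(output_dir, latest)
```

The args-processing code would then call this only when resume_from_checkpoint is the literal True and no explicit path was given, so an explicitly specified checkpoint path keeps its current meaning.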

I can try to start a PR if everyone is busy with other development tasks, but first I'd like to know whether this should be added and whether it is doable. Thanks.

For your reference, transformers already has a get_last_checkpoint function used in Trainer:
https://github.com/huggingface/transformers/blob/393b4b3d28e29b4b05b19b4b7f3242a7fc893637/src/transformers/trainer.py#L2094-L2097
They used to call it in the example scripts but removed it at some point:
https://github.com/huggingface/transformers/blob/3839d5101338bd44d35c09df76da1ec1b21964e2/examples/pytorch/language-modeling/run_clm.py#L321-L334

Pull Request

No response
