Description
Checklist
- I have searched existing issues, and this is a new feature request.
Feature Request Description
Could we support behavior where, when resume_from_checkpoint is set to True, the args class automatically looks for the latest checkpoint in output_dir? This is important because in some production environments a submitted job may be interrupted and rerun automatically. Currently, in that case, we have to manually locate the latest checkpoint, put it in the command, and resubmit the job. With this feature, the job could always rerun from the latest checkpoint.
Proposed behavior
- If resume_from_checkpoint=True
- And no checkpoint path is explicitly specified
- Then automatically find the latest checkpoint in output_dir and resume from it
- If no checkpoint is found, fall back to training from scratch (or raise a clear warning)
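The behavior above could be sketched roughly as follows. This is a minimal illustration, not a concrete implementation proposal: the helper names (`find_last_checkpoint`, `resolve_resume_path`) are hypothetical, and it assumes checkpoints follow the `checkpoint-<step>` naming convention that transformers' Trainer uses.

```python
import os
import re

# Assumes checkpoint directories follow the "checkpoint-<step>" naming
# convention used by transformers' Trainer (hypothetical helper names).
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the checkpoint with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = [
        (int(m.group(1)), name)
        for name in os.listdir(output_dir)
        if (m := _CHECKPOINT_RE.match(name))
        and os.path.isdir(os.path.join(output_dir, name))
    ]
    if not candidates:
        return None
    _, latest = max(candidates)  # highest global step wins
    return os.path.join(output_dir, latest)


def resolve_resume_path(resume_from_checkpoint, output_dir):
    """Map resume_from_checkpoint=True to the latest checkpoint path.

    Explicit paths (strings) pass through unchanged; True triggers
    auto-discovery; None/False disables resuming.
    """
    if resume_from_checkpoint is True:
        last = find_last_checkpoint(output_dir)
        if last is None:
            # Fall back to training from scratch, but warn clearly.
            print(f"Warning: no checkpoint found in {output_dir}, "
                  "training from scratch.")
        return last
    return resume_from_checkpoint or None
```

The pass-through for explicit string paths keeps the current behavior fully backward compatible.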
I can start a PR if everyone is busy with other development tasks, but first I'd like to know whether this should be added and whether it is doable. Thanks.
For reference, transformers already has a get_last_checkpoint function used in Trainer:
https://github.com/huggingface/transformers/blob/393b4b3d28e29b4b05b19b4b7f3242a7fc893637/src/transformers/trainer.py#L2094-L2097
It also used to be called in the example scripts but was removed at some point:
https://github.com/huggingface/transformers/blob/3839d5101338bd44d35c09df76da1ec1b21964e2/examples/pytorch/language-modeling/run_clm.py#L321-L334
Pull Request Information
No response