Follow-up on Issues Regarding Training State Restoration from Interruptions

Hi team,

I would like to follow up on the status of two issues. Both involve incorrect behavior when training resumes after an interruption. The first is that, regardless of the step at which training is interrupted, in most cases some of the data is never trained on after resuming (https://github.com/huggingface/transformers/issues/38939). The second is that the random state is not guaranteed to be consistent after resuming, which may affect random operations in the random sampler or collator and so break consistency with an uninterrupted training run (https://github.com/huggingface/transformers/issues/39215).
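
To illustrate the second issue in isolation, here is a minimal sketch that uses only Python's standard `random` module. It is not the Trainer code and not the reproduction from the linked issue; `draw_batches`, `full_run`, and `resumed_run` are hypothetical names, and the per-step random draw is only a stand-in for whatever a random sampler or collator does. It shows that a resumed run matches an uninterrupted run only if the generator state captured at the interruption point is restored.

```python
import random


def draw_batches(rng, num_steps):
    """Stand-in for sampler/collator randomness: one random draw per step."""
    return [rng.randint(0, 10**6) for _ in range(num_steps)]


def full_run(seed, total_steps):
    """Uninterrupted run: a single generator used from start to finish."""
    rng = random.Random(seed)
    return draw_batches(rng, total_steps)


def resumed_run(seed, total_steps, interrupt_at, restore_state):
    """Run that is interrupted after `interrupt_at` steps and then resumed."""
    rng = random.Random(seed)
    before = draw_batches(rng, interrupt_at)
    saved_state = rng.getstate()  # captured at the interruption point

    # Simulate a fresh process: the generator is re-created from the seed.
    rng = random.Random(seed)
    if restore_state:
        rng.setstate(saved_state)  # continue exactly where the run stopped
    after = draw_batches(rng, total_steps - interrupt_at)
    return before + after


reference = full_run(seed=0, total_steps=8)
consistent = resumed_run(seed=0, total_steps=8, interrupt_at=3, restore_state=True)
inconsistent = resumed_run(seed=0, total_steps=8, interrupt_at=3, restore_state=False)

print(consistent == reference)    # state restored: same draws as the full run
print(inconsistent == reference)  # state not restored: the draws diverge
```

In the run without restoration, the steps after the interruption simply replay the start of the random stream, which is the kind of divergence from a full training run described above.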

I have provided minimal reproducible code, a detailed description of the problem, and a possible set of fixes in the issue descriptions. However, I have not received any further response.

If you believe this direction for a fix is correct, I would be very happy to create PRs to contribute these fixes.

I would appreciate feedback on whether this approach is feasible. Thank you for your time and for your excellent work on this project.

Follow-up on Issues Regarding Training State Restoration from Interruptions #39755