Follow-up on Issues Regarding Training State Restoration from Interruptions #39755

@rangehow

Hi team,

I would like to follow up on the status of two issues. Both involve incorrect behavior when training resumes after an interruption. First, regardless of the step at which training is interrupted, in most cases some amount of data is never trained on (#38939). Second, the random state is not guaranteed to be consistent after resuming, which can affect random operations in the random sampler or collator and so break consistency with an uninterrupted training run (#39215).
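To illustrate the second problem, here is a minimal sketch that uses only Python's standard `random` module. It is not the Trainer's code and not the reproduction from #39215; the function names (`epoch_order`, `full_run`, `resumed_run`) and the checkpoint dictionary are hypothetical and exist only for this illustration. It simulates a sampler that shuffles the data order each epoch, and compares an uninterrupted run with a run that is restarted after one epoch, with and without restoring the saved random state.

```python
import random


def epoch_order(rng, n_items):
    """Return one epoch's data order, drawn from the given RNG."""
    order = list(range(n_items))
    rng.shuffle(order)
    return order


def full_run(seed, n_items, epochs):
    """Uninterrupted training: one RNG is used across all epochs."""
    rng = random.Random(seed)
    return [epoch_order(rng, n_items) for _ in range(epochs)]


def resumed_run(seed, n_items, epochs, stop_after, restore_rng):
    """Training that is interrupted after `stop_after` epochs and restarted."""
    rng = random.Random(seed)
    orders = [epoch_order(rng, n_items) for _ in range(stop_after)]

    # Hypothetical checkpoint: progress counter plus the RNG state.
    checkpoint = {"epoch": stop_after, "rng_state": rng.getstate()}

    # Simulate a fresh process: the RNG is re-created from the seed.
    rng = random.Random(seed)
    if restore_rng:
        # Continue from the exact state the interrupted run had reached.
        rng.setstate(checkpoint["rng_state"])

    for _ in range(checkpoint["epoch"], epochs):
        orders.append(epoch_order(rng, n_items))
    return orders


reference = full_run(seed=0, n_items=100, epochs=3)
with_restore = resumed_run(0, 100, 3, stop_after=1, restore_rng=True)
without_restore = resumed_run(0, 100, 3, stop_after=1, restore_rng=False)

print("matches full run (state restored):", with_restore == reference)
print("matches full run (state not restored):", without_restore == reference)
```

In this sketch, the resumed run reproduces the uninterrupted run only when the saved state is restored. Without restoring it, the restarted process re-seeds the generator and replays the first epoch's order instead of continuing the sequence.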

I have provided minimal reproducible code, a detailed description of the problem, and a possible set of fixes in the issue descriptions. However, I have not received any further response.

If you believe this direction for a fix is correct, I would be very happy to create PRs to contribute these fixes.

I would appreciate feedback on whether this solution is feasible. Thank you for your time and excellent work on this project.
