Description
Hi team,
I would like to follow up on the status of two issues, both of which involve incorrect behavior when training resumes after an interruption. First, regardless of the step at which training is interrupted, in most cases some amount of data is never trained on after resuming (#38939). Second, the random state is not guaranteed to be consistent after resuming, which can affect random operations in the sampler or collator and so break consistency with an uninterrupted training run (#39215).
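To illustrate the second issue in isolation, here is a minimal sketch that uses only Python's standard `random` module. It is not the Trainer code and not the reproduction from #39215; the function names `full_run` and `resumed_run` are hypothetical, and the per-epoch shuffle merely stands in for a random sampler.

```python
import random


def shuffled_epoch(rng, n):
    """Return one epoch's sample order drawn from the given RNG."""
    order = list(range(n))
    rng.shuffle(order)
    return order


def full_run(seed, num_epochs, n):
    """Uninterrupted run: one RNG drives every epoch's shuffle."""
    rng = random.Random(seed)
    return [shuffled_epoch(rng, n) for _ in range(num_epochs)]


def resumed_run(seed, num_epochs, n, stop_after, restore_state):
    """Run interrupted after `stop_after` epochs, then resumed."""
    rng = random.Random(seed)
    orders = [shuffled_epoch(rng, n) for _ in range(stop_after)]
    saved_state = rng.getstate()  # what a checkpoint would need to store

    # Simulate a restart: the new process only re-seeds from scratch.
    rng = random.Random(seed)
    if restore_state:
        rng.setstate(saved_state)  # continue exactly where the run stopped

    orders += [shuffled_epoch(rng, n) for _ in range(stop_after, num_epochs)]
    return orders


reference = full_run(seed=0, num_epochs=3, n=10)
with_restore = resumed_run(0, 3, 10, stop_after=1, restore_state=True)
without_restore = resumed_run(0, 3, 10, stop_after=1, restore_state=False)
```

In this sketch, `with_restore` matches `reference` epoch for epoch, while `without_restore` replays the first epoch's order in the second epoch because the generator was only re-seeded.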
I have provided minimal reproducible code, a detailed description of the problem, and a possible set of fixes in the issue descriptions. However, I have not received any further response.
If you believe this direction for a fix is correct, I would be very happy to create PRs to contribute these fixes.
I hope to get some feedback on whether this solution is feasible. Thank you for your time and excellent work on this project.