Resume at the end of the last trained epoch #547
Merged
SamuelLarkin merged 1 commit into main on Sep 18, 2024
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #547   +/-   ##
=======================================
  Coverage   74.63%   74.63%
=======================================
  Files          46       46
  Lines        3130     3130
  Branches      510      510
=======================================
  Hits         2336     2336
  Misses        693      693
  Partials      101      101
Yes, confirming that the fine-tune checkpoint is resuming from the end of the previous run (50 steps ahead), versus how it was definitely overlapping before. I will open a new ticket for the 50 steps ahead but will close this since it is now resolved. :-)
PR Goal?
Fix proper resuming of text-to-spec training. The state at the end of the last epoch wasn't saved, so resuming was performed from the last saved checkpoint, which was the last checkpoint used for validation. This produced staggered runs, as shown in tensorboard.
Fixes?
#534
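To make the staggering described under "PR Goal?" concrete, here is a small arithmetic sketch. It assumes a simplified model that is not spelled out in this PR: checkpoints are written only at validation time, and the validation interval counts batches within an epoch. The function names are hypothetical; the figures (about 736 batches per epoch, validation every 500 batches) come from the test instructions in this PR.

```python
def last_validation_step(batches_per_epoch: int, val_check_interval: int) -> int:
    """Batch index, within the epoch, of the last validation-time checkpoint.

    Assumption: a checkpoint is only written when validation runs.
    """
    return (batches_per_epoch // val_check_interval) * val_check_interval


def repeated_batches(batches_per_epoch: int, val_check_interval: int) -> int:
    """Batches trained after the last checkpoint, which a resumed run would redo."""
    return batches_per_epoch - last_validation_step(batches_per_epoch, val_check_interval)


if __name__ == "__main__":
    # Figures taken from this PR's test instructions.
    print(last_validation_step(736, 500))  # 500
    print(repeated_batches(736, 500))      # 236
```

Under these assumptions, a run resumed from the validation checkpoint would replay the tail of the epoch, which is consistent with the overlapping curves reported in tensorboard.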
Feedback sought?
merge approval
Priority?
low
Tests added?
None
How to test?
    srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.max_epochs=1

Check the state of the loops:

    python -c 'import torch; import json; m = torch.load("logs_and_checkpoints/FeaturePredictionExperiment/save_on_train_epoch_end/checkpoints/last.ckpt", map_location=torch.device("cpu")); print(json.dumps(m["loops"]["fit_loop"]["epoch_loop.batch_progress"], indent=2))'

This will yield something like the following. Look at the values under current. This run used 11790 training examples split across batches of 16 examples, so one epoch is 11790/16, roughly 736 batches. If we instead see 500, the default val_check_interval, the checkpoint was not saved at the end of the epoch.

    {
      "total": {
        "ready": 4421,
        "completed": 4421,
        "started": 4421,
        "processed": 4421
      },
      "current": {
        "ready": 736,
        "completed": 736,
        "started": 736,
        "processed": 736
      },
      "is_last_batch": true
    }

Try resuming for a second epoch:

    srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt" \
      --config-args training.max_epochs=2

Use tensorboard and check that the second run's training is NOT staggered with your first run.
Confidence?
Good
Version change?
No
Related PRs?
None
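For reviewers who prefer a script over the one-liner in "How to test?", the checkpoint check could be sketched as below. This is a minimal sketch, not part of the PR: the helper names and the script name `check_ckpt.py` are hypothetical, and the checkpoint keys are the ones used in the one-liner above. The expected batch count (736 in the run described here) is passed in by the caller rather than computed.

```python
import json
import sys


def saved_at_epoch_end(batch_progress: dict, batches_per_epoch: int) -> bool:
    """True when the current epoch's completed batches match a full epoch."""
    return batch_progress["current"]["completed"] == batches_per_epoch


def load_batch_progress(ckpt_path: str) -> dict:
    """Read the fit loop's batch progress from a checkpoint file."""
    import torch  # imported lazily so the helper above works without PyTorch

    ckpt = torch.load(ckpt_path, map_location=torch.device("cpu"))
    return ckpt["loops"]["fit_loop"]["epoch_loop.batch_progress"]


if __name__ == "__main__":
    # Hypothetical usage: python check_ckpt.py path/to/last.ckpt 736
    if len(sys.argv) == 3:
        progress = load_batch_progress(sys.argv[1])
        print(json.dumps(progress, indent=2))
        print("saved at epoch end:", saved_at_epoch_end(progress, int(sys.argv[2])))
```

Depending on the installed PyTorch version, `torch.load` may need `weights_only=False` to read a full training checkpoint; treat that as something to verify locally.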