
Resume at the end of the last trained epoch#547

Merged
SamuelLarkin merged 1 commit into main from dev.sl/534_resume
Sep 18, 2024

Conversation

@SamuelLarkin
Collaborator

@SamuelLarkin SamuelLarkin commented Sep 12, 2024

PR Goal?

Fix resuming of text-to-spec training.
The state at the end of the last epoch wasn't saved, so resuming started from the last saved checkpoint, which was the most recent validation checkpoint. This produced staggered runs in TensorBoard.
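To illustrate why resuming from a mid-epoch validation checkpoint staggers the curves, here is a small arithmetic sketch. The numbers are hypothetical, loosely mirroring this run (batch size 16, Lightning's default `val_check_interval` of 500):

```python
# Hypothetical numbers mirroring the run described in this PR.
steps_per_epoch = 736        # ~11790 examples / batch size of 16
val_check_interval = 500     # PyTorch Lightning's default

# Step count at the end of the first epoch.
epoch_end_step = 1 * steps_per_epoch  # 736

# Without the fix, the newest checkpoint is the last validation one,
# which lands mid-epoch rather than at the epoch boundary.
last_val_step = (epoch_end_step // val_check_interval) * val_check_interval  # 500

# Resuming from it re-trains these steps, so the resumed curve in
# TensorBoard overlaps (is staggered with) the first run's curve.
lost_steps = epoch_end_step - last_val_step
print(lost_steps)  # 236
```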

Fixes?

#534

Feedback sought?

merge approval

Priority?

low

Tests added?

None

How to test?

   srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.max_epochs=1

Check the state of the loops

python -c 'import torch; import json; m = torch.load("logs_and_checkpoints/FeaturePredictionExperiment/save_on_train_epoch_end/checkpoints/last.ckpt", map_location=torch.device("cpu")); print(json.dumps(m["loops"]["fit_loop"]["epoch_loop.batch_progress"], indent=2))'

This will yield something like the following; look at the values under current. This run used 11790 training examples split into batches of 16 examples, so one epoch is 11790/16 ≈ 736 batches. If you instead see 500, the default val_check_interval, the checkpoint was not saved at the end of the epoch.

{
  "total": {
    "ready": 4421,
    "completed": 4421,
    "started": 4421,
    "processed": 4421
  },
  "current": {
    "ready": 736,
    "completed": 736,
    "started": 736,
    "processed": 736
  },
  "is_last_batch": true
}
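The batches-per-epoch arithmetic and the epoch-end check above can be sketched in plain Python; here the progress dict stands in for the torch.load(...) result, so no checkpoint file is needed:

```python
# Assumed values from the test run described above.
num_examples = 11790
batch_size = 16

# Integer division matches the observed 736 (the last partial batch is dropped).
batches_per_epoch = num_examples // batch_size
print(batches_per_epoch)  # 736

# Stand-in for m["loops"]["fit_loop"]["epoch_loop.batch_progress"].
progress = {
    "current": {"ready": 736, "completed": 736, "started": 736, "processed": 736},
    "is_last_batch": True,
}

# An epoch-end checkpoint should satisfy both of these.
assert progress["is_last_batch"], "checkpoint was not saved at epoch end"
assert progress["current"]["completed"] == batches_per_epoch
```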

Try resuming for a second epoch.

   srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt" \
      --config-args training.max_epochs=2

Use tensorboard and check that the second run's training is NOT staggered with your first run.

tensorboard --port=2024 --logdir=logs_and_checkpoints  --bind_all

Confidence?

Good

Version change?

No

Related PRs?

None

@semanticdiff-com

semanticdiff-com bot commented Sep 12, 2024

Review changes with SemanticDiff.

Analyzed 1 of 1 files.

Filename Status
✔️ everyvoice/base_cli/helpers.py Analyzed

@github-actions
Contributor

github-actions bot commented Sep 17, 2024

CLI load time: 0:00.23
Pull Request HEAD: 7cce58cb74a59ca919153ce22f72e49f4ee64024
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

@codecov

codecov bot commented Sep 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.63%. Comparing base (3a36240) to head (7cce58c).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #547   +/-   ##
=======================================
  Coverage   74.63%   74.63%           
=======================================
  Files          46       46           
  Lines        3130     3130           
  Branches      510      510           
=======================================
  Hits         2336     2336           
  Misses        693      693           
  Partials      101      101           


@SamuelLarkin SamuelLarkin changed the title [WIP] dev.sl/534 resume Resume at the end of the last trained epoch Sep 17, 2024
@marctessier
Collaborator

Yes, confirming that the fine-tune checkpoint is resuming from the end of the previous run (50 steps ahead), versus how it was definitely overlapping before.

I will open a new ticket for the 50 steps ahead, but will close this since it is now resolved. :-)

@marctessier marctessier reopened this Sep 18, 2024
Collaborator

@marctessier marctessier left a comment


Looks good, Samuel.

