
Conversation


@bogdansalyp bogdansalyp commented Jul 7, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Closes #2105. This is a widely requested feature that allows users to have greater control over checkpointing frequency in torchtune.

TODO: Add commentary on design decisions. Acknowledge spaghetti code. Beg forgiveness.

Changelog

  • Update `FullModelHFCheckpointer` to accept a `step` parameter when saving a checkpoint, and use that step to name the checkpoint folder. Keep `epoch_{}` as a fallback for backwards compatibility (BC).
  • Modify the `full_finetune_single_device.py` recipe to use step-based checkpointing.
  • Add tests for the `full_finetune_single_device.py` recipe with step-based checkpointing.
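The folder-naming behavior described above can be sketched roughly as follows. This is a minimal illustration; `checkpoint_dirname` is a hypothetical helper, not the actual torchtune API:

```python
from pathlib import Path
from typing import Optional


def checkpoint_dirname(output_dir: str, epoch: int, step: Optional[int] = None) -> Path:
    """Return the checkpoint folder path: step_{N} when a step is given,
    falling back to epoch_{N} for backwards compatibility."""
    name = f"step_{step}" if step is not None else f"epoch_{epoch}"
    return Path(output_dir) / name
```

Recipes that pass `step` get the new step-based layout; older callers that only pass `epoch` keep their existing folder names.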

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Evidence that the correct number of checkpoints is saved

(joe-torchtune) [[email protected] ~/projects/joe-torchtune (impl-step-based-ckpt)]$ ls /tmp/torchtune/llama3_2_1B/full_single_device/
step_100  step_125  step_150  step_175  step_200  step_25  step_50  step_75  torchtune_config.yaml
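With `save_every_n_steps: 25` and 200 total training steps, the expected folder set can be derived directly (note that `ls` sorts lexicographically, which is why `step_100` appears before `step_25` above):

```python
save_every_n_steps = 25
total_steps = 200

# A checkpoint lands at every multiple of save_every_n_steps up to total_steps.
expected = [
    f"step_{s}"
    for s in range(save_every_n_steps, total_steps + 1, save_every_n_steps)
]
print(len(expected), expected[0], expected[-1])  # 8 step_25 step_200
```

Eight step folders, matching the `ls` output.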

Evidence of correct resumption from checkpoint mid-epoch
[screenshot taken 2025-02-28 at 4:59:52 PM]

Evidence of correct resumption from checkpoint at an epoch boundary
[screenshot taken 2025-02-28 at 5:00:19 PM]

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

joecummings and others added 30 commits February 27, 2025 13:41
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: ebsmothers <[email protected]>
Co-authored-by: salman <[email protected]>
@bogdansalyp bogdansalyp marked this pull request as ready for review July 10, 2025 16:57
@codecov-commenter

Codecov Report

Attention: Patch coverage is 17.20648% with 409 lines in your changes missing coverage. Please review.

Project coverage is 59.24%. Comparing base (7d30d4c) to head (243a6ab).
Report is 4 commits behind head on main.

Files with missing lines                              | Patch % | Missing lines
torchtune/training/checkpointing/_checkpointer.py     | 49.49%  | 50
...htune/training/checkpointing/_checkpoint_client.py | 0.00%   | 47
recipes/full_finetune_single_device.py                | 0.00%   | 46
recipes/full_finetune_distributed.py                  | 0.00%   | 38
tests/recipes/test_full_finetune_single_device.py     | 21.62%  | 29
recipes/lora_finetune_single_device.py                | 0.00%   | 25
recipes/knowledge_distillation_single_device.py       | 0.00%   | 21
tests/recipes/test_qat_distributed.py                 | 19.23%  | 21
recipes/full_dpo_distributed.py                       | 0.00%   | 20
recipes/knowledge_distillation_distributed.py         | 0.00%   | 18
... and 16 more
Additional details and impacted files
@@            Coverage Diff             @@
##            main    #2869       +/-   ##
==========================================
+ Coverage   5.12%   59.24%   +54.12%     
==========================================
  Files        375      439       +64     
  Lines      22956    27407     +4451     
==========================================
+ Hits        1177    16238    +15061     
+ Misses     21779    11169    -10610     


@bogdansalyp bogdansalyp changed the title [DEBUG] Step-based checkpointing fixes Implement step based checkpointing Jul 15, 2025
@bogdansalyp bogdansalyp marked this pull request as draft July 15, 2025 18:12
@bogdansalyp bogdansalyp marked this pull request as ready for review July 15, 2025 18:40
@joecummings (Member) left a comment:
Mostly nits, just one question about "epoch" usage in checkpointer

model_type: LLAMA3_2
keep_last_n_checkpoints: 2
resume_from_checkpoint: False
save_every_n_steps: 25
Member:
These were just test values, can probably remove

loss:
_component_: torchtune.modules.loss.LinearCrossEntropyLoss
max_steps_per_epoch: null
max_steps_per_epoch: 100
Member:
Same here, just a test value


if adapter_only:
save_path = dcp_saver.output_dir
if dir_prefix == "step":
Member:
Might be worth a comment noting that this is fairly hacky because we need to maintain BC with epoch-based folder names, and that it could potentially be refactored.

adapter_only: bool = False,
single_device: bool = False,
*,
full_tensors: bool = True,
Member:
Maybe add a comment in the docstring explaining what full_tensors are and when you might want to use them.
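A docstring addition along the lines the reviewer suggests might read as follows. The wording is illustrative only, and the described semantics of `full_tensors` (gathering sharded weights before saving) are an assumption, not the merged text:

```python
def save_checkpoint(state_dict: dict, *, full_tensors: bool = True) -> None:
    """Save a model checkpoint.

    Args:
        full_tensors (bool): If True, gather sharded (DTensor) weights into
            full tensors before saving, so the checkpoint can be loaded
            outside the original distributed setup. If False, save the
            shards as-is, which is faster and lighter on memory but ties
            the checkpoint to the same parallelism layout. Default: True.
    """
    # Body elided; this sketch only illustrates the suggested docstring.
```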

_ = state_dict.pop(training.ADAPTER_CONFIG, None)
output_path = Path.joinpath(
self._output_dir, RECIPE_STATE_DIRNAME, "recipe_state.pt"
self._output_dir, f"epoch_{epoch}", "recipe_state.pt"
Member:
Why just epoch here?

@bogdansalyp (Collaborator, Author):
It's a debugger torchtune checkpointer, not a HF one

@bogdansalyp bogdansalyp merged commit e43b6e6 into meta-pytorch:main Jul 15, 2025
14 checks passed

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
