
Conversation

@joecummings
Member

@joecummings joecummings commented Feb 18, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

This PR adds support for the StatefulDataLoader class from the PyTorch library torchdata.

FAQs

  1. Why? This is necessary for resuming from checkpoints mid-epoch, a capability we will need for step-based checkpointing.
  2. Only full finetune single device? Yeah, this is more to see how we'll integrate this as a POC and make sure that the tests can pass. Once this is merged, I'll add support for the rest of the recipes.
  3. Hardcoding iterator_finished? Yeah, we'll change this when we actually move to step-based checkpointing. For now we expect to save on epoch boundaries, so if the epoch is cut short, the dataloader should restart its shuffling and data order as if the iterator had finished going through all the samples. Huge, huge thanks to @ramanishsingh for helping me debug this last issue.
  4. You removed the sampler?!! The StatefulDataLoader creates a batched random sampler of its own, so there's no need for us to create a new one in this context. Less code means my life is easier.
  5. WTH, you removed the check for max_steps is None. Won't that break? Nope, Python is dumb. Check it:
>>> 1 == None
False
>>> 0 == None
False
>>> 5 == None
False
>>> 1000000000000000 == None
False
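The REPL session above generalizes to the training loop: the step counter can be compared directly against max_steps, and the None case falls out for free because an int never equals None. A minimal hypothetical sketch (run_steps is illustrative, not the recipe's actual code):

```python
# Hypothetical sketch: comparing an int to None with == is always False,
# so a loop guarded by `step + 1 == max_steps` simply never stops early
# when max_steps is None -- no explicit `max_steps is None` check needed.
def run_steps(num_batches, max_steps=None):
    completed = 0
    for step in range(num_batches):
        completed += 1
        if step + 1 == max_steps:  # always False when max_steps is None
            break
    return completed

full = run_steps(10)                # runs all 10 batches
capped = run_steps(10, max_steps=3) # stops after 3
```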

Changelog

What are the changes made in this PR?

  • Import StatefulDataLoader and use it in place of DataLoader in the _setup_data method
  • Checkpoint the dataloader state dict
  • Load the dataloader state dict if we are trying to resume from the checkpoint
  • Update tests to match (yes the numbers changed b/c we are using a new random state - it's fine)
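The checkpoint/resume flow in the changelog can be sketched with a tiny stdlib stand-in that mimics the state_dict()/load_state_dict() interface of torchdata's StatefulDataLoader (the class, the "dataloader" key, and the batching logic here are all illustrative, not the recipe's code):

```python
# Minimal stand-in mimicking the StatefulDataLoader interface this PR relies
# on; the real class lives in torchdata.stateful_dataloader.
class TinyStatefulLoader:
    def __init__(self, data, batch_size):
        self.data = list(data)
        self.batch_size = batch_size
        self._pos = 0  # current position within the epoch

    def __iter__(self):
        while self._pos < len(self.data):
            batch = self.data[self._pos:self._pos + self.batch_size]
            self._pos += self.batch_size
            yield batch

    def state_dict(self):
        return {"pos": self._pos}

    def load_state_dict(self, sd):
        self._pos = sd["pos"]

loader = TinyStatefulLoader(range(6), batch_size=2)
it = iter(loader)
first = next(it)  # consume one batch mid-epoch

# Checkpoint: capture the loader's position alongside the rest of the state
# ("dataloader" stands in for training.DATALOADER_KEY)
ckpt = {"dataloader": loader.state_dict()}

# Resume: a fresh loader picks up exactly where the checkpoint left off
resumed = TinyStatefulLoader(range(6), batch_size=2)
resumed.load_state_dict(ckpt["dataloader"])
remaining = list(resumed)
```

The real StatefulDataLoader additionally restores sampler and RNG state, which is what makes mid-epoch resumption with shuffling work.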

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any of these, just ask and we will happily help. We also have a contributing page for guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

@pytorch-bot

pytorch-bot bot commented Feb 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2410

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c7decd7 with merge base 952078e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 18, 2025
@joecummings joecummings force-pushed the add-support-for-stateful-dl branch from 22e007c to fe0380d Compare February 18, 2025 20:07
batch_size=cfg.batch_size,
collate_fn=collate_name,
dataloader_state_dict=(
ckpt_dict[training.DATALOADER_KEY]
Collaborator
should check if this key even exists for BC

Member Author

This might just have to break BC b/c without this, the user will not be able to successfully resume training at any point.
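For reference, the backward-compatible lookup suggested above would be a small change; this is a hypothetical sketch (get_dataloader_state and the literal "dataloader" key are stand-ins for the recipe's ckpt_dict[training.DATALOADER_KEY] access):

```python
# Hypothetical BC-friendly variant of the key access discussed above.
DATALOADER_KEY = "dataloader"  # stand-in for training.DATALOADER_KEY

def get_dataloader_state(ckpt_dict):
    # dict.get returns None when a pre-PR checkpoint lacks the key, so old
    # checkpoints still load; the mid-epoch position is simply lost and
    # training restarts from the epoch boundary.
    return ckpt_dict.get(DATALOADER_KEY)

with_state = get_dataloader_state({"model": {}, "dataloader": {"pos": 40}})
without_state = get_dataloader_state({"model": {}})
```

Whether silently losing the mid-epoch position is acceptable, versus failing loudly as the merged code does, is exactly the trade-off debated in this exchange.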

num_replicas=1,
rank=0,
shuffle=shuffle,
seed=0,
Collaborator

so how is seed passed to StatefulDataLoader in this case?
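One plausible answer, assuming StatefulDataLoader mirrors torch.utils.data.DataLoader's constructor: DataLoader accepts a generator argument, so the seed can flow in as generator=torch.Generator().manual_seed(seed) rather than through an explicit sampler. A pure-stdlib stand-in of that seeded-shuffle pattern (make_shuffled_batches is hypothetical):

```python
# Stdlib stand-in for seeding a shuffled dataloader without a sampler;
# random.Random(seed) plays the role of torch.Generator().manual_seed(seed).
import random

def make_shuffled_batches(data, batch_size, seed):
    rng = random.Random(seed)  # seeded RNG drives the shuffle
    order = list(data)
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

run_a = make_shuffled_batches(range(8), 4, seed=0)
run_b = make_shuffled_batches(range(8), 4, seed=0)
# Same seed -> identical, reproducible batch order across runs
```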

@codecov-commenter

Codecov Report

Attention: Patch coverage is 5.88235% with 16 lines in your changes missing coverage. Please review.

Project coverage is 23.16%. Comparing base (e6cba25) to head (c7decd7).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
recipes/full_finetune_single_device.py 0.00% 11 Missing ⚠️
tests/recipes/test_full_finetune_single_device.py 0.00% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2410       +/-   ##
===========================================
- Coverage   63.87%   23.16%   -40.72%     
===========================================
  Files         368      379       +11     
  Lines       21873    22706      +833     
===========================================
- Hits        13971     5259     -8712     
- Misses       7902    17447     +9545     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@joecummings joecummings merged commit 7b654ea into meta-pytorch:main Feb 24, 2025
17 checks passed
joecummings added a commit to joecummings/torchtune that referenced this pull request Feb 27, 2025
joecummings added a commit to joecummings/torchtune that referenced this pull request Feb 27, 2025
pbontrager pushed a commit to pbontrager/torchtune that referenced this pull request Mar 17, 2025
pbontrager pushed a commit that referenced this pull request Mar 17, 2025