Correctly support resuming from checkpoint with a dataset without length #33544

muupan · 2024-09-17T15:56:00Z

What does this PR do?

There is an inconsistency in Trainer's behavior between training from scratch and resuming from a checkpoint when the given dataset has no length, such as datasets.IterableDataset. For a reproducible example, see #26413 (comment). This PR fixes the inconsistency by correctly supporting resuming from a checkpoint with such a dataset.

Fixes #26413

Current behavior

When training starts with a dataset without length, Trainer assumes one epoch is equal to max_steps steps and tries to train for that many steps. There are two possible scenarios.

A. If the dataset yields enough samples, training finishes exactly after one epoch.
B. If the dataset raises StopIteration before yielding enough samples for max_steps steps, Trainer increments the current epoch and re-iterates the dataset.
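
The two scenarios can be illustrated with a toy loop. This is a minimal sketch in plain Python, not the Trainer implementation: `make_stream` and `train_from_scratch` are hypothetical names, and the dataset is assumed to replay the same batches on every pass.

```python
from typing import Callable, Iterator, List, Tuple


def make_stream(num_batches: int) -> Callable[[], Iterator[int]]:
    """Hypothetical helper: factory for a fresh iterator that has no __len__."""

    def factory() -> Iterator[int]:
        yield from range(num_batches)

    return factory


def train_from_scratch(
    new_iterator: Callable[[], Iterator[int]], max_steps: int
) -> List[Tuple[int, int]]:
    """Return the (epoch, batch) pair consumed at each training step."""
    history: List[Tuple[int, int]] = []
    epoch = 0
    while len(history) < max_steps:
        steps_before = len(history)
        for batch in new_iterator():
            history.append((epoch, batch))
            if len(history) >= max_steps:
                break
        if len(history) == steps_before:
            raise ValueError("the dataset yielded no batches")
        # The pass ended (exhausted or max_steps reached): advance the epoch.
        epoch += 1
    return history


# Scenario A: 10 batches available, 4 steps requested, so one pass suffices.
scenario_a = train_from_scratch(make_stream(10), max_steps=4)
# Scenario B: only 3 batches per pass, 7 steps requested, so passes 0, 1, 2 are used.
scenario_b = train_from_scratch(make_stream(3), max_steps=7)
```

In scenario B the history crosses epoch boundaries, which is the part that resuming has to reproduce.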

When resuming from a checkpoint, Trainer simply skips the first batches until it reaches the global_step of the checkpoint. In scenario A, there is no problem. In scenario B, the dataset raises StopIteration during the skipping, but Trainer does not re-iterate the dataset. Instead, it just finishes training with a warning. This is inconsistent with what happens when training from scratch, and it contradicts what the documentation for max_steps says:

transformers/src/transformers/training_args.py

Lines 301 to 304 in ac5a055

    
    max_steps (`int`, *optional*, defaults to -1):
        If set to a positive number, the total number of training steps to perform. Overrides `num_train_epochs`.
        For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until
        `max_steps` is reached.
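
The early stop in scenario B can be reproduced with a toy model of the skipping logic. The sketch below is a simplification under stated assumptions, not the Trainer code: `resume_single_pass` is a hypothetical name, and `itertools.islice` stands in for skipping the first batches of a single pass.

```python
from itertools import islice
from typing import Callable, Iterator, List, Tuple


def resume_single_pass(
    new_iterator: Callable[[], Iterator[int]], max_steps: int, resume_step: int
) -> Tuple[List[int], int]:
    """Skip `resume_step` batches within one pass, then train on the rest.

    Returns the batches trained on after resuming and the final step count.
    """
    global_step = resume_step  # restored from the checkpoint
    trained: List[int] = []
    for batch in islice(new_iterator(), resume_step, None):
        if global_step >= max_steps:
            break
        trained.append(batch)
        global_step += 1
    # No second pass: if the skip exhausted the iterator, nothing is trained.
    return trained, global_step


# Scenario A: the single pass is long enough, so resuming works.
ok = resume_single_pass(lambda: iter(range(10)), max_steps=7, resume_step=5)
# Scenario B: 3 batches per pass and a checkpoint at step 5, so the skip
# consumes the whole pass and training stops short of max_steps.
stuck = resume_single_pass(lambda: iter(range(3)), max_steps=7, resume_step=5)
```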

Solution

This PR modifies the skipping behavior so that Trainer now re-iterates the dataset until it catches up to global_step. A caveat is that it does not support the ignore_data_skip option, as Trainer does not know what epoch to start from. I am also concerned that the logic is becoming too complicated.
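
The idea of re-iterating while skipping can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `resume_with_reiteration` and `stream` are hypothetical names, and the dataset is assumed to yield the same batches in the same order on every pass.

```python
from typing import Callable, Iterator, List, Tuple


def resume_with_reiteration(
    new_iterator: Callable[[], Iterator[int]], max_steps: int, resume_step: int
) -> List[Tuple[int, int]]:
    """Return the (epoch, batch) pairs trained on after step `resume_step`.

    Batches up to `resume_step` are consumed but not trained on; when a pass
    ends during the skip, the epoch counter advances and a new pass starts.
    """
    trained: List[Tuple[int, int]] = []
    global_step = 0
    epoch = 0
    while global_step < max_steps:
        steps_before = global_step
        for batch in new_iterator():
            global_step += 1
            if global_step > resume_step:
                trained.append((epoch, batch))
            if global_step >= max_steps:
                break
        if global_step == steps_before:
            raise ValueError("the dataset yielded no batches")
        epoch += 1
    return trained


def stream() -> Iterator[int]:
    """Toy dataset without length: 3 batches per pass."""
    yield from range(3)


from_scratch = resume_with_reiteration(stream, max_steps=7, resume_step=0)
resumed = resume_with_reiteration(stream, max_steps=7, resume_step=5)
```

With `resume_step=0` the function reduces to training from scratch, so the resumed run can be compared directly against the tail of the uninterrupted one.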

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@ArthurZucker

LysandreJik · 2024-09-18T13:59:18Z

Very impressive PR @muupan!

I'm pinging @muellerzr and @SunMarc to take a look; Zach is off for a few weeks and will take a look as soon as he's back, thank you for your patience 🙏

SunMarc · 2024-09-27T16:10:43Z

Thanks for the PR @muupan! We will review it shortly. There is a new feature in accelerate that enables you to use a stateful dataloader, so that we don't need to iterate over the data to resume training. Feel free to give it a try; note that support is still very experimental.

muupan · 2024-10-31T17:01:13Z

It seems the code broke after rebasing onto main, where #34198 renamed the variable epoch_iterator. I will fix it.

SunMarc · 2024-11-05T14:42:28Z

Let us know when it is done !

SunMarc · 2025-06-03T10:17:29Z

Are you still up for finishing the PR, @muupan? Otherwise, I'll add it as a good second issue so the community can pick it up!

muupan · 2025-06-08T02:26:41Z

Hi, sorry for the long delay—I’m still interested in finishing this PR. I’ll try fixing it within a week.

muupan · 2025-06-09T13:38:12Z

@SunMarc I have fixed the rebasing error and added a small fix and comments. It is ready for review.

muupan · 2025-09-24T13:27:30Z

@SunMarc It has a conflict again (with #40347). I can work on it, but before doing so, can we check whether the strategy of this PR is OK and can be merged?

muupan mentioned this pull request Sep 17, 2024

resume_from_checkpoint function fails because "There seems to be not a single sample in your epoch_iterator" #26413

Closed

4 tasks

LysandreJik requested review from SunMarc and muellerzr and removed request for SunMarc September 18, 2024 13:59

muupan force-pushed the feature/resume-training-with-iterable-dataset branch from 6f83505 to bc56f1c Compare October 31, 2024 08:53

muupan force-pushed the feature/resume-training-with-iterable-dataset branch 3 times, most recently from e28dd5c to 82fb75f Compare June 9, 2025 13:12

muupan force-pushed the feature/resume-training-with-iterable-dataset branch from 82fb75f to c81d9e1 Compare June 12, 2025 07:27

Correctly support resuming with dataset without length

0181b10

muupan force-pushed the feature/resume-training-with-iterable-dataset branch from c81d9e1 to 0181b10 Compare June 20, 2025 08:31

Merge branch 'main' into feature/resume-training-with-iterable-dataset

4950720
