Fix issue of wrong number of tokens per GPUs affecting loss normalization in trainer.py #40610

SamuelBarryCS · 2025-09-02T05:25:27Z

What:

Fixes the issue reported in "Trainer.training_step incorrectly normalizes mean token loss when n_gpu > 1" #37474: when using more than one GPU, the loss is not normalized correctly.
Cause: each GPU sees num_items_in_batch set to the same constant, the total number of tokens (instead of total number of tokens / n_gpus).
Fix: set num_items_in_batch to the correct value before it is passed to training_step and compute_loss.
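
To make the diagnosis above concrete, here is a minimal sketch in plain Python. It is not taken from trainer.py: the function name `mean_of_replica_losses` and the toy per-token losses are hypothetical, and it assumes that each replica divides its summed token loss by num_items_in_batch and that the per-replica results are then averaged.

```python
# Toy model of the diagnosis: each replica normalizes its summed token loss
# by num_items_in_batch, then the per-replica losses are averaged.
# Assumption: this mirrors the behavior described in the PR, not actual Trainer code.

def mean_of_replica_losses(per_replica_token_losses, num_items_in_batch):
    """Normalize each replica's summed loss, then average across replicas."""
    replica_losses = [
        sum(tokens) / num_items_in_batch for tokens in per_replica_token_losses
    ]
    return sum(replica_losses) / len(replica_losses)


# Hypothetical per-token losses: two replicas, two tokens each.
token_losses = [[2.0, 4.0], [6.0, 8.0]]
n_gpu = len(token_losses)
total_tokens = sum(len(tokens) for tokens in token_losses)

# Reference value: mean loss over all tokens, as a single GPU would compute it.
true_mean = sum(sum(tokens) for tokens in token_losses) / total_tokens

# Every replica divides by the total token count: result is too small by n_gpu.
old = mean_of_replica_losses(token_losses, total_tokens)

# Every replica divides by total_tokens / n_gpu: result matches the true mean.
new = mean_of_replica_losses(token_losses, total_tokens / n_gpu)

print(f"true mean={true_mean}, old={old}, new={new}")
```

Under these assumptions, `old` comes out as `true_mean / n_gpu`, while `new` recovers `true_mean` even if the tokens are split unevenly across replicas.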

Test performed:

In addition to the updated trainer.py, I'm pushing 3 temporary files, to be deleted after review and before merging: test_simple_demo.py (a test script) and trainer_old.py / trainer_new.py (the versions of trainer.py before and after my commits, respectively, with logging added for testing purposes).
I ran the test with 1 and 2 GPUs to reproduce the results from "Trainer.training_step incorrectly normalizes mean token loss when n_gpu > 1" #37474. The output is as follows:

1 GPU - behavior of both implementations is correct:


Found 1 different GPUs
----------------------------------------
Testing OLD Trainer
----------------------------------------
Starting OLD trainer...
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.4988, 'train_samples_per_second': 60.144, 'train_steps_per_second': 4.01, 'train_loss': 10.143393516540527, 'epoch': 2.0}

----------------------------------------
Testing NEW Trainer
----------------------------------------
Starting NEW trainer...
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.2013, 'train_samples_per_second': 149.029, 'train_steps_per_second': 9.935, 'train_loss': 10.143393516540527, 'epoch': 2.0}

2 GPUs: the previous implementation (i.e. "old trainer") incorrectly normalizes the loss, whereas the new one fixes it:


Found 2 different GPUs
----------------------------------------
Testing OLD Trainer
----------------------------------------
Starting OLD trainer...
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 0}.
{'loss': 5.2161, 'grad_norm': 5.968019962310791, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 4.9273, 'grad_norm': 4.354814529418945, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 1.6086, 'train_samples_per_second': 37.3, 'train_steps_per_second': 1.243, 'train_loss': 5.071696758270264, 'epoch': 2.0}

----------------------------------------
Testing NEW Trainer
----------------------------------------
Starting NEW trainer...
{'loss': 10.4322, 'grad_norm': 11.936039924621582, 'learning_rate': 0.0001, 'epoch': 1.0}
{'loss': 9.8546, 'grad_norm': 8.70962905883789, 'learning_rate': 5e-05, 'epoch': 2.0}
{'train_runtime': 0.1692, 'train_samples_per_second': 354.581, 'train_steps_per_second': 11.819, 'train_loss': 10.143393516540527, 'epoch': 2.0}
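
As a quick sanity check on the logs above, the sketch below compares the 1-GPU values with the 2-GPU OLD trainer values. The numbers are copied from the logs; the variable names are only illustrative. A ratio of 2 matches n_gpu = 2, which is consistent with the diagnosis in this PR but does not by itself identify the root cause.

```python
import math

# (1-GPU value, 2-GPU OLD trainer value), copied from the logs above.
pairs = {
    "loss, epoch 1": (10.4322, 5.2161),
    "loss, epoch 2": (9.8546, 4.9273),
    "grad_norm, epoch 1": (11.936039924621582, 5.968019962310791),
    "grad_norm, epoch 2": (8.70962905883789, 4.354814529418945),
    "train_loss": (10.143393516540527, 5.071696758270264),
}

# Ratio of the single-GPU value to the 2-GPU OLD trainer value.
ratios = {name: one_gpu / two_gpu for name, (one_gpu, two_gpu) in pairs.items()}

for name, ratio in ratios.items():
    print(f"{name}: ratio = {ratio:.4f}")

# True if every logged quantity is scaled down by (approximately) n_gpu = 2.
all_halved = all(math.isclose(r, 2.0, rel_tol=1e-4) for r in ratios.values())
print("all halved:", all_halved)
```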

How to review:

Read the diff of trainer.py (ignore trainer_old.py and trainer_new.py, which are only there for testing purposes)
Run test_simple_demo.py on 1 and then n > 1 GPUs to confirm my test results

TODO/ Next:

NA

SamuelBarryCS · 2025-09-02T05:34:38Z

Hey @SunMarc - can you please review when you get the time?
Lint & tests will probably fail because of the 3 temporary files, but you can still review the implementation and testing, and I will fix these details afterwards.
Thanks!

SunMarc

Hey @SamuelBarryCS, thanks for the PR. Unfortunately this is not the right fix, but your PR helped me get started on this issue! I've opened PR #40799; maybe you can have a look.

SamuelBarryCS · 2025-09-11T04:52:12Z

Hey @SunMarc
Thanks for your comment! You're right, your PR seems to be addressing the true cause of the issue.
Feel free to tag me somewhere else if you want me to take a look at an issue, I have time to contribute these days.
Cheers!

SamuelBarryCS added 4 commits September 1, 2025 22:07

Fix issue of wrong number of tokens per GPUs

89afbb6

Push test files

ac778bb

Clean

a741158

Clean

2b311b3

SamuelBarryCS marked this pull request as ready for review September 2, 2025 05:33

github-actions bot requested review from ArthurZucker and Rocketknight1 September 2, 2025 05:33

SamuelBarryCS changed the title ~~[WIP] Fix issue of wrong number of tokens per GPUs affecting loss normalization~~ Fix issue of wrong number of tokens per GPUs affecting loss normalization Sep 2, 2025

SamuelBarryCS changed the title ~~Fix issue of wrong number of tokens per GPUs affecting loss normalization~~ Fix issue of wrong number of tokens per GPUs affecting loss normalization in trainer.py Sep 2, 2025

Improve comments

bb852df

SunMarc mentioned this pull request Sep 10, 2025

[Trainer] Fix DP loss #40799

Merged

SunMarc reviewed Sep 10, 2025

SamuelBarryCS closed this Sep 11, 2025
