
fix: avoid AccumulateGrad stream mismatch by using default stream for DDP init #21579

Open

s-zx wants to merge 1 commit into Lightning-AI:master from s-zx:fix/21567-accumulate-grad-stream-mismatch

Conversation

s-zx commented Mar 10, 2026

Summary

Fixes the AccumulateGrad stream mismatch warning when training with Fabric + DDP, especially with gradient accumulation (no_backward_sync). The warning occurred because DDP was initialized inside a torch.cuda.stream(torch.cuda.Stream()) context, which created AccumulateGrad nodes on a non-default stream that did not match subsequent forward/backward passes.

Root Cause

DDP setup used a custom CUDA stream context (introduced in PR #17334) when wrapping the model, which caused the AccumulateGrad nodes to be created on that non-default stream. When the backward pass then ran on the default stream (or vice versa), PyTorch emitted:

The AccumulateGrad node's stream does not match the stream of the node that produced the incoming gradient... To resolve the mismatch, ensure that DDP initialization is performed under the same stream as subsequent forwards.

Fix

Remove the custom stream context so DDP initialization runs on the default stream, matching subsequent forward/backward passes as recommended by PyTorch.

  • lightning_fabric/strategies/ddp.py: Remove torch.cuda.stream(torch.cuda.Stream()) context
  • lightning/pytorch/strategies/ddp.py: Same change
  • Update test to assert default stream is used (no custom stream)
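Since the PR diff itself is not shown above, here is a hedged sketch of what removing the stream context looks like at the strategy's model-wrapping step (the method name, argument names, and surrounding lines are illustrative, not the exact Lightning source):

```diff
--- a/lightning_fabric/strategies/ddp.py
+++ b/lightning_fabric/strategies/ddp.py
@@ def _setup_model(self, module):
-        # Wrapping inside a side stream records the AccumulateGrad nodes
-        # on that stream, mismatching later default-stream forward/backward.
-        with torch.cuda.stream(torch.cuda.Stream()):
-            return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
+        # Initialize DDP on the default (current) stream so the AccumulateGrad
+        # nodes match the stream used by subsequent forward/backward passes.
+        return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
```

The same removal applies in lightning/pytorch/strategies/ddp.py. No replacement context manager is needed: initializing DDP on whatever stream is current at wrap time (normally the default stream) is exactly what PyTorch's warning message recommends.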

Fixes #21567


📚 Documentation preview 📚: https://pytorch-lightning--21579.org.readthedocs.build/en/21579/

… DDP init

DDP was previously wrapped in torch.cuda.stream(torch.cuda.Stream()) which
caused the AccumulateGrad node to be created on a non-default stream. This
triggered PyTorch's stream mismatch warning when running backward, especially
with Fabric and gradient accumulation (no_backward_sync).

The fix removes the custom stream context so DDP initialization runs on the
default stream, matching subsequent forwards/backwards as recommended by
PyTorch's warning message.

Fixes Lightning-AI#21567
Signed-off-by: s-zx <[email protected]>
github-actions bot added the fabric (lightning.fabric.Fabric) and pl (Generic label for PyTorch Lightning package) labels on Mar 10, 2026
codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87%. Comparing base (283ce77) to head (015e645).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #21579   +/-   ##
=======================================
- Coverage      87%      87%   -0%     
=======================================
  Files         270      270           
  Lines       24078    24073    -5     
=======================================
- Hits        20863    20855    -8     
- Misses       3215     3218    +3     

