
@aeeeeeep
Contributor

@aeeeeeep aeeeeeep commented Oct 31, 2025

  • Perform allgather operations on individual parameters instead of on parameter lists.
  • Reduce peak memory usage in high-memory-pressure scenarios.
  • Improve performance by minimizing temporary buffer requirements.
  • Enable the behavior with a new boolean flag in the `zero_optimization` section:
{
  "zero_optimization": {
    "stage3_allgather_single_param": true
  }
}
  • The optimization is disabled by default.
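The memory argument above can be illustrated with a simplified model. It assumes (our assumption, not taken from the PR) that a coalesced allgather over a parameter list needs one temporary buffer holding every gathered parameter at once, while a per-parameter allgather only needs a buffer for the parameter currently in flight, freed before the next one. Under that model the peak temporary footprint drops from the sum of the gathered sizes to the largest single one. The function names are hypothetical and do not correspond to DeepSpeed internals.

```python
from typing import Sequence


def peak_temp_bytes_list(shard_numels: Sequence[int], world_size: int, bytes_per_elem: int) -> int:
    """Hypothetical model: one coalesced allgather over the whole parameter list.

    A single temporary buffer must hold the fully gathered copy of every
    parameter in the list at the same time.
    """
    return sum(n * world_size for n in shard_numels) * bytes_per_elem


def peak_temp_bytes_single(shard_numels: Sequence[int], world_size: int, bytes_per_elem: int) -> int:
    """Hypothetical model: one allgather per parameter, buffer freed before the next.

    Only the largest gathered parameter's buffer is ever live at once.
    """
    return max(n * world_size for n in shard_numels) * bytes_per_elem


if __name__ == "__main__":
    # Per-rank shard sizes (elements) for three parameters, 8 ranks, fp16 (2 bytes each).
    shards = [1_000, 4_000, 2_000]
    print(peak_temp_bytes_list(shards, world_size=8, bytes_per_elem=2))    # 112000
    print(peak_temp_bytes_single(shards, world_size=8, bytes_per_elem=2))  # 64000
```

The trade-off in this sketch is that the per-parameter variant issues one collective per parameter instead of a single coalesced call, so any end-to-end speedup likely depends on whether the lower allocation pressure outweighs the extra launch overhead; measurements like those requested below would settle this.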

@aeeeeeep aeeeeeep force-pushed the allgather_single_param branch from f580558 to 77a51f7 on October 31, 2025 13:53
@aeeeeeep aeeeeeep marked this pull request as draft November 1, 2025 06:21
@aeeeeeep aeeeeeep force-pushed the allgather_single_param branch from 3613b25 to d55f736 on November 1, 2025 16:26
@aeeeeeep aeeeeeep marked this pull request as ready for review November 1, 2025 16:27
@aeeeeeep aeeeeeep force-pushed the allgather_single_param branch from d55f736 to 30814fa on November 1, 2025 16:28
@sfc-gh-truwase
Collaborator

@aeeeeeep thanks for this contribution. Are you able to share some data showing the benefits of this optimization?

@aeeeeeep
Contributor Author

aeeeeeep commented Nov 1, 2025


Thanks for your feedback! I’ll share detailed data within the next few days.

@aeeeeeep aeeeeeep force-pushed the allgather_single_param branch from efecf04 to d6cd73d on November 1, 2025 17:38
aeeeeeep and others added 6 commits November 13, 2025 12:40
Signed-off-by: aeeeeeep <[email protected]>
Make it very clear that `TiledMLP`'s memory savings come at the cost of
recomputing the forward pass.

Signed-off-by: aeeeeeep <[email protected]>
…eepspeedai#7659)

fixes deepspeedai#7650

adding a `value.dim()>0` check to prevent slicing of 0-dim tensors

cc @sfc-gh-truwase

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: aeeeeeep <[email protected]>
Signed-off-by: aeeeeeep <[email protected]>
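The `value.dim()>0` guard from the commit fixing deepspeedai#7650 can be sketched as follows. This is a hypothetical helper, not the actual DeepSpeed change; it only shows why the check matters: PyTorch raises an `IndexError` when you slice a 0-dim (scalar) tensor, so scalars must bypass the slicing path.

```python
import torch


def take_prefix(value: torch.Tensor, n: int) -> torch.Tensor:
    """Hypothetical helper: return the first n elements, leaving 0-dim tensors untouched."""
    # Slicing a 0-dim tensor raises IndexError, so only slice when there is a dimension to slice.
    if value.dim() > 0:
        return value[:n]
    return value


scalar = torch.tensor(3.0)
vector = torch.arange(5)
print(take_prefix(scalar, 2))  # tensor(3.)
print(take_prefix(vector, 2))  # tensor([0, 1])
```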
@aeeeeeep aeeeeeep force-pushed the allgather_single_param branch from 52ae961 to c943e0e on November 13, 2025 12:40