Register allgather/reducescatter buffers with symm memory #8934
nvcastet wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes
Hello @nvcastet, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented changes to optimize distributed communication operations, specifically AllGather and ReduceScatter, by registering their buffers with symmetric memory. This optimization is particularly beneficial for maximum-throughput configurations like data-parallel attention, and my benchmarks show a 4.7% end-to-end speedup.
Highlights
- Enhanced Symmetric Memory Management: I've refined the symmetric memory allocation mechanism by introducing a new set to track registered tensor data pointers and a utility function to check if a tensor resides in symmetric memory. The use_symmetric_memory context manager now offers more granular control with disabled and disable_war flags, allowing for conditional application and specific workarounds for older PyTorch versions.
- Optimized Distributed Operations: I've updated the core distributed communication functions (all_reduce, reduce_scatter_tensor, _all_gather_into_tensor) to intelligently leverage symmetric memory when available, falling back to standard PyTorch distributed operations otherwise. This ensures that memory-registered tensors benefit from faster communication paths.
- Strategic Symmetric Memory Integration: I've strategically applied symmetric memory registration across various model layers, including layernorm outputs, gathered_buffer allocations, linear layer outputs, and embedding lookups. This integration is often conditional, enabling symmetric memory only when it provides a performance benefit (e.g., when tensor parallelism is active or specific padding modes are used), and can be explicitly disabled via new parameters.
- Refined Attention and MoE Logic: I've added checks to attention-related operations (reduce_scatter_tensor, all_gather_into_tensor) to ensure they only execute when tensor parallelism is greater than one, preventing redundant operations. Notably, I've also removed symmetric memory usage from MoE layer forward passes, indicating a targeted application of this optimization.
- Performance Validation: The changes are backed by benchmark results demonstrating a 4.7% end-to-end speedup, validating the effectiveness of registering allgather/reducescatter buffers with symmetric memory for high-throughput scenarios.
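The pointer-tracking scheme described in the highlights can be pictured roughly as follows. This is a hypothetical sketch, not the PR's actual code: the names `_symm_mem_ptrs`, `use_symmetric_memory`, and `is_symmetric_memory_tensor` echo the discussion, but the real implementation lives in sglang's device-communicator code and operates on CUDA tensor storage allocated from an NCCL memory pool.

```python
# Hypothetical sketch of the tracking mechanism described above.
# Real code tracks torch.Tensor storage pointers allocated from an NCCL
# memory pool; here plain integers stand in for device pointers.
_symm_mem_ptrs: set[int] = set()


def is_symmetric_memory_tensor(data_ptr: int) -> bool:
    """True if this allocation was registered with symmetric memory."""
    return data_ptr in _symm_mem_ptrs


class use_symmetric_memory:
    """Allocations tagged inside this context count as symmetric memory.

    The `disabled` flag mirrors the PR's conditional application: when
    set, tagging becomes a no-op and collectives later fall back to the
    default communication path.
    """

    def __init__(self, disabled: bool = False):
        self.disabled = disabled

    def __enter__(self):
        return self

    def tag(self, data_ptr: int) -> None:
        # Book-keeping only: mark this allocation as symmetric.
        if not self.disabled:
            _symm_mem_ptrs.add(data_ptr)

    def __exit__(self, exc_type, exc, tb):
        return False
```

A collective can then consult `is_symmetric_memory_tensor` to decide whether the fast path is applicable for a given buffer.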
Code Review
This pull request introduces optimizations for AllGather and ReduceScatter operations by leveraging symmetric memory, which shows a notable performance improvement. The implementation is generally solid, replacing monkey-patching with a cleaner tracking mechanism for symmetric memory tensors and adding flags for better control. However, I've identified a potential issue in python/sglang/srt/models/deepseek_v2.py where a tensor allocated in symmetric memory is not being tagged correctly. This would cause the subsequent collective communication to fall back to a less optimal path, negating the intended performance gain. Addressing this should ensure the full benefit of the optimization is realized.
The final_hidden_states_out tensor is allocated within the use_symmetric_memory context, but it is not tagged using sm.tag(). This means it won't be recognized as a symmetric memory tensor, and subsequent operations like tensor_model_parallel_all_reduce will not use the optimized pynccl path. This seems to undermine the performance benefits of using symmetric memory here.
A similar issue is present in the forward_normal method as well.
```diff
 with use_symmetric_memory(
     parallel_state.get_tp_group(), disabled=disable_symmetric_memory
 ) as sm:
     final_hidden_states_out = torch.empty_like(final_hidden_states)
+    sm.tag(final_hidden_states_out)
     torch.add(final_hidden_states, shared_output, out=final_hidden_states_out)
     final_hidden_states = final_hidden_states_out
```
Yes, it is tagged a few lines below.
Why do you tag the tensor outside of the with scope?
The with scope makes all PyTorch allocations under it come from the symmetric memory pool.
The .tag(tensor) call is just book-keeping to flag a tensor that has been allocated via symmetric memory, so that when we call a collective on it we select NCCL for best perf instead of the alternative custom kernels.
But I can move it there if that is clearer for people and AIs. :)
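The selection logic described in this reply, use NCCL for tensors flagged as symmetric and otherwise fall back to the custom kernels, might be sketched as follows. The names here are illustrative only, not the PR's actual functions:

```python
def dispatch_all_reduce(data_ptr, symm_ptrs, nccl_all_reduce, fallback_all_reduce):
    """Illustrative dispatch: flagged symmetric-memory buffers take the
    NCCL fast path; anything else uses the default implementation
    (e.g. custom all-reduce kernels)."""
    if data_ptr in symm_ptrs:
        return nccl_all_reduce(data_ptr)
    return fallback_all_reduce(data_ptr)
```

In the real code the same decision would be made inside `all_reduce`, `reduce_scatter_tensor`, and `_all_gather_into_tensor` on actual tensors rather than raw pointers.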
Does the input need to be in symmetric memory too?
Yes, good point, I should probably check for that too.
I think line 363, if get_tensor_model_parallel_world_size() == get_attention_dp_size():, is equivalent to get_attention_tp_size() == 1, since attn_tp_size = tp_size // dp_size. So we shouldn't need this change.
Maybe change line 363 to get_attention_tp_size() == 1 so it's more clear?
Is disable_war used here because this is outside of the cuda graph?
Those allocations are associated with stream zero, which conflicts with the WAR for pre-2.8 PyTorch.
For 2.8 and beyond, it is a no-op.
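The version gate implied here, applying the workaround only before PyTorch 2.8, could be expressed with a small helper. This is a hypothetical sketch, not code from the PR:

```python
def stream_war_needed(torch_version: str) -> bool:
    """Illustrative gate: the workaround (WAR) applies only before
    PyTorch 2.8; on 2.8+ it is a no-op, as noted above.

    Handles version strings like "2.7.1" or "2.8.0+cu121" by comparing
    only the (major, minor) components.
    """
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    return (major, minor) < (2, 8)
```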
Not sure if we need to touch the deepep path.
Yes, probably not, since I did not test it.
It would be nice if we could get this from a helper function to avoid needing to pass it everywhere. We have forward_batch here but we would also need the layer_communicator in order to check should_use_reduce_scatter...
I agree. Maybe adding an attribute to forward_batch? But in another PR review, it was suggested to avoid passing a big object like forward_batch, so that what the function accesses is explicit.
You may refer to our recent refactor in dp_attention.py. We prepared some util functions to avoid tedious coding efforts. A similar util function can be prepared for symmetric_memory.
Thanks, let me have a look.
Force-pushed from a8050e5 to b2bdbcf
merrymercy
left a comment
This PR changed the model forward code too much. Some changes are not intuitive (e.g., which tensor to tag, which region to wrap with use_symmetric_memory).
Is it possible to make it more transparent, so that we get this feature without changing any model forward code (even logits_processor.py)?
> Why do you tag the tensor outside of the with scope?

Thanks for taking the time to review, @merrymercy! Another option is to ignore code sections where the perf gain is small. I believe I could ignore comms inside … @merrymercy, could you provide some guidance on the direction to take? As a reference, here is the MNNVL NCCL allreduce standalone perf gain using this feature:
```python
)
else:
    self.gathered_buffer = torch.empty_like(self.gathered_buffer)
with use_symmetric_memory(get_tp_group()) as sm:
```
When is symmetric memory efficient? I'm a little bit confused here. Why don't we check the padding mode here?
It should always be more efficient. The problem is that we don't want to allocate buffers that are not symmetric across GPUs (otherwise we get undefined behavior), which can happen when we use the symmetric context manager across a big code section with DP attention.
Here, specifically, the only allocation under the context manager is self.gathered_buffer, which always has the same size across GPUs.
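The safety condition in this reply, that every rank must make an identical allocation or behavior is undefined, can be stated as a simple invariant. This helper is hypothetical, purely to make the rule concrete:

```python
def symmetric_allocation_safe(sizes_per_rank: list[int]) -> bool:
    """Illustrative invariant: a symmetric-memory allocation is safe
    only if every GPU allocates exactly the same size; any mismatch
    across ranks would be undefined behavior."""
    return len(set(sizes_per_rank)) == 1
```

This is why a fixed-size buffer like `self.gathered_buffer` is a safe candidate, while a large DP-attention code region with data-dependent allocations is not.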
trevor-m
left a comment
Thanks, overall looks good, just some minor questions.
Force-pushed from ba38114 to f08a67f
```python
def is_symmetric_memory_tensor(tensor: torch.Tensor):
    if not is_symmetric_memory_enabled():
        return False
    for segment in get_nccl_mem_pool().snapshot():
```
It's great that we no longer need to tag the tensors, but I'm concerned about the efficiency of this check. Collecting the snapshot seems to do a lot, and then we also have to iterate over all segments/blocks.
- Do you know if basic caching/memoization in this method would work, using tensor.untyped_storage.data_ptr() as the key? Will the storage change from iteration to iteration or remain consistent?
- If that doesn't work, could you measure how long this check takes?
- It would change depending on what gets allocated in the pool, but we could at least cache the snapshot at the exit of the context manager.
- Yes, I could time the call.
133 us (not cached) vs 5 us (cached).
So I made the changes to cache it at context-manager exit.
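The caching fix described here, rebuilding the membership set only when the context manager exits so that per-collective checks become O(1) lookups, might look roughly like this. Names are illustrative, not the PR's actual classes:

```python
class SnapshotCache:
    """Illustrative sketch of the caching fix: refresh the snapshot once
    at use_symmetric_memory exit, so each membership check is a cheap set
    lookup instead of a full pool-snapshot walk (~133 us vs ~5 us in the
    PR's measurement)."""

    def __init__(self):
        self._live_ptrs = set()      # what a fresh pool snapshot would report
        self._cached = frozenset()   # snapshot taken at context-manager exit

    def allocate(self, ptr):
        # New allocation in the pool; not visible to contains() until refresh().
        self._live_ptrs.add(ptr)

    def refresh(self):
        # Called once, when the use_symmetric_memory context exits.
        self._cached = frozenset(self._live_ptrs)

    def contains(self, ptr):
        # Fast path used by the collectives on every call.
        return ptr in self._cached
```

The trade-off is that allocations made outside any context-manager exit are not visible until the next refresh, which is acceptable here because registrations only happen inside the context.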
@merrymercy Do you mind having another look?

Closing in favor of #9358
Motivation
Speed up AllGather and ReduceScatter for max-throughput configs (DP attention).
Benchmark & Profiling
E2E Speedup: 4.7%
Baseline
Server:
Client:
With this PR
Checklist