Optimize MoE Token Dispatch for Tensor Parallel Configurations #22993

skyloevil · 2025-08-15T17:16:41Z

Optimize MoE Token Dispatch for Tensor Parallel Configurations

Summary

This PR implements an optimization for MoE (Mixture of Experts) token dispatching in tensor parallel (TP) configurations to significantly reduce cross-rank communication overhead. The optimization achieves 2x to 8x reduction in communication by implementing leader-only token dispatching when TP > 1.

Problem

In the current implementation, when using tensor parallelism with MoE models, all DP (data parallel) ranks dispatch tokens independently, leading to redundant communication across ranks. This creates unnecessary overhead in distributed training and inference scenarios.

Solution

Core Changes

File: vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py

Added _get_effective_num_dispatchers() method:
- Calculates optimal number of dispatchers based on TP configuration
- Returns full dispatcher count for single TP (TP = 1)
- Returns proportional dispatcher count for leader ranks when TP > 1
- Returns 0 for non-leader ranks to eliminate redundant dispatching
Updated workspace_shapes() method:
- Integrates the dispatcher optimization into workspace calculation
- Ensures memory allocation reflects the optimized dispatch pattern

Algorithm Details

def _get_effective_num_dispatchers(self) -> int:
    if tp_size <= 1:
        return self.num_dispatchers  # Use all dispatchers
    
    if tp_rank == 0:  # Leader rank
        return max(1, self.num_dispatchers // tp_size)
    
    return 0  # Non-leader ranks don't dispatch

Performance Impact

TP Size	Communication Reduction	Dispatcher Allocation
TP = 1	1x (no change)	All dispatchers
TP = 2	2x reduction	Leader: 50%, Others: 0%
TP = 4	4x reduction	Leader: 25%, Others: 0%
TP = 8	8x reduction	Leader: 12.5%, Others: 0%

Benefits

Reduced Communication Overhead: Eliminates redundant token dispatching across TP ranks
Improved Scalability: Performance gains increase with higher TP parallelism
Backward Compatibility: No impact on single TP configurations or existing APIs
Memory Efficiency: Optimized workspace allocation based on actual dispatch needs

Implementation Features

Robust Edge Case Handling: Guarantees minimum 1 dispatcher for stability
Clear Documentation: Comprehensive docstrings explaining behavior
Efficient Logic Flow: Early return for simple cases, clear separation of concerns
Safe Calculations: Explicit boundary checks and defensive programming

Testing Considerations

The optimization maintains functional correctness while improving performance:

Single TP configurations work unchanged
Multi-TP configurations reduce communication without affecting model accuracy
Memory allocation scales appropriately with the optimization

github-actions · 2025-08-15T17:16:51Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request optimizes MoE token dispatching in tensor parallel configurations by restricting token dispatch to the leader rank. The implementation introduces a new method _get_effective_num_dispatchers to control the number of dispatchers based on the tensor parallel rank, which correctly reduces workspace allocation for non-leader ranks. The change is well-implemented and should deliver the described performance benefits. I have one suggestion to move a local import to the top level for better performance and code style.

gemini-code-assist · 2025-08-15T17:18:00Z

vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py

For improved performance and code clarity, it's recommended to move this import to the top of the file. Local imports can introduce overhead, especially if this method is called in a performance-sensitive path. Please remove the local import from this method and add from vllm.distributed import get_tensor_model_parallel_world_size, get_tensor_model_parallel_rank to the file-level imports.

ok，solved.

mgoin · 2025-08-15T21:27:25Z

cc @varun-sundar-rabindranath

varun-sundar-rabindranath · 2025-08-16T04:28:34Z

Hi @skyloevil . Thanks you for the fix. AFAICT, the TP ranks still participate in the all2alls no ? If that is the case, then we might end up in a spot where the workspaces aren't big enough to accommodate all the incoming tokens. Can you confirm that this doesn't happen.

Ways to test / debug:

Monitoring the expert_num_tokens in expert_tokens_meta in the apply call should give you a fair idea.
I usually test it with,

lm_eval --model local-completions --tasks gsm8k --model_args model=${MODEL},base_url=http://127.0.0.1:{PORT}/v1/completions,num_concurrent=30,max_retries=3 --limit 100

besides testing for accuracy, it is quite adept in catching corner cases.

Try setting VLLM_MOE_DP_CHUNK_SIZE to a low value like 8

If multiple TP ranks are involved in the all2alls the solution could be as simple as to make only TP=0 participate in the all2all. A slightly complicated but optimal solution would be to dispatch only a part of the tokens from each TP rank. Note that the second approach is required only for DeepEP all2all kernels. PPLX kernels do this automatically when TP > 1.

Also, can you share any perf numbers. Thanks 🙌

…h optimization - Add debug logs to track FP8 quantization method configuration and Deep GEMM support detection - Implement detailed logging in BatchedTritonOrDeepGemmExperts for initialization and runtime selection - Add verification logs for _get_effective_num_dispatchers method to validate tensor parallel dispatch optimization - Include environment-controlled logging (VLLM_LOG_MOE_DISPATCH) for PR vllm-project#22993 verification - Enable tracing of complete MoE expert selection pipeline from quantization to execution - All debug logs use appropriate log levels (DEBUG for detailed tracing, INFO for key verification points) These logs enable developers to: 1. Verify MoE dispatch optimization works correctly in TP > 1 scenarios 2. Trace why specific expert implementations are selected 3. Debug expert_num_tokens allocation and workspace sizing issues 4. Validate that leader/non-leader rank dispatch logic functions as expected Signed-off-by: zitian.zhao <[email protected]>

…llm-project#24362) Signed-off-by: Benjamin Chislett <[email protected]>

…m-project#24335) Signed-off-by: Mohan Kumar Kumar <[email protected]> Signed-off-by: mohankku <[email protected]>

Signed-off-by: Isotr0py <[email protected]>

Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]>

…roject#24324)

…x reasoning token count (vllm-project#24285) Signed-off-by: Ye (Charlotte) Qi <[email protected]>

…24370) Signed-off-by: elvischenv <[email protected]>

… scopes (vllm-project#24265) Co-authored-by: Bangsheng Tang <[email protected]>

Signed-off-by: Woosuk Kwon <[email protected]>

…m-project#24372) Signed-off-by: youkaichao <[email protected]>

Signed-off-by: Benji Beck <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…llm-project#24380) Signed-off-by: Aaron Pham <[email protected]>

Signed-off-by: zjy0516 <[email protected]> Signed-off-by: Jiangyun Zhu <[email protected]> Co-authored-by: Luka Govedič <[email protected]>

…4330) Signed-off-by: Saman Keon <[email protected]>

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

Signed-off-by: Jee Jee Li <[email protected]>

gemini-code-assist bot reviewed Aug 15, 2025

View reviewed changes

skyloevil force-pushed the optimize/moe-dispatch-efficiency branch 2 times, most recently from 6cad37f to 76782d4 Compare August 15, 2025 17:28

mgoin requested review from mgoin and tlrmchlsmth August 15, 2025 21:27

This comment was marked as outdated.

Sign in to view

skyloevil force-pushed the optimize/moe-dispatch-efficiency branch from 20bd6ed to 533759c Compare August 17, 2025 17:12

skyloevil requested review from robertgshaw2-redhat and yewentao256 as code owners August 30, 2025 07:31

benchislett and others added 17 commits September 13, 2025 12:40

Add @benchislett to codeowner for spec decode and structured outputs (v…

e4d4f14

…llm-project#24362) Signed-off-by: Benjamin Chislett <[email protected]>

[Bugfix] Avoid uninitialized usage of azp_val when AZP is false. (vll…

1aa47a2

…m-project#24335) Signed-off-by: Mohan Kumar Kumar <[email protected]> Signed-off-by: mohankku <[email protected]>

[Bugfix] Fix broken deepseek fp8 TP weights loading (vllm-project#24367)

cf41ffb

Signed-off-by: Isotr0py <[email protected]>

[Bugfix] Fix test_mixtral_moe (vllm-project#24371)

20b9ebc

Lora bias(enable_lora_bias) deprecate warning (vllm-project#24339)

8f8a223

Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]>

[Fix] [gpt-oss] fix non-tool calling path for chat completion (vllm-p…

8e8661c

…roject#24324)

[Frontend][Responses API] Support reporting tool output tokens and fi…

5c672b7

…x reasoning token count (vllm-project#24285) Signed-off-by: Ye (Charlotte) Qi <[email protected]>

[Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test (vllm-project#…

eaa48c1

…24370) Signed-off-by: elvischenv <[email protected]>

break execute_model in gpu_model_runner into sub-functions for custom…

4ef2101

… scopes (vllm-project#24265) Co-authored-by: Bangsheng Tang <[email protected]>

[V0 deprecation] Deprecate V0 Neuron backend (vllm-project#21159)

15ba556

Signed-off-by: Woosuk Kwon <[email protected]>

[attention][DCP] use AttentionImpl.need_to_return_lse_for_decode (vll…

a80d287

…m-project#24372) Signed-off-by: youkaichao <[email protected]>

Migrate Qwen2 inputs to TensorSchema (vllm-project#23475)

7ff6370

Signed-off-by: Benji Beck <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

[CI][Fix] deterministic seed for flaky CI runs on structured outputs (v…

8cac128

…llm-project#24380) Signed-off-by: Aaron Pham <[email protected]>

[Benchmark] add benchmark for custom activation op (vllm-project#23908)

5123beb

Signed-off-by: zjy0516 <[email protected]> Signed-off-by: Jiangyun Zhu <[email protected]> Co-authored-by: Luka Govedič <[email protected]>

QWEN3 Thinking Fused MoE kernels Optimization configs (vllm-project#2…

d93f13b

…4330) Signed-off-by: Saman Keon <[email protected]>

[Misc] collect flashinfer version in collect_env.py (vllm-project#24378)

9e8af02

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

[Bugfix] Fix Qwen3-coder moe tuned config (vllm-project#24072)

ebcecac

Signed-off-by: Jee Jee Li <[email protected]>

skyloevil requested review from 22quinn, ProExpertProg, alexm-redhat, chaunceyjiang, comaniac, houseroad, youkaichao, zhuohan123 and zou3519 as code owners September 13, 2025 04:41

github-project-automation bot added this to Tool Calling Sep 13, 2025

Merge branch 'main' into optimize/moe-dispatch-efficiency

6594d05

mergify bot removed the tpu Related to Google TPUs label Sep 13, 2025

skyloevil closed this Sep 13, 2025

github-project-automation bot moved this to Done in Tool Calling Sep 13, 2025

skyloevil mentioned this pull request Sep 14, 2025

[MoE] Optimize fused MoE dispatch and workspace allocation with TP leader restriction #24831

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Optimize MoE Token Dispatch for Tensor Parallel Configurations #22993

Optimize MoE Token Dispatch for Tensor Parallel Configurations #22993

Uh oh!

skyloevil commented Aug 15, 2025 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Aug 15, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Aug 15, 2025

Uh oh!

skyloevil Aug 15, 2025 •

edited

Loading

Uh oh!

mgoin commented Aug 15, 2025

Uh oh!

varun-sundar-rabindranath commented Aug 16, 2025 •

edited

Loading

Uh oh!

This comment was marked as outdated.

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

123 participants

Uh oh!

Optimize MoE Token Dispatch for Tensor Parallel Configurations #22993

Optimize MoE Token Dispatch for Tensor Parallel Configurations #22993

Uh oh!

Conversation

skyloevil commented Aug 15, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Optimize MoE Token Dispatch for Tensor Parallel Configurations

Summary

Problem

Solution

Core Changes

Algorithm Details

Performance Impact

Benefits

Implementation Features

Testing Considerations

Uh oh!

github-actions bot commented Aug 15, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

skyloevil Aug 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgoin commented Aug 15, 2025

Uh oh!

varun-sundar-rabindranath commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as outdated.

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

123 participants

skyloevil commented Aug 15, 2025 •

edited by github-actions bot

Loading

skyloevil Aug 15, 2025 •

edited

Loading

varun-sundar-rabindranath commented Aug 16, 2025 •

edited

Loading