
Conversation


@david6666666 david6666666 commented Aug 4, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Resolves issues #20805 and #24116; builds on PR #18343.
EPLB Execution

  • Parallelize the rearrangement algorithm (calculating the new expert mapping, not the communication)
  • Shuffle one layer at a time, over multiple steps, to lower the impact on inter-token latency
  • Investigate whether we should pre-allocate the expert weight buffer used for transfers
  • Take locality into account in expert weight transmission, e.g. prioritize transfers to GPUs on the same node
  • Use torch.cuda.Stream() to move the weights to the buffer asynchronously

The co-authors listed below jointly provided the initial code.
5 Key Pitfalls in EPLB Asynchronous Core Implementation

  1. CUDA Stream must be bound to Device:
    After creating an asynchronous thread, the initialized torch.cuda.Stream must be explicitly bound to the target GPU device. Otherwise, mismatched devices will cause asynchronous task execution to fail.
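A toy stand-in for why this matters: a fresh thread does not inherit the parent's current device, so an unbound stream silently lands on device 0. `FakeStream` below is invented for illustration; the real fix is creating `torch.cuda.Stream(device=...)` (or calling `torch.cuda.set_device()` in the worker thread before creating the stream).

```python
import threading

# Illustrative model only: a stream created without an explicit device
# picks up the thread's current device, which defaults to 0 in a new
# thread -- the same failure mode as an unbound torch.cuda.Stream.

_thread_device = threading.local()

class FakeStream:
    def __init__(self, device=None):
        # Falls back to the thread-current device, like torch.cuda.Stream.
        self.device = device if device is not None else getattr(
            _thread_device, "dev", 0)

def worker(rank, out, explicit):
    if explicit:
        stream = FakeStream(device=rank)   # correct: bind explicitly
    else:
        stream = FakeStream()              # bug: lands on device 0
    out[rank] = stream.device

results_bad, results_good = {}, {}
for rank in (0, 1, 2, 3):
    for explicit, out in ((False, results_bad), (True, results_good)):
        t = threading.Thread(target=worker, args=(rank, out, explicit))
        t.start()
        t.join()

print(results_bad)   # {0: 0, 1: 0, 2: 0, 3: 0} -- every rank on device 0
print(results_good)  # {0: 0, 1: 1, 2: 2, 3: 3}
```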

  2. Separate move_to_buffer and move_to_workspace:
    move_to_buffer should run in the asynchronous thread to hide data-transfer latency. move_to_workspace copies cached weights into the working area; since this is an on-device operation (only ~300–400 µs), it must run synchronously, but its performance impact is minimal.
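The split can be modeled with plain threads (a sketch with invented helpers; the real operations are GPU transfers): only the expensive staging runs in the background, while the cheap on-device copy stays on the main thread so correctness never depends on the background thread's timing.

```python
import threading
import time

# Sketch: slow cross-GPU staging runs asynchronously; the fast
# on-device copy is kept synchronous.

def move_to_buffer(buffer, weights):
    time.sleep(0.01)               # stands in for slow P2P transfers
    buffer[:] = list(weights)

def move_to_workspace(workspace, buffer):
    workspace[:] = list(buffer)    # fast (~300-400 us) copy, kept sync

weights, buffer, workspace = [1, 2, 3], [], []
t = threading.Thread(target=move_to_buffer, args=(buffer, weights))
t.start()
# In the real flow, forward passes run here to hide the latency.
t.join()                           # then synchronize before the copy
move_to_workspace(workspace, buffer)
print(workspace)                   # [1, 2, 3]
```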

  3. Layer-level synchronization: wait for P2P completion:
    After move_to_buffer executes for each layer, all GPUs must wait until their P2P (point-to-point) send/receive operations finish. This ensures that subsequent steps won't fail due to missing or incomplete data.
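The per-layer wait can be sketched with futures standing in for P2P handles (names are hypothetical; with torch.distributed the handles would come from batch_isend_irecv and be awaited with req.wait()):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: every "P2P" handle for a layer must complete before the
# next layer's shuffle begins.

pool = ThreadPoolExecutor(max_workers=4)

def isend(dst_buffer, value):
    # Stands in for an async point-to-point send; returns a handle.
    return pool.submit(dst_buffer.append, value)

def shuffle_layer(layer, dst_buffer):
    reqs = [isend(dst_buffer, layer * 10 + i) for i in range(3)]
    for req in reqs:
        req.result()   # like req.wait(): block until the transfer is done
    # Only now is dst_buffer guaranteed complete for this layer.

buf = []
for layer in range(2):
    shuffle_layer(layer, buf)
pool.shutdown(wait=True)
print(sorted(buf))     # [0, 1, 2, 10, 11, 12]
```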

  4. All Reduce must be coupled with rank synchronization to avoid deadlocks:
    While waiting for P2P operations, an All Reduce is required to ensure all ranks (GPUs) reach the same condition. This All Reduce must be executed at every rank synchronization point; otherwise, desynchronized progress across GPUs may cause deadlocks. To reduce overhead, the All Reduce is performed only during the Rearrange phase, not in other stages.
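A minimal model of the agreement step (hypothetical helper; the real code would use torch.distributed.all_reduce, e.g. with ReduceOp.MIN over a readiness flag): each rank contributes a "my P2P is done" flag, and the reduction acts as a logical AND, so all ranks advance together instead of some entering a collective while others do not.

```python
# Sketch: all_reduce(MIN) over 0/1 flags == logical AND across ranks.
# Every rank computes the same result, so they all take the same branch.

def all_ranks_ready(local_flags):
    return min(int(f) for f in local_flags) == 1

print(all_ranks_ready([True, True, False, True]))  # False -> everyone waits
print(all_ranks_ready([True, True, True, True]))   # True  -> everyone advances
```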

  5. Update expert weights after each layer’s move_from_buffer:
    After move_from_buffer at each layer, expert weights must be updated. Otherwise, computations may use stale weights, causing mapping inconsistencies and leading to accuracy issues.
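The ordering constraint can be sketched as follows (all names here are invented for illustration): the logical-to-physical expert mapping is swapped only after move_from_buffer completes for that layer, so routing never points at weights that are still in flight.

```python
# Sketch: update the expert mapping only after the weights have
# actually landed in the workspace for this layer.

def move_from_buffer(workspace, buffer):
    workspace.update(buffer)

def finish_layer(layer, workspace, buffer, mapping, new_mapping):
    move_from_buffer(workspace, buffer)
    mapping[layer] = new_mapping[layer]   # swap mapping only after the copy

workspace, buffer = {}, {"e0": "new_w0"}
mapping, new_mapping = {0: {"e0": 3}}, {0: {"e0": 7}}
finish_layer(0, workspace, buffer, mapping, new_mapping)
print(workspace, mapping)  # {'e0': 'new_w0'} {0: {'e0': 7}}
```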

Test Plan

pytest tests/distributed/test_eplb_execute.py
pytest tests/distributed/test_eplb_spec_decode.py

Test Result

Benchmark: TP=8, EP=8, 8×H800 80 GB; dataset: MMLU-Pro.
Due to resource limitations, we only measured up to EP8; a larger EP size with non-blocking EPLB would yield greater benefits.
Serving commands (w/o EPLB, w/ EPLB on main, and w/ async EPLB from this PR, respectively):

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
     --tensor-parallel-size 8 \
     --enable-expert-parallel \
     --gpu-memory-utilization 0.8 \
     --trust-remote-code

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
     --tensor-parallel-size 8 \
     --enable-expert-parallel \
     --enable-eplb \
     --eplb-config '{"window_size":1000,"step_interval":3000}' \
     --gpu-memory-utilization 0.8 \
     --trust-remote-code

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
     --tensor-parallel-size 8 \
     --enable-expert-parallel \
     --enable-eplb \
     --eplb-config '{"window_size":1000,"step_interval":3000,"use_async":"true"}' \
     --gpu-memory-utilization 0.8 \
     --trust-remote-code

accuracy:

python tests/evals/gsm8k/gsm8k_eval.py --port 8000
w/o EPLB:
Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:07<00:00, 19.67it/s]
Results:
Accuracy: 0.901
Invalid responses: 0.000    
Total latency: 67.070 s     
Questions per second: 19.666

w/ EPLB, main:
Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:22<00:00, 16.00it/s]
Results:
Accuracy: 0.902
Invalid responses: 0.000    
Total latency: 82.434 s     
Questions per second: 16.001

w/ EPLB, this PR:
Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:20<00:00, 16.46it/s]
Results:
Accuracy: 0.908
Invalid responses: 0.000    
Total latency: 80.133 s     
Questions per second: 16.460


async EPLB + MTP:

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
     --max-model-len 2048 \
     --trust-remote-code --port 8000 \
     -dp 1 -tp 4 \
     --enable-expert-parallel \
     --gpu-memory-utilization 0.86 \
     --enable-eplb \
     --eplb-config '{"window_size":200,"step_interval":600,"use_async":"true"}' \
     --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
     2>&1 | tee ../prof/test_async_mtp.log

Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|████████████████████████████████████████████████████████████████| 1319/1319 [01:34<00:00, 14.00it/s]

Results:
Accuracy: 0.864
Invalid responses: 0.000
Total latency: 94.256 s
Questions per second: 13.994

benchmark:

vllm bench serve  \
--backend vllm   \
--model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8   \
--endpoint /v1/completions   \
--dataset-name custom   \
--dataset-path /root/.cache/cwq/datasets/mmlu_pro_test.jsonl   \
--max-concurrency 64 --num-prompt 1000
w/o EPLB:
============ Serving Benchmark Result ============
Successful requests:                     1000
Maximum request concurrency:             64
Benchmark duration (s):                  109.47
Total input tokens:                      69309
Total generated tokens:                  231400
Request throughput (req/s):              9.13
Output token throughput (tok/s):         2113.74
Peak output token throughput (tok/s):    2367.00
Peak concurrent requests:                112.00
Total Token throughput (tok/s):          2746.85
---------------Time to First Token----------------
Mean TTFT (ms):                          93.20
Median TTFT (ms):                        61.91
P99 TTFT (ms):                           310.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.11
Median TPOT (ms):                        29.18
P99 TPOT (ms):                           29.96
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.10
Median ITL (ms):                         28.52
P99 ITL (ms):                            38.27
==================================================

w/ EPLB, main:
============ Serving Benchmark Result ============
Successful requests:                     1000
Maximum request concurrency:             64
Benchmark duration (s):                  131.87
Total input tokens:                      69309
Total generated tokens:                  231534
Request throughput (req/s):              7.58
Output token throughput (tok/s):         1755.72
Peak output token throughput (tok/s):    2068.00
Peak concurrent requests:                113.00
Total Token throughput (tok/s):          2281.29
---------------Time to First Token----------------
Mean TTFT (ms):                          97.56
Median TTFT (ms):                        69.89
P99 TTFT (ms):                           348.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.04
Median TPOT (ms):                        33.06
P99 TPOT (ms):                           63.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.07
Median ITL (ms):                         32.17
P99 ITL (ms):                            45.01
==================================================

w/ EPLB, this PR:
============ Serving Benchmark Result ============
Successful requests:                     1000
Maximum request concurrency:             64
Benchmark duration (s):                  129.85
Total input tokens:                      69309
Total generated tokens:                  230828
Request throughput (req/s):              7.70
Output token throughput (tok/s):         1777.61
Peak output token throughput (tok/s):    2047.00
Peak concurrent requests:                114.00
Total Token throughput (tok/s):          2311.35
---------------Time to First Token----------------
Mean TTFT (ms):                          100.83
Median TTFT (ms):                        71.17
P99 TTFT (ms):                           350.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.47
Median TPOT (ms):                        33.27
P99 TPOT (ms):                           46.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.51
Median ITL (ms):                         32.28
P99 ITL (ms):                            45.79
==================================================

profile:
non-blocking EPLB: [profile trace screenshot]

(Optional) Documentation Update

Co-authored-by: jiangkuaixue123 [email protected]
Co-authored-by: SunChenxiang123 [email protected]

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an asynchronous, layer-by-layer execution optimization for EPLB, which is a great step towards reducing latency. The core idea is solid, but the implementation has several critical correctness issues, mainly related to the new asynchronous logic. I've identified bugs like undefined variables, incorrect method calls, and improper use of asyncio primitives. Addressing these will be crucial for the stability of this feature. I've also pointed out some maintainability improvements, such as translating comments and using the standard logger.


github-actions bot commented Aug 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


@hmellor hmellor left a comment


Thank you for the optimisation.

Please wait for #20562 and add the new config to the new EPLBConfig class.


david6666666 commented Aug 6, 2025

Thank you for the optimisation.

Please wait for #20562 and add the new config to the new EPLBConfig class.

OK — once this PR and its test results are ready, I will rebase and add the new config to the new EPLBConfig class.


mergify bot commented Aug 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @david6666666.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 7, 2025
@david6666666
Contributor Author

@hmellor @abmfy the implementation is ready, PTAL, thanks. I have waited for #20562, added the new config to the new EPLBConfig class, and rebased.


@jiangkuaixue123 jiangkuaixue123 left a comment


Please check the above issues. Thank you.

@david6666666 david6666666 changed the title [Performance] EPLB Execution Optimization [Performance][EPLB] EPLB Execution Optimization Aug 22, 2025
david6666666 and others added 15 commits November 24, 2025 10:50
Signed-off-by: David Chen <[email protected]>

Co-authored-by: jiangkuaixue123 <[email protected]>

Co-authored-by: SunChenxiang123 <[email protected]>
@david6666666
Contributor Author

@simon-mo CI has passed, thanks for your support

Member

@tlrmchlsmth tlrmchlsmth left a comment


We should add coverage for async eplb to our nightly tests -- could be a follow up PR, but we should add it to the following DSv2 lite test (and the Qwen test when eplb gets turned on)

- label: DeepSeek V2-Lite Accuracy
  timeout_in_minutes: 60
  gpu: h100
  optional: true
  num_gpus: 4
  working_dir: "/vllm-workspace"
  commands:
    - bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_eplb.sh 0.25 200 8010
- label: Qwen3-30B-A3B-FP8-block Accuracy
  timeout_in_minutes: 60
  gpu: h100
  optional: true
  num_gpus: 4
  working_dir: "/vllm-workspace"
  commands:
    - bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep.sh 0.8 200 8020

Comment on lines +324 to +329
assert num_physical_experts == ep_size * num_local_physical_experts
# A buffer to hold the expert weights in one layer during the exchange.
# NOTE: Currently we assume the same weights across different layers
# have the same shape.

is_unchanged, is_received_locally, experts_recv_loc = move_to_buffer(
Member


It's not clear to me what this comment is describing

Suggested change
assert num_physical_experts == ep_size * num_local_physical_experts
# A buffer to hold the expert weights in one layer during the exchange.
# NOTE: Currently we assume the same weights across different layers
# have the same shape.
is_unchanged, is_received_locally, experts_recv_loc = move_to_buffer(
assert num_physical_experts == ep_size * num_local_physical_experts
# A buffer to hold the expert weights in one layer during the exchange.
# NOTE: Currently we assume the same weights across different layers
# have the same shape.
is_unchanged, is_received_locally, experts_recv_loc = move_to_buffer(

Contributor Author


Useless comment; it will be removed in a follow-up PR.

@tlrmchlsmth tlrmchlsmth merged commit 2601f18 into vllm-project:main Nov 24, 2025
54 checks passed
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
Signed-off-by: David Chen <[email protected]>
Co-authored-by: SunChenxiang123 <[email protected]>
Signed-off-by: Runkai Tao <[email protected]>
MatthewBonanni pushed a commit to MatthewBonanni/vllm that referenced this pull request Nov 24, 2025
@david6666666
Contributor Author

We should add coverage for async eplb to our nightly tests -- could be a follow up PR, but we should add it to the following DSv2 lite test (and the Qwen test when eplb gets turned on)


Thanks, I will open a follow-up PR to cover this.

bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

Labels

eplb · ready · v1
