[EPLB] Optimize EPLB for Async Rearrange Experts #22179
Conversation
Code Review
This pull request introduces an asynchronous, layer-by-layer execution optimization for EPLB, which is a great step towards reducing latency. The core idea is solid, but the implementation has several critical correctness issues, mainly related to the new asynchronous logic. I've identified bugs like undefined variables, incorrect method calls, and improper use of asyncio primitives. Addressing these will be crucial for the stability of this feature. I've also pointed out some maintainability improvements, such as translating comments and using the standard logger.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
hmellor
left a comment
Thank you for the optimisation.
Please wait for #20562 and add the new config to the new EPLBConfig class.
OK, after I finish preparing this PR and the test results, I will rebase and add the new config to the new EPLBConfig class.
This pull request has merge conflicts that must be resolved before it can be merged.
jiangkuaixue123
left a comment
Please check the above issues. Thank you.
Signed-off-by: David Chen <[email protected]> Co-authored-by: jiangkuaixue123 <[email protected]> Co-authored-by: SunChenxiang123 <[email protected]>
Signed-off-by: David Chen <[email protected]>
@simon-mo CI has passed, thanks for your support.
tlrmchlsmth
left a comment
We should add coverage for async EPLB to our nightly tests -- it could be a follow-up PR, but we should add it to the following DSv2-Lite test (and the Qwen test when EPLB gets turned on):
vllm/.buildkite/test-pipeline.yaml
Lines 1334 to 1350 in 4de8786
```yaml
- label: DeepSeek V2-Lite Accuracy
  timeout_in_minutes: 60
  gpu: h100
  optional: true
  num_gpus: 4
  working_dir: "/vllm-workspace"
  commands:
    - bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_eplb.sh 0.25 200 8010
- label: Qwen3-30B-A3B-FP8-block Accuracy
  timeout_in_minutes: 60
  gpu: h100
  optional: true
  num_gpus: 4
  working_dir: "/vllm-workspace"
  commands:
    - bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep.sh 0.8 200 8020
```
```python
assert num_physical_experts == ep_size * num_local_physical_experts
# A buffer to hold the expert weights in one layer during the exchange.
# NOTE: Currently we assume the same weights across different layers
# have the same shape.

is_unchanged, is_received_locally, experts_recv_loc = move_to_buffer(
```
It's not clear to me what this comment is describing
Useless comment, it will be removed in a follow-up PR.
Signed-off-by: David Chen <[email protected]> Co-authored-by: SunChenxiang123 <[email protected]>
Signed-off-by: David Chen <[email protected]> Co-authored-by: SunChenxiang123 <[email protected]> Signed-off-by: Runkai Tao <[email protected]>
Thanks, I will open a follow-up PR to cover this.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
Purpose
Issues: #20805, #24116
PR: #18343
EPLB Execution
The co-authors listed at the end of this description jointly provided the initial code.
5 Key Pitfalls in EPLB Asynchronous Core Implementation
CUDA Stream must be bound to Device:
After creating an asynchronous thread, the initialized torch.cuda.Stream must be explicitly bound to the target GPU device. Otherwise, mismatched devices will cause asynchronous task execution to fail.
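A minimal sketch of the binding, assuming a dedicated background thread per rank (names like start_async_rearrange and device_index are illustrative, not the PR's actual API):

```python
import threading

import torch


def start_async_rearrange(device_index: int) -> threading.Thread:
    """Spawn a background thread that owns a stream pinned to `device_index`."""

    def worker():
        # Bind this thread to the target GPU first; without this the stream is
        # created on the thread's default device and the enqueued transfers
        # silently run against the wrong GPU.
        torch.cuda.set_device(device_index)
        stream = torch.cuda.Stream(device=device_index)
        with torch.cuda.stream(stream):
            ...  # enqueue layer-by-layer expert-weight transfers here
        stream.synchronize()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```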
Separate “move_to_buffer” and “move_to_workspace”:
move_to_buffer should be executed in asynchronous threads to hide data-transfer latency. move_to_workspace copies cached weights into the working area; since this is an on-device operation taking only ~300–400 µs, it is executed synchronously, and the performance impact is minimal.
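As a rough sketch of that split (the signatures of move_to_buffer and move_to_workspace are simplified assumptions, not the PR's exact ones):

```python
import torch


def rearrange_layer(layer_idx, transfer_stream, buffer, expert_weights):
    # Asynchronous part: stage the remote expert weights for this layer into
    # the buffer on a side stream so the P2P copies overlap with forward passes.
    with torch.cuda.stream(transfer_stream):
        is_unchanged, is_received_locally, experts_recv_loc = move_to_buffer(
            layer_idx, buffer, expert_weights)

    # Synchronous part: copying the buffered weights into the working tensors
    # is a purely local device copy (~300-400 us), so it runs on the default
    # stream once the staging copies are visible to it.
    torch.cuda.current_stream().wait_stream(transfer_stream)
    move_to_workspace(layer_idx, buffer, expert_weights,
                      is_unchanged, is_received_locally, experts_recv_loc)
```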
Layer-level synchronization: wait for P2P completion:
After each layer executes move to buffer, all GPUs must wait until their P2P (point-to-point) send/receive operations finish. This ensures that subsequent steps won’t fail due to missing or incomplete data.
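A sketch of that per-layer barrier, assuming the sends/receives for one layer are collected as torch.distributed P2POp objects (the helper name wait_layer_p2p is illustrative):

```python
import torch.distributed as dist


def wait_layer_p2p(p2p_ops: list[dist.P2POp]) -> None:
    """Block until this layer's point-to-point transfers have completed."""
    if not p2p_ops:
        return
    # batch_isend_irecv launches the grouped sends/recvs and returns one
    # request per op; every rank waits on its own requests before the staged
    # weights for this layer are consumed.
    for work in dist.batch_isend_irecv(p2p_ops):
        work.wait()
```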
All Reduce must be coupled with Rank synchronization to avoid deadlocks:
While waiting for P2P operations, an All Reduce is required to ensure all Ranks (GPUs) reach the same condition. This All Reduce must be executed at every rank interaction, otherwise desynchronized progress across GPUs may cause deadlocks. To reduce overhead, All Reduce should only be performed during the Rearrange phase, not in other stages.
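A sketch of the rank-synchronization check, assuming a small flag tensor is all-reduced only during the rearrange phase (all_ranks_ready is an illustrative name):

```python
import torch
import torch.distributed as dist


def all_ranks_ready(local_ready: bool, group=None) -> bool:
    # Reduce with MIN so the result is 1 only if every rank reported ready.
    # Issuing this collective at the same point on all ranks keeps their
    # progress aligned and avoids the deadlock described above.
    flag = torch.ones(1, device="cuda") if local_ready else torch.zeros(1, device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.MIN, group=group)
    return bool(flag.item())
```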
Update expert weights after each layer’s move_from_buffer:
After move_from_buffer at each layer, expert weights must be updated. Otherwise, computations may use stale weights, causing mapping inconsistencies and leading to accuracy issues.
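A sketch of the per-layer finalization (move_from_buffer comes from the PR's description; the mapping-refresh call update_expert_map is an assumed placeholder for however the model exposes it):

```python
def finish_layer(moe_layer, layer_idx, new_physical_to_logical_map):
    # Copy the staged weights for this layer back into the model tensors ...
    move_from_buffer(moe_layer, layer_idx)
    # ... and refresh the layer's expert mapping right away so the next forward
    # pass routes tokens to the new placement instead of stale weights.
    moe_layer.update_expert_map(new_physical_to_logical_map[layer_idx])
```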
Test Plan
Test Result
Benchmark: TP=8, EP=8, 8×H800 80 GB; dataset: MMLU-Pro.
Due to resource limitations, we only measured up to EP8; a larger EP size with non-blocking EPLB would yield greater benefits:
benchmark test without eplb: accuracy, benchmark, and profile results (screenshots attached in the PR)
non-blocking EPLB: accuracy, benchmark, and profile results (screenshots attached in the PR)
(Optional) Documentation Update
Co-authored-by: jiangkuaixue123 [email protected]
Co-authored-by: SunChenxiang123 [email protected]