[2/N] Elastic EP Milestone 2: Integrating NIXL-EP #29630
Commit history (all Signed-off-by: Yongji Wu <[email protected]>):
- support request serving during scaling up/down
- scaling test: 2->4->2
- assorted misc/minor/small fixes and rebase fixes
(force-pushed from 8ba94c2 to 297bec9)
Codex Review (vllm/distributed/stateless_coordinator.py, lines 285 to 289 at 8ba94c2): When a stateless group receives tensors on the CPU path, the data is dropped.
Code Review
This pull request introduces a significant and complex feature: elastic scaling for expert parallelism (EP) by integrating the NIXL-EP kernel. The changes are extensive, touching many core components of vLLM's distributed infrastructure, including communication primitives, model execution, and configuration management.

The core of this feature is the introduction of stateless communication groups, which allow dynamic reconfiguration of the cluster topology without requiring a full restart. A state machine orchestrates the scaling operations (both up and down), which is a robust approach for such a complex distributed process. The implementation also optimizes new-worker startup: new workers receive model weights from peers instead of loading them from disk.

Overall, the changes appear well-architected and the logic is consistent across the various components. I have found one high-severity issue related to a debug print statement that should be removed.
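The review above mentions a state machine that orchestrates scale-up and scale-down. As a rough illustration of that pattern only (the state names, transitions, and `ScalingCoordinator` class below are hypothetical and not taken from this PR), a minimal sketch in Python:

```python
# Illustrative sketch of a scaling state machine; all names here are
# hypothetical, not vLLM's actual implementation.
from enum import Enum, auto


class ScaleState(Enum):
    STABLE = auto()          # serving normally
    PREPARING = auto()       # joining/departing ranks announced
    TRANSFERRING = auto()    # weights/experts moving between ranks
    RECONFIGURING = auto()   # communication groups being rebuilt


# Legal transitions: a scaling operation must pass through each phase in
# order and return to STABLE when it completes.
ALLOWED = {
    ScaleState.STABLE: {ScaleState.PREPARING},
    ScaleState.PREPARING: {ScaleState.TRANSFERRING},
    ScaleState.TRANSFERRING: {ScaleState.RECONFIGURING},
    ScaleState.RECONFIGURING: {ScaleState.STABLE},
}


class ScalingCoordinator:
    def __init__(self) -> None:
        self.state = ScaleState.STABLE

    def advance(self, nxt: ScaleState) -> None:
        # Reject out-of-order transitions so a partially completed scaling
        # operation can never be silently skipped.
        if nxt not in ALLOWED[self.state]:
            raise RuntimeError(f"illegal transition {self.state} -> {nxt}")
        self.state = nxt
```

The value of encoding the phases explicitly is that every worker can validate incoming control messages against the same transition table, making out-of-order or duplicate scaling commands fail loudly instead of corrupting state.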
This pull request has merge conflicts that must be resolved before it can be merged.
edit: wrong PR
Purpose
This is the second PR toward milestone 2 of elastic EP; the first PR is #26278.
This PR integrates the NIXL-EP kernel.
NIXL-EP is an EP kernel based on NIXL's device API. It provides elastic scaling capabilities, enabling dynamic addition and removal of processes (ranks) at runtime without destroying and recreating communicators during scale-up/scale-down.
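As a rough illustration of the "no communicator teardown" idea described above (the `StatelessGroup` class and its methods below are hypothetical, not NIXL's or vLLM's actual API), group membership can be treated as plain data that is simply re-bound during scaling:

```python
# Hypothetical sketch: because membership is just data, scaling up or down
# is a cheap re-bind rather than a communicator destroy/recreate cycle.
# Names here are illustrative only.
from dataclasses import dataclass, field


@dataclass
class StatelessGroup:
    ranks: list[int] = field(default_factory=list)

    @property
    def world_size(self) -> int:
        return len(self.ranks)

    def scale_to(self, new_ranks: list[int]) -> "StatelessGroup":
        # Return a new view over the updated membership; no persistent
        # communicator state is torn down in the process.
        return StatelessGroup(ranks=sorted(new_ranks))


g = StatelessGroup(ranks=[0, 1])    # start with EP=2
g = g.scale_to([0, 1, 2, 3])        # scale up to EP=4
g = g.scale_to([0, 1])              # scale back down to EP=2
```

This mirrors the 2->4->2 scaling test exercised in the commit history: each resize is a membership change, and in-flight requests can keep being served because the old group object remains valid until the re-bind completes.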
Test Plan
Performance testing script: Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 model on 8xH100 with EP=8.

```shell
vllm bench serve \
  --model $MODEL_NAME \
  --host $HOST \
  --port $PORT \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 512 \
  --num-prompts 512
```

Test Result
CC List
@ruisearch42 @tlrmchlsmth @kouroshHakha