
Conversation

@libertyeagle libertyeagle commented Nov 28, 2025

Purpose

This is the second PR toward milestone 2 of elastic EP; the first PR is #26278.
This PR integrates the NIXL EP kernel.
NIXL EP is an EP kernel built on NIXL's device API. It provides elastic scaling: processes (ranks) can be added and removed dynamically at runtime, without destroying and recreating communicators on scale-up/down.
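To make that contrast concrete, here is a toy, purely hypothetical sketch (all names invented here; this is not the PR's API) of in-place membership changes:

class ElasticEpGroup:
    """Toy, hypothetical model of an elastic EP group (not the PR's API)."""

    def __init__(self, ranks: list[int]) -> None:
        self.ranks = list(ranks)

    def add_ranks(self, new_ranks: list[int]) -> None:
        # Scale up: existing ranks keep their state and connections;
        # only links to the newcomers are established.
        self.ranks.extend(new_ranks)

    def remove_ranks(self, departing: list[int]) -> None:
        # Scale down: survivors drop links to departing ranks and keep
        # serving, with no global destroy-and-recreate step.
        self.ranks = [r for r in self.ranks if r not in departing]


group = ElasticEpGroup(ranks=[0, 1])
group.add_ranks([2, 3])       # 2 -> 4
group.remove_ranks([2, 3])    # 4 -> 2

By contrast, a conventional EP group would tear down the whole communicator and rebuild it at the new world size on every scale event.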

Test Plan

  • Basic elastic scaling up/down functionality with vLLM Elastic EP
  • Performance benchmark

Performance benchmark: Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 on 8×H100 with EP=8.
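For reference, one plausible way to launch the server side of this benchmark using existing vLLM flags (a sketch under assumptions: the elastic-EP-specific options introduced by this PR are omitted, and with expert parallelism enabled the EP size follows from TP × DP = 8):

# Assumed launch command; only standard vLLM flags are shown, not this
# PR's elastic-EP options. EP size = TP x DP = 8 x 1 = 8.
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel

The benchmark load was then generated with: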

vllm bench serve \
    --model $MODEL_NAME \
    --host $HOST \
    --port $PORT \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 512 \
    --num-prompts 512

Test Result

============ Serving Benchmark Result ============
Successful requests:                     512                                                                                                                                     
Failed requests:                         0                                                                                                                                       
Benchmark duration (s):                  7.84                                                                                                                                    
Total input tokens:                      65536                                                                                                                                   
Total generated tokens:                  262144                                                                                                                                  
Request throughput (req/s):              65.28     
Output token throughput (tok/s):         33425.72  
Peak output token throughput (tok/s):    21263.00  
Peak concurrent requests:                512.00    
Total Token throughput (tok/s):          41782.15  
---------------Time to First Token----------------
Mean TTFT (ms):                          1646.63   
Median TTFT (ms):                        1712.47   
P99 TTFT (ms):                           1740.90   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.04     
Median TPOT (ms):                        11.91     
P99 TPOT (ms):                           13.75     
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.62     
Median ITL (ms):                         23.87    
P99 ITL (ms):                            57.22     
==================================================

CC List

@ruisearch42 @tlrmchlsmth @kouroshHakha


support request serving during scaling up/down

Signed-off-by: Yongji Wu <[email protected]>

misc fixes

Signed-off-by: Yongji Wu <[email protected]>

minor fix

Signed-off-by: Yongji Wu <[email protected]>

minor fix

Signed-off-by: Yongji Wu <[email protected]>

scaling test: 2->4->2

Signed-off-by: Yongji Wu <[email protected]>

tiny fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

small fix

Signed-off-by: Yongji Wu <[email protected]>

small fix

Signed-off-by: Yongji Wu <[email protected]>

small fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>
Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>
@chatgpt-codex-connector

💡 Codex Review

    tensor = self.device_communicator.recv(
        tensor.size(), tensor.dtype, src
    )
else:
    tensor = self.tcp_store_group.recv(tensor, src)

P1: Preserve tensors when receiving over stateless CPU paths

When a stateless group receives tensors on the CPU path, the data is dropped: recv_tensor_dict assigns the return value of self.tcp_store_group.recv(...) to tensor, but StatelessProcessGroup.recv (vllm/distributed/utils.py:256-262) mutates the buffer in place and returns None. Any CPU tensor received through this branch is therefore stored as None in tensor_dict, corrupting the payload or triggering downstream failures. This occurs whenever elastic/stateless groups communicate non-CUDA tensors or when no device communicator is available.
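Given that description, a minimal sketch of the fix, mirroring the snippet above (the guarding condition is paraphrased as on_device, since it is not shown): on the CPU path the return value must be discarded, because the buffer is filled in place.

if on_device:  # placeholder for the actual branch condition (not shown above)
    tensor = self.device_communicator.recv(
        tensor.size(), tensor.dtype, src
    )
else:
    # StatelessProcessGroup.recv fills `tensor` in place and returns None,
    # so do not assign its return value back to `tensor`.
    self.tcp_store_group.recv(tensor, src)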


@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a significant and complex feature: elastic scaling for expert parallelism (EP) by integrating the NIXL-EP kernel. The changes are extensive, touching many core components of vLLM's distributed infrastructure, including communication primitives, model execution, and configuration management. The core of the feature is the introduction of stateless communication groups, which allow the cluster topology to be reconfigured dynamically without a full restart. A state machine orchestrates the scaling operations (both up and down), which is a robust approach for such a complex distributed process. The implementation also optimizes new-worker startup: new workers receive model weights from peers instead of loading from disk.

Overall, the changes appear well-architected and the logic is consistent across the various components. I found one high-severity issue: a debug print statement that should be removed.
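Since the review above highlights a state machine orchestrating the scaling, here is a purely illustrative sketch (states and events invented here, not taken from the PR) of how such orchestration is commonly modeled as explicit states and transitions:

from enum import Enum, auto


class ScaleState(Enum):
    """Hypothetical states for orchestrating an elastic scaling event."""
    STABLE = auto()         # normal serving
    DRAINING = auto()       # scale-down: stop routing to departing ranks
    TRANSFERRING = auto()   # scale-up: new ranks pull weights from peers
    RECONFIGURING = auto()  # all ranks adopt the new EP group membership


def next_state(state: ScaleState, event: str) -> ScaleState:
    """Toy transition function; real orchestration is per-rank and async."""
    transitions = {
        (ScaleState.STABLE, "scale_up"): ScaleState.TRANSFERRING,
        (ScaleState.STABLE, "scale_down"): ScaleState.DRAINING,
        (ScaleState.TRANSFERRING, "weights_ready"): ScaleState.RECONFIGURING,
        (ScaleState.DRAINING, "drained"): ScaleState.RECONFIGURING,
        (ScaleState.RECONFIGURING, "done"): ScaleState.STABLE,
    }
    # Unknown events leave the state unchanged.
    return transitions.get((state, event), state)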

mergify bot commented Dec 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @libertyeagle.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 1, 2025
@tlrmchlsmth tlrmchlsmth left a comment

edit: wrong PR
