
Conversation

@libertyeagle libertyeagle commented Nov 28, 2025

Purpose

This is the second PR toward milestone 2 of elastic EP; the first PR is #26278.
This PR integrates the NIXL EP kernel.
NIXL EP is an EP kernel built on NIXL's device API. It provides elastic scaling: processes (ranks) can be added and removed dynamically at runtime, without destroying and recreating communicators on scale-up/down.
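To make that contrast concrete, here is a toy, purely hypothetical sketch (all names invented here; this is not the PR's API) of in-place membership changes:

class ElasticEpGroup:
    """Toy, hypothetical model of an elastic EP group (not the PR's API)."""

    def __init__(self, ranks: list[int]) -> None:
        self.ranks = list(ranks)

    def add_ranks(self, new_ranks: list[int]) -> None:
        # Scale up: existing ranks keep their state and connections;
        # only links to the newcomers are established.
        self.ranks.extend(new_ranks)

    def remove_ranks(self, departing: list[int]) -> None:
        # Scale down: survivors drop links to departing ranks and keep
        # serving, with no global destroy-and-recreate step.
        self.ranks = [r for r in self.ranks if r not in departing]


group = ElasticEpGroup(ranks=[0, 1])
group.add_ranks([2, 3])       # 2 -> 4
group.remove_ranks([2, 3])    # 4 -> 2

By contrast, a conventional EP group would tear down the whole communicator and rebuild it at the new world size on every scale event.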

Test Plan

  • Basic elastic scaling up/down functionality with vLLM Elastic EP
  • Performance benchmark

Performance benchmark: Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 on 8×H100 with EP=8.
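For reference, one plausible way to launch the server side of this benchmark using existing vLLM flags (a sketch under assumptions: the elastic-EP-specific options introduced by this PR are omitted, and with expert parallelism enabled the EP size follows from TP × DP = 8):

# Assumed launch command; only standard vLLM flags are shown, not this
# PR's elastic-EP options. EP size = TP x DP = 8 x 1 = 8.
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel

The benchmark load was then generated with: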

vllm bench serve \
    --model $MODEL_NAME \
    --host $HOST \
    --port $PORT \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 512 \
    --num-prompts 512

Test Result

============ Serving Benchmark Result ============
Successful requests:                     512                                                                                                                                     
Failed requests:                         0                                                                                                                                       
Benchmark duration (s):                  7.84                                                                                                                                    
Total input tokens:                      65536                                                                                                                                   
Total generated tokens:                  262144                                                                                                                                  
Request throughput (req/s):              65.28     
Output token throughput (tok/s):         33425.72  
Peak output token throughput (tok/s):    21263.00  
Peak concurrent requests:                512.00    
Total Token throughput (tok/s):          41782.15  
---------------Time to First Token----------------
Mean TTFT (ms):                          1646.63   
Median TTFT (ms):                        1712.47   
P99 TTFT (ms):                           1740.90   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.04     
Median TPOT (ms):                        11.91     
P99 TPOT (ms):                           13.75     
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.62     
Median ITL (ms):                         23.87    
P99 ITL (ms):                            57.22     
==================================================

CC List

@ruisearch42 @tlrmchlsmth @kouroshHakha


support request serving during scaling up/down

Signed-off-by: Yongji Wu <[email protected]>

misc fixes

Signed-off-by: Yongji Wu <[email protected]>

minor fix

Signed-off-by: Yongji Wu <[email protected]>

minor fix

Signed-off-by: Yongji Wu <[email protected]>

scaling test: 2->4->2

Signed-off-by: Yongji Wu <[email protected]>

tiny fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

small fix

Signed-off-by: Yongji Wu <[email protected]>

small fix

Signed-off-by: Yongji Wu <[email protected]>

small fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>
Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>

rebase fix

Signed-off-by: Yongji Wu <[email protected]>
@chatgpt-codex-connector

💡 Codex Review

    tensor = self.device_communicator.recv(
        tensor.size(), tensor.dtype, src
    )
else:
    tensor = self.tcp_store_group.recv(tensor, src)

P1: Preserve tensors when receiving over stateless CPU paths

When a stateless group receives tensors on the CPU path, the data is dropped: recv_tensor_dict assigns the return value of self.tcp_store_group.recv(...) to tensor, but StatelessProcessGroup.recv (vllm/distributed/utils.py:256-262) mutates the buffer in place and returns None. Any CPU tensor received through this branch is therefore stored as None in tensor_dict, corrupting the payload or triggering downstream failures. This occurs whenever elastic/stateless groups communicate non-CUDA tensors or when no device communicator is available.
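Given that description, a minimal sketch of the fix, mirroring the snippet above (the guarding condition is paraphrased as on_device, since it is not shown): on the CPU path the return value must be discarded, because the buffer is filled in place.

if on_device:  # placeholder for the actual branch condition (not shown above)
    tensor = self.device_communicator.recv(
        tensor.size(), tensor.dtype, src
    )
else:
    # StatelessProcessGroup.recv fills `tensor` in place and returns None,
    # so do not assign its return value back to `tensor`.
    self.tcp_store_group.recv(tensor, src)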


@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a significant and complex feature: elastic scaling for expert parallelism (EP) by integrating the NIXL-EP kernel. The changes are extensive, touching many core components of vLLM's distributed infrastructure, including communication primitives, model execution, and configuration management. The core of the feature is the introduction of stateless communication groups, which allow the cluster topology to be reconfigured dynamically without a full restart. A state machine orchestrates the scaling operations (both up and down), which is a robust approach for such a complex distributed process. The implementation also optimizes new-worker startup: new workers receive model weights from peers instead of loading from disk.

Overall, the changes appear well-architected and the logic is consistent across the various components. I found one high-severity issue: a debug print statement that should be removed.
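Since the review above highlights a state machine orchestrating the scaling, here is a purely illustrative sketch (states and events invented here, not taken from the PR) of how such orchestration is commonly modeled as explicit states and transitions:

from enum import Enum, auto


class ScaleState(Enum):
    """Hypothetical states for orchestrating an elastic scaling event."""
    STABLE = auto()         # normal serving
    DRAINING = auto()       # scale-down: stop routing to departing ranks
    TRANSFERRING = auto()   # scale-up: new ranks pull weights from peers
    RECONFIGURING = auto()  # all ranks adopt the new EP group membership


def next_state(state: ScaleState, event: str) -> ScaleState:
    """Toy transition function; real orchestration is per-rank and async."""
    transitions = {
        (ScaleState.STABLE, "scale_up"): ScaleState.TRANSFERRING,
        (ScaleState.STABLE, "scale_down"): ScaleState.DRAINING,
        (ScaleState.TRANSFERRING, "weights_ready"): ScaleState.RECONFIGURING,
        (ScaleState.DRAINING, "drained"): ScaleState.RECONFIGURING,
        (ScaleState.RECONFIGURING, "done"): ScaleState.STABLE,
    }
    # Unknown events leave the state unchanged.
    return transitions.get((state, event), state)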

mergify bot commented Dec 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @libertyeagle.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 1, 2025
@tlrmchlsmth tlrmchlsmth left a comment

edit: wrong PR
