Revert "[Redo] #26368 (#28771)" #29121
Conversation
Code Review
This pull request reverts a previous commit that introduced a performance regression by changing token ID representations from list[int] to np.ndarray. The revert seems mostly correct, but I've identified a critical issue in vllm/v1/worker/gpu_model_runner.py where the logic for handling speculative decoding outputs in asynchronous scheduling mode was not fully reverted. This could lead to incorrect outputs. My review includes a suggested fix for this issue.
💡 Codex Review
vllm/v1/worker/gpu_model_runner.py, lines 2498 to 2500 at 97db3b6:

if self.input_batch.prev_sampled_token_ids is None:
    assert sampled_token_ids.shape[-1] == 1
    self.input_batch.prev_sampled_token_ids = sampled_token_ids
In the async scheduling branch, prev_sampled_token_ids is only populated when it is None (lines 2498-2500), and sample_tokens no longer clears it between iterations. After the first batch this condition remains false, so later iterations never refresh the cached sampled tokens. When _prepare_input_ids scatters cached tokens for requests that span iterations, it will reuse stale data from the first iteration, producing incorrect inputs whenever async scheduling processes multiple decode steps.
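For illustration, here is a simplified, hypothetical sketch of the caching pattern being flagged (the class and function names below are invented stand-ins, not vLLM's real API): the buggy variant fills the cache only once, while the fixed variant refreshes it on every step so later iterations never scatter stale tokens.

```python
class InputBatch:
    """Stand-in for the real input batch; only the cached-token field matters here."""

    def __init__(self) -> None:
        self.prev_sampled_token_ids: list[list[int]] | None = None


def buggy_cache_update(batch: InputBatch, sampled_token_ids: list[list[int]]) -> None:
    # Only populates the cache when it is empty, so every iteration after the
    # first keeps reusing the tokens sampled in iteration 0.
    if batch.prev_sampled_token_ids is None:
        batch.prev_sampled_token_ids = sampled_token_ids


def fixed_cache_update(batch: InputBatch, sampled_token_ids: list[list[int]]) -> None:
    # Refreshes the cache on every iteration, so consumers that scatter cached
    # tokens for in-flight requests always see the latest values.
    batch.prev_sampled_token_ids = sampled_token_ids
```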
TBH, this change has broken OOT again and again, lol. Any plan to make it robust enough before merging?
Pretty sure this is final now.
@wangxiyuan n00b question: what's OOT? Before landing in the first place, we ensured all CI tests passed. I'm wondering if there's anything we should do to further improve CI coverage. Thanks for sharing the context.
OOT stands for Out-Of-Tree. In this case it refers to plugin packages for alternative hardware backends, such as vllm-ascend. Since those backends have their own model runner, any change to the interface of the inputs and outputs may break them.

Thanks for the explanation!
Never mind. Usually a breaking change for OOT is acceptable, but this kind of change (do-revert-redo-revert) is really rare.
1. Fix vllm-project/vllm#28542. The model structure modifications involved are:
   - Qwen2.5-VL (some patches still remain)
   - Qwen2-VL
   - Qwen2
   - DeepSeek series
   - Qwen-MoE series
2. Fix vllm-project/vllm#29121: the output token type changed from np.ndarray to `list[list[int]]` (a compatibility sketch follows this commit message).
3. Fix vllm-project/vllm#29262: the `xformers` backend for multimodal has been deprecated.
4. Fix vllm-project/vllm#29342.
5. Fix vllm-project/vllm#28579.
6. Fix vllm-project/vllm#28718.
7. Fix vllm-project/vllm#28665.
8. Fix vllm-project/vllm#26847: vLLM introduced the `optimization-level` setting, some default configs changed, and the `--enforce-eager` parameter has been deprecated.
9. Fix http://github.com/vllm-project/vllm/pull/29223: the sampler now returns a tuple.
10. Fix vllm-project/vllm#29471: we will remove the related patch to avoid this kind of error.

vLLM version: v0.11.2

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Co-authored-by: wangli <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
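The output-token type change in item 2 is the interface break this revert creates for out-of-tree model runners. A hypothetical compatibility shim (not vllm-ascend's actual patch; the helper name is invented) could normalize both representations:

```python
import numpy as np


def normalize_sampled_token_ids(sampled) -> list[list[int]]:
    """Hypothetical helper: accept either the pre-revert np.ndarray form or the
    post-revert list[list[int]] form and always return list[list[int]]."""
    if isinstance(sampled, np.ndarray):
        # Old interface: a 2-D array of token IDs, one row per request.
        return sampled.tolist()
    # New (reverted) interface: already a list of per-request token ID lists.
    return [list(row) for row in sampled]
```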
Purpose
This reverts commit 98b4d38.
As @gshtras reported offline, the original PR introduced a throughput regression. From @gshtras:
We confirmed the regression locally and tried to fix forward in #29033, but it did not help.
Our takeaway: although replacing list[int] with np.ndarray avoids bumping the GC allocation count, the conversion overhead is far too large and regresses end-to-end throughput.
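As a rough illustration of that takeaway (a hypothetical micro-benchmark sketch, not the measurement used to confirm the regression), round-tripping token IDs through np.ndarray on the per-step hot path pays conversion costs that plain Python lists avoid:

```python
import timeit

import numpy as np

token_ids = list(range(8192))  # stand-in for one step's sampled token IDs

# Staying with plain Python lists: a cheap copy, no dtype conversion.
plain = timeit.timeit(lambda: list(token_ids), number=1000)

# Round-tripping through np.ndarray: pays array construction plus tolist().
roundtrip = timeit.timeit(
    lambda: np.array(token_ids, dtype=np.int64).tolist(), number=1000
)

print(f"plain list copy:   {plain:.4f}s")
print(f"ndarray roundtrip: {roundtrip:.4f}s")
```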
Test Plan & Test Result