Conversation

@Jialin (Collaborator) commented Nov 20, 2025

Purpose

This reverts commit 98b4d38.

As @gshtras reported offline, the original PR introduced a throughput regression. From @gshtras:

  • main: Avg latency: 222.78142010899805 seconds
  • main + revert 186352b27: Avg latency: 192.86235034199976 seconds

We've confirmed the regression locally and tried to fix forward in #29033, but it did not help.

Our learning here: although replacing list[int] with np.ndarray avoids bumping the GC allocation count, the conversion overhead is far too large and regresses throughput end-to-end.
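
To make the trade-off concrete, here is a minimal, hypothetical micro-benchmark sketch (not vLLM code; the request count and step count are made-up values that loosely mirror the test plan below) showing how re-materializing an ndarray from a growing token list on every decode step can cost more than the GC allocations it saves:

```python
# Hypothetical micro-benchmark, NOT vLLM code: compares keeping per-request
# token IDs as plain Python lists vs. converting them to np.ndarray each step.
import time

import numpy as np

NUM_REQUESTS = 3000  # roughly mirrors --batch-size in the test plan
NUM_STEPS = 128      # simulated decode steps


def append_only() -> float:
    token_ids: list[list[int]] = [[] for _ in range(NUM_REQUESTS)]
    start = time.perf_counter()
    for step in range(NUM_STEPS):
        for ids in token_ids:
            ids.append(step)  # cheap append; GC-tracked lists, but no copies
    return time.perf_counter() - start


def append_and_convert() -> float:
    token_ids: list[list[int]] = [[] for _ in range(NUM_REQUESTS)]
    start = time.perf_counter()
    for step in range(NUM_STEPS):
        for ids in token_ids:
            ids.append(step)
            # Re-materializing an ndarray every step copies the whole list,
            # so the conversion cost grows with sequence length.
            _ = np.asarray(ids, dtype=np.int64)
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"list[int] only:       {append_only():.3f}s")
    print(f"list -> ndarray/step: {append_and_convert():.3f}s")
```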

Test Plan & Test Result

vllm bench latency --model meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16 --batch-size 3000 --input-len 128 --output-len 2048 -tp 8 --num-iters-warmup 1 --num-iters 3

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request reverts a previous commit that introduced a performance regression by changing token ID representations from list[int] to np.ndarray. The revert seems mostly correct, but I've identified a critical issue in vllm/v1/worker/gpu_model_runner.py where the logic for handling speculative decoding outputs in asynchronous scheduling mode was not fully reverted. This could lead to incorrect outputs. My review includes a suggested fix for this issue.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

if self.input_batch.prev_sampled_token_ids is None:
    assert sampled_token_ids.shape[-1] == 1
    self.input_batch.prev_sampled_token_ids = sampled_token_ids

P1: Async scheduler reuses stale prev_sampled_token_ids

In the async scheduling branch, prev_sampled_token_ids is only populated when it is None (lines 2498-2500), and sample_tokens no longer clears it between iterations. After the first batch this condition remains false, so later iterations never refresh the cached sampled tokens. When _prepare_input_ids scatters cached tokens for requests that span iterations, it will reuse stale data from the first iteration, producing incorrect inputs whenever async scheduling processes multiple decode steps.
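
To make the failure mode easier to follow, here is a simplified sketch of the guarded caching pattern described above, plus one possible way to refresh the cache every iteration. The class and function names are illustrative, not vLLM's actual gpu_model_runner.py implementation:

```python
# Simplified sketch of the caching pattern Codex flags; names are hypothetical.
from typing import Optional

import numpy as np


class InputBatchSketch:
    def __init__(self) -> None:
        self.prev_sampled_token_ids: Optional[np.ndarray] = None


def cache_sampled_tokens_buggy(batch: InputBatchSketch,
                               sampled_token_ids: np.ndarray) -> None:
    # After the first iteration the guard is always False, so the cache keeps
    # the first iteration's tokens and later scatters reuse stale data.
    if batch.prev_sampled_token_ids is None:
        assert sampled_token_ids.shape[-1] == 1
        batch.prev_sampled_token_ids = sampled_token_ids


def cache_sampled_tokens_refreshed(batch: InputBatchSketch,
                                   sampled_token_ids: np.ndarray) -> None:
    # One possible fix: overwrite the cache on every iteration (or clear it
    # between iterations) so stale tokens are never reused.
    assert sampled_token_ids.shape[-1] == 1
    batch.prev_sampled_token_ids = sampled_token_ids
```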

@Jialin Jialin requested a review from zhuohan123 November 20, 2025 22:15
@zhuohan123 zhuohan123 enabled auto-merge (squash) November 20, 2025 23:43
@zhuohan123 zhuohan123 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 20, 2025
auto-merge was automatically disabled November 21, 2025 01:08

Head branch was pushed to by a user without write access

This reverts commit 98b4d38.

Signed-off-by: Jialin Ouyang <[email protected]>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 21, 2025 03:51
@vllm-bot vllm-bot merged commit 30b9c67 into vllm-project:main Nov 21, 2025
44 of 46 checks passed
LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
@wangxiyuan (Contributor) commented

TBH, this change has broken OOT again and again, lol. Any plan to make it robust enough before merging?

@DarkLight1337 (Member) commented

Pretty sure this is final now

@Jialin (Collaborator, Author) commented Nov 30, 2025

TBH, this change has broken OOT again and again, lol. Any plan to make it robust enough before merging?

@wangxiyuan n00b question: what's OOT? Before landing in the first place, we ensured all CI tests passed. I'm wondering if there's anything we should do to further improve CI coverage. Thanks for sharing the context.

@DarkLight1337 (Member) commented

OOT stands for Out-Of-Tree. In this case it refers to plugin packages for alternative hardware backends, such as vllm-ascend. Since those backends have their own model runner, any change to the interface of the inputs and outputs may break them.
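
As a hypothetical illustration (not vllm-ascend code) of why such interface changes bite out-of-tree runners, plugin code written against the old list[list[int]] contract can fail outright once the core starts handing it an ndarray:

```python
# Hypothetical example, not vllm-ascend code: an out-of-tree helper written
# against list[list[int]] breaks when the core passes np.ndarray instead.
import numpy as np


def oot_append_new_token(sampled_token_ids, new_token: int) -> None:
    for ids in sampled_token_ids:
        ids.append(new_token)  # valid for list rows, not for ndarray rows


oot_append_new_token([[1], [2]], 3)  # old contract: works fine
try:
    oot_append_new_token(np.array([[1], [2]]), 3)  # new contract: rows are ndarrays
except AttributeError as exc:
    print(f"OOT backend broke: {exc}")
```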

@Jialin (Collaborator, Author) commented Nov 30, 2025

OOT stands for Out-Of-Tree. In this case it refers to plugin packages for alternative hardware backends, such as vllm-ascend. Since those backends have their own model runner, any change to the interface of the inputs and outputs may break them.

Thanks for the explanation!

@wangxiyuan (Contributor) commented Dec 1, 2025

Never mind. Usually a breaking change for OOT is acceptable, but this kind of change (do-revert-redo-revert) is really rare.

kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Dec 2, 2025
1. fix vllm-project/vllm#28542
The model structure modifications we involved in are:
     - Qwen2.5-VL (some patches still exist)
     - Qwen2-VL
     - Qwen2
     - DeepSeek series
     - Qwen-moe series
2. fix vllm-project/vllm#29121
   the output token type has now changed from np.ndarray back to `list[list[int]]`

3. fix vllm-project/vllm#29262
    the `xformers` backend for multimodal has now been deprecated
4. fix vllm-project/vllm#29342

5. fix vllm-project/vllm#28579
6. fix vllm-project/vllm#28718
7. fix vllm-project/vllm#28665
8. fix vllm-project/vllm#26847
vLLM introduced `optimization-level`; some default configs have changed, and
the `--enforce-eager` param has been deprecated
9. fix http://github.com/vllm-project/vllm/pull/29223: the sampler now
returns a tuple.
10. fix vllm-project/vllm#29471: we'll remove the related patch to avoid this
kind of error.

Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: wangli <[email protected]>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Co-authored-by: wangli <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, tpu (Related to Google TPUs), v1
