Revert "[Redo] #26368 (#28771)" #29121
Conversation
Code Review
This pull request reverts a previous commit that introduced a performance regression by changing token ID representations from list[int] to np.ndarray. The revert seems mostly correct, but I've identified a critical issue in vllm/v1/worker/gpu_model_runner.py where the logic for handling speculative decoding outputs in asynchronous scheduling mode was not fully reverted. This could lead to incorrect outputs. My review includes a suggested fix for this issue.
💡 Codex Review
vllm/v1/worker/gpu_model_runner.py, lines 2498 to 2500 at 97db3b6:

if self.input_batch.prev_sampled_token_ids is None:
    assert sampled_token_ids.shape[-1] == 1
    self.input_batch.prev_sampled_token_ids = sampled_token_ids
In the async scheduling branch, prev_sampled_token_ids is only populated when it is None (lines 2498-2500), and sample_tokens no longer clears it between iterations. After the first batch this condition remains false, so later iterations never refresh the cached sampled tokens. When _prepare_input_ids scatters cached tokens for requests that span iterations, it will reuse stale data from the first iteration, producing incorrect inputs whenever async scheduling processes multiple decode steps.
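For illustration, here is a simplified, hypothetical sketch of the caching pattern being flagged (the class and function names below are invented stand-ins, not vLLM's real API): the buggy variant fills the cache only once, while the fixed variant refreshes it on every step so later iterations never scatter stale tokens.

```python
class InputBatch:
    """Stand-in for the real input batch; only the cached-token field matters here."""

    def __init__(self) -> None:
        self.prev_sampled_token_ids: list[list[int]] | None = None


def buggy_cache_update(batch: InputBatch, sampled_token_ids: list[list[int]]) -> None:
    # Only populates the cache when it is empty, so every iteration after the
    # first keeps reusing the tokens sampled in iteration 0.
    if batch.prev_sampled_token_ids is None:
        batch.prev_sampled_token_ids = sampled_token_ids


def fixed_cache_update(batch: InputBatch, sampled_token_ids: list[list[int]]) -> None:
    # Refreshes the cache on every iteration, so consumers that scatter cached
    # tokens for in-flight requests always see the latest values.
    batch.prev_sampled_token_ids = sampled_token_ids
```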
TBH, this change has broken OOT again and again, lol. Any plan to make it robust enough before merging?
Pretty sure this is final now.
@wangxiyuan n00b question: what's OOT? Before landing in the first place, we ensured all CI tests passed. I'm wondering if there's anything we should do to further improve CI coverage. Thanks for sharing the context.
OOT stands for Out-Of-Tree. In this case it refers to plugin packages for alternative hardware backends, such as vllm-ascend. Since those backends have their own model runner, any change to the interface of the inputs and outputs may break them.

Thanks for the explanation!
Never mind. Usually a breaking change for OOT is acceptable, but this kind of change (do-revert-redo-revert) is really rare.
1. Fix vllm-project/vllm#28542. The model structure modifications involved are:
   - Qwen2.5-VL (some patches still remain)
   - Qwen2-VL
   - Qwen2
   - DeepSeek series
   - Qwen-MoE series
2. Fix vllm-project/vllm#29121: the output token type changed from np.ndarray to `list[list[int]]` (a compatibility sketch follows this commit message).
3. Fix vllm-project/vllm#29262: the `xformers` backend for multimodal has been deprecated.
4. Fix vllm-project/vllm#29342.
5. Fix vllm-project/vllm#28579.
6. Fix vllm-project/vllm#28718.
7. Fix vllm-project/vllm#28665.
8. Fix vllm-project/vllm#26847: vLLM introduced the `optimization-level` setting, some default configs changed, and the `--enforce-eager` parameter has been deprecated.
9. Fix http://github.com/vllm-project/vllm/pull/29223: the sampler now returns a tuple.
10. Fix vllm-project/vllm#29471: we will remove the related patch to avoid this kind of error.

vLLM version: v0.11.2

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Co-authored-by: wangli <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
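The output-token type change in item 2 is the interface break this revert creates for out-of-tree model runners. A hypothetical compatibility shim (not vllm-ascend's actual patch; the helper name is invented) could normalize both representations:

```python
import numpy as np


def normalize_sampled_token_ids(sampled) -> list[list[int]]:
    """Hypothetical helper: accept either the pre-revert np.ndarray form or the
    post-revert list[list[int]] form and always return list[list[int]]."""
    if isinstance(sampled, np.ndarray):
        # Old interface: a 2-D array of token IDs, one row per request.
        return sampled.tolist()
    # New (reverted) interface: already a list of per-request token ID lists.
    return [list(row) for row in sampled]
```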
Purpose
This reverts commit 98b4d38.
As @gshtras reported offline, the original PR introduced a throughput regression. From @gshtras:
We confirmed the regression locally and tried to fix forward in #29033, but it did not help.
Our takeaway: although replacing list[int] with np.ndarray avoids bumping the GC allocation count, the conversion overhead is far too large and regresses end-to-end throughput.
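As a rough illustration of that takeaway (a hypothetical micro-benchmark sketch, not the measurement used to confirm the regression), round-tripping token IDs through np.ndarray on the per-step hot path pays conversion costs that plain Python lists avoid:

```python
import timeit

import numpy as np

token_ids = list(range(8192))  # stand-in for one step's sampled token IDs

# Staying with plain Python lists: a cheap copy, no dtype conversion.
plain = timeit.timeit(lambda: list(token_ids), number=1000)

# Round-tripping through np.ndarray: pays array construction plus tolist().
roundtrip = timeit.timeit(
    lambda: np.array(token_ids, dtype=np.int64).tolist(), number=1000
)

print(f"plain list copy:   {plain:.4f}s")
print(f"ndarray roundtrip: {roundtrip:.4f}s")
```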
Test Plan & Test Result