[Bugfix] Fix error with penalties when speculative decoding and structural output are enabled #26586
Conversation
…put are enabled Signed-off-by: southfreebird <[email protected]>
Code Review
This pull request addresses a critical bug that causes a RuntimeError when speculative decoding and structured output are used together with logit processors. The root cause is that stale speculative token data could persist in InputBatch if the scheduler drops all draft tokens for a request, leading to out-of-bounds errors in subsequent penalty calculations. The fix correctly ensures that InputBatch.spec_token_ids is always updated, even with an empty list of tokens, thus preventing state corruption. The change is logical, well-commented, and effectively resolves the issue. The implementation looks correct.
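The fix described above can be sketched in isolation. This is a hypothetical illustration of the buggy-vs-fixed update pattern, not vLLM's actual `gpu_model_runner.py` code: the `InputBatch` shape and function names here are simplified stand-ins based on the review's description.

```python
# Minimal sketch of the stale-draft-token bug and its fix, assuming a
# simplified InputBatch; the real vLLM structures differ.

class InputBatch:
    def __init__(self) -> None:
        # Per-request speculative (draft) token ids from the last step.
        self.spec_token_ids: dict[str, list[int]] = {}


def update_states_buggy(batch: InputBatch,
                        scheduled_spec_decode_tokens: dict[str, list[int]]) -> None:
    # Buggy pattern: spec_token_ids is only assigned for requests that
    # appear in scheduled_spec_decode_tokens. If the scheduler dropped
    # all draft tokens for a request (e.g. none met the structural
    # schema), the mapping is empty and stale tokens persist, later
    # causing out-of-bounds indexing in penalty calculations.
    for req_id, draft_ids in scheduled_spec_decode_tokens.items():
        batch.spec_token_ids[req_id] = draft_ids


def update_states_fixed(batch: InputBatch,
                        req_ids: list[str],
                        scheduled_spec_decode_tokens: dict[str, list[int]]) -> None:
    # Fixed pattern: always assign, defaulting to an empty list, so a
    # request whose draft tokens were all dropped cannot keep stale
    # spec_token_ids from a previous step.
    for req_id in req_ids:
        batch.spec_token_ids[req_id] = scheduled_spec_decode_tokens.get(req_id, [])
```

With the buggy variant, a request absent from `scheduled_spec_decode_tokens` silently retains last step's draft tokens; the fixed variant resets it to `[]`, matching the PR's change of always updating `InputBatch.spec_token_ids` even with an empty list.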
vllm/v1/worker/gpu_model_runner.py
Outdated
# meet the structural schema. This means that
# scheduler_output.scheduled_spec_decode_tokens might be empty,
# even when speculative decoding is enabled. So, we moved this line
# from the 'if' block above.
Please rephrase the comment so that it explains the current state of the code rather than the change made to it. Comments about moved lines become less meaningful over time as the code is refactored.
Signed-off-by: southfreebird <[email protected]>
benchislett
left a comment
LGTM, Thanks!
@southfreebird could you rebase on latest main?
…dec-and-structural-output
2 critical fixes that cannot be implemented as plugins:

1. Qwen3 tool parser fix (line 523):
   - Fixes missing opening brace in streaming tool calls
   - One-line fix: removed buggy condition
   - Upstreamable: Yes

2. Eagle rejection sampler fix (gpu_model_runner.py):
   - Cherry-picked from PR vllm-project#26586 (pending upstream merge)
   - Fixes RuntimeError with Eagle + penalties
   - Moved spec_token_ids assignment outside if block

Plus minor fixes:
- DeepSeek R1 reasoning parser import
- Config __init__.py ordering

See: IN_TREE_MODIFICATIONS.md for details

Signed-off-by: Pradyun Ramadorai <[email protected]>
Merged 8 commits from origin/main including:
- PR vllm-project#26586: Eagle rejection sampler fix (previously cherry-picked)
- LoRA CUDA graph specialization (vllm-project#25914)
- Bee-8B VLM model support (vllm-project#27012)
- Utilities reorganization (network_utils, async_utils, etc.)
- Multiple bug fixes and improvements

In-Tree Modifications:
- Removed Eagle rejection sampler cherry-pick (now in upstream)
- Kept Qwen3 tool parser fix (still needed, line 523)
- Only 1 active in-tree modification remaining

Plugin Compatibility:
- All 10 plugin patches load successfully
- No target class changes required
- Clean merge with no conflicts

Documentation Updates:
- Updated IN_TREE_MODIFICATIONS.md (moved Eagle fix to Removed/Obsolete)
- Updated CLAUDE.md merge history
- Verified clean diff with origin/main (3 files, all documented)

Signed-off-by: Pradyun Ramadorai <[email protected]>
…tural output are enabled (vllm-project#26586) Signed-off-by: southfreebird <[email protected]>
…tural output are enabled (vllm-project#26586) Signed-off-by: southfreebird <[email protected]> Signed-off-by: Alberto Perdomo <[email protected]>
…tural output are enabled (vllm-project#26586) Signed-off-by: southfreebird <[email protected]> Signed-off-by: 0xrushi <[email protected]>
Fix an error that appears after #19482 when logit processors (such as penalties) are enabled together with speculative decoding and structural output. An example of the error:
Purpose
Test Plan
Test Result