Upgrade vLLM to v0.11.2 #4368
Status: Closed · +1,098 −185
Commits (40):

- `cf27c04` (leo-pony): fix break by vllm commit: Support LoRA with speculative decoding #21068
- `a86a6c1` (leo-pony): [Hybrid] Pass kernel block size to builders #27753
- `f1f20cc` (leo-pony): fix the main-to-main break by: [Bug] Fix env string 0 same to True #28159
- `2d83516` (leo-pony): [Core] Async scheduling + structured outputs compatibility #26866
- `ffc519f` (leo-pony): fix structure output break during adapt to llm: Async scheduling + s…
- `03cb9fb` (22dimensions): fix structured outputs compatibility
- `e1bbbd8` (leo-pony): fix mtp breaks in modelrunner and format fix
- `cb507ae` (leo-pony): fix mypy issues
- `3e1bbe8` (leo-pony): model runner execute model support v0.11.0 branch
- `399e165` (leo-pony): fix format issue
- `2dd522e` (22dimensions): update to releases/v0.11.1
- `5a50e2a` (leo-pony): fix break by vllm: [BugFix][VL] Fix FA selection on Qwen2.5-VL #27790
- `7f1cd59` (22dimensions): fix scheduler
- `8399298` (leo-pony): skip ut, nightly, v0.11.0
- `0c76f8e` (leo-pony): Skip 1th e2e full
- `f1f4161` (leo-pony): skip has tested cases
- `5bad0f3` (leo-pony): Fix vllm break: Support LoRA with speculative decoding #21068
- `0764b27` (leo-pony): remove skip of nightly a2
- `7502030` (leo-pony): fix the deepseek mtp break
- `eb2927f` (leo-pony): Add comments for deepseek torchair mtp break, vllm PR:27922
- `47cc8de` (leo-pony): Add comments fRestore full test cases
- `11f6d4e` (leo-pony): Enable torchair test in single card full test
- `4197a11` (leo-pony): just to trigger ci test
- `97df642` (leo-pony): fix vllm break by: Enable sequence parallelism matching w/o custom op…
- `58303c8` (leo-pony): vllm break of PR: https://github.com/vllm-project/vllm/pull/24794
- `dabd793` (22dimensions): adapt qwen3 next
- `ccb837b` (leo-pony): fix the undefined splitting_ops in test_sp_for_qwen3_moe
- `9a95847` (leo-pony): make light test take effect
- `f9f9bf2` (leo-pony): fix vllm break: Fix backend selection for encoder-only models (#28534)
- `40905c8` (leo-pony): fix vllm break: Refactor CUDA attention backend selection logic #24794
- `0a81671` (leo-pony): fix break of vllm: Rename clashing method names for vLLM model protoco…
- `940fb90` (leo-pony): fix lint error
- `3cc8278` (leo-pony): fix vllm break: Avoid bytecode hook and simplify TorchCompileWrapperW…
- `9111fc7` (leo-pony): skip A3-nightly test
- `3a07d19` (shen-shanshan): replace VisionPatchEmbed for better performance
- `c987667` (shen-shanshan): fix
- `6f5cfe1` (leo-pony): fix vllm break: [Model] Pass mm_features directly into get_mrope_inpu…
- `c03dfef` (leo-pony): fix vllm break: [Misc] Make SchedulerConfig.max_model_len init-only #2…
- `d977d4f` (leo-pony): fix vllm break: [V1] Support MP Executor for multi node distributed in…
- `fcfb3a0` (wangxiyuan): update version
The key hunk in the scheduler's `schedule()` method:

```diff
@@ -561,29 +561,47 @@ def schedule(self) -> SchedulerOutput:
             scheduled_spec_decode_tokens,
             req_to_new_blocks,
         )
-        scheduled_requests = (scheduled_new_reqs + scheduled_running_reqs +
-                              scheduled_resumed_reqs)
-        structured_output_request_ids, grammar_bitmask = (
-            self.get_grammar_bitmask(scheduled_requests,
-                                     scheduled_spec_decode_tokens))
-        scheduler_output = SchedulerOutput(
-            scheduled_new_reqs=new_reqs_data,
-            scheduled_cached_reqs=cached_reqs_data,
-            num_scheduled_tokens=num_scheduled_tokens,
-            total_num_scheduled_tokens=total_num_scheduled_tokens,
-            scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
-            scheduled_encoder_inputs=scheduled_encoder_inputs,
-            num_common_prefix_blocks=num_common_prefix_blocks,
-            # finished_req_ids is an existing state in the scheduler,
-            # instead of being newly scheduled in this step.
-            # It contains the request IDs that are finished in between
-            # the previous and the current steps.
-            finished_req_ids=self.finished_req_ids,
-            free_encoder_mm_hashes=self.encoder_cache_manager.
-            get_freed_mm_hashes(),
-            structured_output_request_ids=structured_output_request_ids,
-            grammar_bitmask=grammar_bitmask,
-        )
+        if vllm_version_is("0.11.0"):
+            scheduled_requests = (scheduled_new_reqs + scheduled_running_reqs +
+                                  scheduled_resumed_reqs)
+            structured_output_request_ids, grammar_bitmask = (
+                self.get_grammar_bitmask(scheduled_requests,
+                                         scheduled_spec_decode_tokens))
+            scheduler_output = SchedulerOutput(
+                scheduled_new_reqs=new_reqs_data,
+                scheduled_cached_reqs=cached_reqs_data,
+                num_scheduled_tokens=num_scheduled_tokens,
+                total_num_scheduled_tokens=total_num_scheduled_tokens,
+                scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
+                scheduled_encoder_inputs=scheduled_encoder_inputs,
+                num_common_prefix_blocks=num_common_prefix_blocks,
+                # finished_req_ids is an existing state in the scheduler,
+                # instead of being newly scheduled in this step.
+                # It contains the request IDs that are finished in between
+                # the previous and the current steps.
+                finished_req_ids=self.finished_req_ids,
+                free_encoder_mm_hashes=self.encoder_cache_manager.
+                get_freed_mm_hashes(),
+                structured_output_request_ids=structured_output_request_ids,
+                grammar_bitmask=grammar_bitmask,
+            )
+        else:
+            scheduler_output = SchedulerOutput(
+                scheduled_new_reqs=new_reqs_data,
+                scheduled_cached_reqs=cached_reqs_data,
+                num_scheduled_tokens=num_scheduled_tokens,
+                total_num_scheduled_tokens=total_num_scheduled_tokens,
+                scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
+                scheduled_encoder_inputs=scheduled_encoder_inputs,
+                num_common_prefix_blocks=num_common_prefix_blocks,
+                # finished_req_ids is an existing state in the scheduler,
+                # instead of being newly scheduled in this step.
+                # It contains the request IDs that are finished in between
+                # the previous and the current steps.
+                finished_req_ids=self.finished_req_ids,
+                free_encoder_mm_hashes=self.encoder_cache_manager.
+                get_freed_mm_hashes(),
+            )
```
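The intent of the branch above can be illustrated with a minimal, self-contained sketch. The `SchedulerOutput` dataclass and `vllm_version_is` helper below are simplified stand-ins for illustration only, not vLLM's real APIs: the diff shows that under vLLM v0.11.0 the scheduler still computes the grammar bitmask and passes the structured-output fields to the constructor, while on newer versions those fields are no longer passed at this call site.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for vllm_ascend's version helper; the real one
# checks the installed vLLM version string.
INSTALLED_VLLM = "0.11.2"

def vllm_version_is(version: str) -> bool:
    return INSTALLED_VLLM == version

# Simplified stand-in for vLLM's SchedulerOutput (most fields omitted).
@dataclass
class SchedulerOutput:
    finished_req_ids: set
    structured_output_request_ids: Optional[dict] = None
    grammar_bitmask: Optional[bytes] = None

def build_scheduler_output(finished_req_ids: set) -> SchedulerOutput:
    if vllm_version_is("0.11.0"):
        # Legacy path: the scheduler itself supplies the
        # structured-output fields to the constructor.
        return SchedulerOutput(
            finished_req_ids=finished_req_ids,
            structured_output_request_ids={},
            grammar_bitmask=b"",
        )
    # Current path: these fields are no longer passed here.
    return SchedulerOutput(finished_req_ids=finished_req_ids)
```

With the assumed installed version of 0.11.2, the second path is taken and both optional fields stay `None`.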
|
Comment on lines
+564
to
+604
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similar to another file in this PR, this change introduces significant code duplication for instantiating scheduler_output_kwargs = {
"scheduled_new_reqs": new_reqs_data,
"scheduled_cached_reqs": cached_reqs_data,
"num_scheduled_tokens": num_scheduled_tokens,
"total_num_scheduled_tokens": total_num_scheduled_tokens,
"scheduled_spec_decode_tokens": scheduled_spec_decode_tokens,
"scheduled_encoder_inputs": scheduled_encoder_inputs,
"num_common_prefix_blocks": num_common_prefix_blocks,
# finished_req_ids is an existing state in the scheduler,
# instead of being newly scheduled in this step.
# It contains the request IDs that are finished in between
# the previous and the current steps.
"finished_req_ids": self.finished_req_ids,
"free_encoder_mm_hashes": self.encoder_cache_manager.
get_freed_mm_hashes(),
}
if vllm_version_is("0.11.0"):
scheduled_requests = (scheduled_new_reqs + scheduled_running_reqs +
scheduled_resumed_reqs)
structured_output_request_ids, grammar_bitmask = (
self.get_grammar_bitmask(scheduled_requests,
scheduled_spec_decode_tokens))
scheduler_output_kwargs["structured_output_request_ids"] = structured_output_request_ids
scheduler_output_kwargs["grammar_bitmask"] = grammar_bitmask
scheduler_output = SchedulerOutput(**scheduler_output_kwargs) |
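The pattern in the suggestion (build the shared kwargs dict once, then layer version-specific arguments on top) can be exercised in isolation. A minimal runnable sketch using a toy dataclass; the names here are illustrative, not vLLM's real API:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class Output:
    num_scheduled_tokens: int
    grammar_bitmask: Optional[bytes] = None

def make_output(num_scheduled_tokens: int, legacy_api: bool) -> Output:
    # Arguments shared by every supported version are built exactly once,
    # so a future change cannot silently drift between two branches.
    kwargs: Dict[str, Any] = {"num_scheduled_tokens": num_scheduled_tokens}
    if legacy_api:
        # Version-specific arguments are added only where needed.
        kwargs["grammar_bitmask"] = b"\x01"
    return Output(**kwargs)
```

This keeps a single construction site, which is the maintainability point the reviewer raises: any new shared argument is added in one place.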
Context lines elsewhere in the diff:

```python
    # NOTE(Kuntai): this function is designed for multiple purposes:
    # 1. Plan the KV cache store
```
> The introduction of the `if vllm_version_is("0.11.0"):` block has led to significant code duplication for instantiating `SchedulerOutput`. The two branches are nearly identical, differing only by two arguments (`structured_output_request_ids` and `grammar_bitmask`). This duplication makes the code harder to maintain and increases the risk of introducing bugs if changes are not applied to both branches.