✨ vllm support for 0.11.1 release #546

Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. Now you are good to go 🚀
```toml
[tool.uv.sources]
vllm = { git = "https://github.com/vllm-project/vllm", rev = "v0.11.1" }
```
Installing vllm this way (with VLLM_TARGET_DEVICE=empty) leaves the extra CUDA-only dependencies out of the uv.lock, since the published vllm wheels on PyPI are only built for CUDA.
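(As a hedged illustration, not quoted from the PR: with that `[tool.uv.sources]` entry in place, regenerating the lockfile with `VLLM_TARGET_DEVICE=empty uv lock` resolves vllm from the pinned git tag, so the CUDA-only extras never enter uv.lock.)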
tjohnson31415 left a comment:
Somehow you make backwards compatibility elegant
```python
extra_args = {}
if "structured_output_request_ids" in dataclass_fields(SchedulerOutput):
    extra_args["structured_output_request_ids"] = {}
if "grammar_bitmask" in dataclass_fields(SchedulerOutput):
    extra_args["grammar_bitmask"] = None
```
It looks like we could just import and use _get_extra_args() from the spyre_worker to reduce code duplication.
private imports!!
but yeah, for a test file that's probably fine
vllm_spyre/platform.py (outdated)
```python
) -> None:
    """Raises if this request is unsupported on this platform"""

    # TODO: fix
```
Is this a TODO for this PR to fix before merging?
oh, maybe 🤔
I think I put the TODO in because the lazy import was suuuper ugly, but I do think the import has to stay lazy or we'll hit a circular import :(. The TODO here might be to just remove the TODO and replace it with a comment about why this is the way it is.
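A minimal sketch of the lazy-import pattern being discussed, with hypothetical module and helper names (the real vllm-spyre layout may differ):

```python
# Hypothetical sketch: the import is deferred into the function body because a
# top-level import would form a cycle (the imported module also imports this
# platform module at load time).
def validate_request(request) -> None:
    """Raises if this request is unsupported on this platform"""
    # Lazy import; a module-level import here would be circular.
    from vllm_spyre.request_checks import is_supported  # hypothetical module

    if not is_supported(request):
        raise ValueError(f"Unsupported request on this platform: {request!r}")
```

Keeping the import inside the function, with a comment explaining the cycle, is the "remove the TODO and replace with a comment" option mentioned above.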
This PR bumps fms-model-optimizer to 0.7.0.
Alright @tjohnson31415, looks like we are 🟢 for now. Thanks for the fms-mo hint: I validated that fms-mo 0.7.0 still works on Spyre, and it's just the CPU execution that's broken. I've bumped here to the latest main commit, which also appears to work fine on Spyre. Let's talk on Monday: maybe we should get a new official fms-mo release instead of pinning a commit. And with our current release cadence, I'm not entirely sure whether we'd want to bump the actual vllm install to 0.11.1, or flip this around, keep the uv.lock at 0.11.0, and just add a compatibility test for 0.11.1. Either way, we should get the currently-good set of Spyre unit tests run on this before merging.
yannicks1 left a comment:
lgtm!
```diff
-cached_request_data = CachedRequestData(
-    req_ids=req_ids,
-    resumed_from_preemption=False,
-    new_token_ids=new_token_ids,
-    new_block_ids=new_block_ids,
-    num_computed_tokens=num_computed_tokens,
-)
+cached_request_data = CachedRequestData.make_empty()
+cached_request_data.req_ids = req_ids
+cached_request_data.new_block_ids = new_block_ids
+cached_request_data.new_token_ids = new_token_ids
+cached_request_data.num_computed_tokens = num_computed_tokens
```
just for my understanding: what is the motivation for this change?
More fields were added to this dataclass, so normally we would have to do the checks on dataclass_fields to inject empty kwargs into the initializer call here. But this class offers a make_empty() that initializes everything with default values, so we use that instead and then only set the values that we care about. That way we don't have any backwards-compatibility cleanup to worry about later.
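For contrast, a sketch of the field-probing pattern this avoids; the probed field name below is hypothetical, standing in for a field that exists only in newer vLLM releases:

```python
from dataclasses import fields as dataclass_fields

# Probe the installed vLLM's CachedRequestData for newer-only fields and
# inject defaults for them ("new_only_field" is a hypothetical name).
field_names = {f.name for f in dataclass_fields(CachedRequestData)}
extra_args = {}
if "new_only_field" in field_names:
    extra_args["new_only_field"] = None

cached_request_data = CachedRequestData(
    req_ids=req_ids,
    resumed_from_preemption=False,
    new_token_ids=new_token_ids,
    new_block_ids=new_block_ids,
    num_computed_tokens=num_computed_tokens,
    **extra_args,
)
```

With make_empty(), none of this probing is needed: new fields simply keep their defaults until a caller actually sets them.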
We have one last error to fix for CP (chunked prefill):
@joerunde I fixed the failing test. Hope you don't mind that I pushed to your branch, but I thought it would save us some GHA time, and you can hit merge as soon as you wake up :)
Thanks @yannicks1!

bot:test
tjohnson31415 left a comment:
LGTM. Yeah, let's chat about the vLLM pin and the fms-mo release.
bot:test
maxdebayser left a comment:
LGTM. We can merge once the tests pass.
bot:test

bot:test
The continuous batching tests passed on our bot test, and I was able to get the chunked prefill tests working on a dev pod. (The graph comparison tests still fail on chunked prefill with an old version of aftu on bot:test runs 😢) Test results
```python
print("\n\n\n\n\t\tNUM BLOCKS:", num_blocks)
print("\t\tBLOCK SIZE:", self.kv_cache_specs['block_size'])
print("\t\tNUM KV HEADS:", self.kv_cache_specs['num_kv_heads'])
print("\t\tHEAD DIM:", self.kv_cache_specs['head_dim'])
```
(A suggested change on this hunk deletes these leftover debug print statements.)
rip lol
Description
Upgrades vllm to 0.11.1, adding backwards-compatibility code where necessary.

This PR:

There was one really fun change here where the type of sampled_token_ids changed, but was then changed back for 0.12.0 (see the sketch below).

TODO: maybe we should get a new fms-mo release first
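As a hedged illustration of what such a back-and-forth type change forces on compatibility code (the assumption that the two types involved are a nested Python list and a torch.Tensor is mine, not stated in this PR):

```python
import torch

def as_token_id_lists(sampled_token_ids) -> list[list[int]]:
    """Normalize sampled_token_ids across vLLM releases.

    Assumes the value is either a list[list[int]] or a torch.Tensor,
    depending on the installed version; both reduce to nested lists.
    """
    if isinstance(sampled_token_ids, torch.Tensor):
        return sampled_token_ids.tolist()
    return sampled_token_ids
```

Call sites can then consume nested lists regardless of which vLLM release produced the output.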