Conversation

@joerunde joerunde commented Oct 28, 2025

Description

Upgrades vllm to 0.11.1, adding backwards compatibility code where necessary.

This PR:

  • Updates the default vllm install to 0.11.1
  • Retains the lower bound of 0.10.2
  • Adds a new entry in the backwards compatibility tests to maintain test coverage of 0.11.0
  • Changes the uv.lock settings to install vllm from source instead of from cuda wheels
  • Bumps fms-mo to a dev version past 0.7.0 because 0.7.0 has a bug when running on cpu
  • Cleans out the GHA runner before image builds since we were running out of disk space

There was one really fun change here where the type of sampled_token_ids changed, but was then changed back for 0.12.0.
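For illustration, a minimal sketch of how that kind of flip can be absorbed in the model runner; the helper name and the exact pair of types are assumptions for this example, not code from this PR:

import torch

def as_token_id_lists(sampled_token_ids) -> list[list[int]]:
    # Normalize whatever the installed vLLM hands back: some releases use a
    # plain list[list[int]], others a torch.Tensor of token ids.
    if isinstance(sampled_token_ids, torch.Tensor):
        return sampled_token_ids.tolist()
    return sampled_token_ids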

TODO: maybe we should get a new fms-mo release first

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
@joerunde joerunde added the "ready" label (Runs the full CI test suite. Only add to PRs once ready to merge to limit public GHA usage) Oct 28, 2025
@joerunde joerunde changed the title from "✨ vllm main support for upcoming 0.111.1 release" to "✨ vllm main support for upcoming 0.11.1 release" Nov 4, 2025
@joerunde joerunde requested a review from ckadner as a code owner December 4, 2025 20:29
@joerunde joerunde removed the "ready" label (Runs the full CI test suite. Only add to PRs once ready to merge to limit public GHA usage) Dec 4, 2025
Signed-off-by: Joe Runde <[email protected]>
]

[tool.uv.sources]
vllm = { git = "https://github.com/vllm-project/vllm", rev = "v0.11.1" }
Collaborator Author

Installing vllm this way (with VLLM_TARGET_DEVICE=empty) leaves out extra cuda-only dependencies from the uv.lock, since the published vllm wheels on pypi are only built for cuda.

@joerunde joerunde changed the title from "✨ vllm main support for upcoming 0.11.1 release" to "✨ vllm support for 0.11.1 release" Dec 5, 2025
Collaborator

@tjohnson31415 tjohnson31415 left a comment

Somehow you make backwards compatibility elegant

Comment on lines 111 to 115
extra_args = {}
if "structured_output_request_ids" in dataclass_fields(SchedulerOutput):
extra_args["structured_output_request_ids"] = {}
if "grammar_bitmask" in dataclass_fields(SchedulerOutput):
extra_args["grammar_bitmask"] = None
Collaborator

It looks like we could just import and use _get_extra_args() from the spyre_worker to reduce code duplication.

Collaborator Author

private imports!!

but yeah, for a test file that's probably fine
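For reference, the pattern in both places boils down to filtering a dict of defaults against the fields the installed vLLM actually defines. A standalone sketch (get_extra_args is an illustrative stand-in, not the private helper from spyre_worker):

from dataclasses import fields as dataclass_fields
from typing import Any

def get_extra_args(cls: type, defaults: dict[str, Any]) -> dict[str, Any]:
    # Keep only the defaults whose names exist as fields on this vLLM
    # version's dataclass, so older and newer releases both construct cleanly.
    names = {f.name for f in dataclass_fields(cls)}
    return {k: v for k, v in defaults.items() if k in names}

# e.g. extra_args = get_extra_args(SchedulerOutput,
#          {"structured_output_request_ids": {}, "grammar_bitmask": None})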

) -> None:
"""Raises if this request is unsupported on this platform"""

# TODO: fix
Collaborator

Is this a TODO for this PR to fix before merging?

Collaborator Author

oh- maybe 🤔

I think I put the TODO in because the lazy import was suuuper ugly, but I do think the import has to stay lazy or we'll hit a circular import :(. The TODO here might be to just remove the TODO and replace it with a comment about why this is the way it is.
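For context, the shape of the pattern being discussed is a function-level import; the module and helper names below are placeholders, not the real ones in this repo:

def validate_request(self, prompt, params) -> None:
    """Raises if this request is unsupported on this platform"""
    # Deferred on purpose: importing this at module scope would complete a
    # circular import chain, so the import only runs when the method is called.
    from vllm_spyre import request_checks  # placeholder module name

    request_checks.assert_supported(prompt, params)  # placeholder helper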

@tjohnson31415
Collaborator

TODO: There is still a problem with running quantized models. I'm not sure what's going on there, as neither the torch version nor modeling code changed, but we're getting an error from torch

This PR bumps fms-model-optimizer to 0.7.0 in uv.lock. I confirmed the quantized model tests fail after upgrading 0.6.0 -> 0.7.0. Installing fms-mo from main resolved the torch error in my dev pod.

@joerunde
Collaborator Author

joerunde commented Dec 6, 2025

Alright @tjohnson31415, looks like we are 🟢 for now. Thanks for the fms-mo hint. I validated that fms-mo 0.7.0 still works on spyre and it's just the cpu execution that's broken. I've bumped here to the latest main commit, which also appears to work fine on spyre.

Let's talk on Monday - maybe we should get a new official fms-mo release instead of pinning a commit. I'm also not entirely sure, with our current release cadence, whether we'd want to bump the actual vllm install to 0.11.1 or flip this around: add a compatibility test for 0.11.1 and keep the uv.lock at 0.11.0. Either way, we should get the currently-good set of spyre unit tests run on this before merging.

Collaborator

@yannicks1 yannicks1 left a comment

lgtm!

Comment on lines -621 to +630
-cached_request_data = CachedRequestData(
-    req_ids=req_ids,
-    resumed_from_preemption=False,
-    new_token_ids=new_token_ids,
-    new_block_ids=new_block_ids,
-    num_computed_tokens=num_computed_tokens,
-)
+cached_request_data = CachedRequestData.make_empty()
+cached_request_data.req_ids = req_ids
+cached_request_data.new_block_ids = new_block_ids
+cached_request_data.new_token_ids = new_token_ids
+cached_request_data.num_computed_tokens = num_computed_tokens
Collaborator

just for my understanding: what is the motivation for this change?

Collaborator Author

More fields were added to this dataclass, so normally we would have to do the checks on dataclass_fields to inject empty kwargs into the initializer call here. But this class offers a make_empty() that initializes everything with default values, so we use that instead and then only set the values we care about. That way we don't have any backwards compatibility cleanup to worry about later.

@yannicks1
Collaborator

we have one last error to fix for CP:
TypeError: vllm.v1.core.sched.output.SchedulerOutput() got multiple values for keyword argument 'free_encoder_mm_hashes'
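That TypeError means the same keyword reached SchedulerOutput() twice, once explicitly and once through the version-dependent extras dict. A minimal sketch of one way to guard against that (merged_kwargs is a hypothetical helper, not the fix that was actually pushed):

from dataclasses import fields as dataclass_fields
from typing import Any

def merged_kwargs(cls: type, explicit: dict[str, Any],
                  optional_defaults: dict[str, Any]) -> dict[str, Any]:
    # Only pass a default if the field exists on this vLLM's dataclass and
    # isn't already being passed explicitly; otherwise the constructor raises
    # "got multiple values for keyword argument ...".
    names = {f.name for f in dataclass_fields(cls)}
    extras = {k: v for k, v in optional_defaults.items()
              if k in names and k not in explicit}
    return {**explicit, **extras}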

Signed-off-by: Yannick Schnider <[email protected]>
@yannicks1
Collaborator

@joerunde I fixed the failing test. Hope you don't mind that I pushed to your branch, but I thought it saves us some GHA time and you can hit merge as soon as you wake up :)

@joerunde
Collaborator Author

joerunde commented Dec 8, 2025

Thanks @yannicks1!

@joerunde
Collaborator Author

joerunde commented Dec 8, 2025

bot:test
MARKERS="spyre and cb and not multi"

1 similar comment
@joerunde
Collaborator Author

joerunde commented Dec 8, 2025

bot:test
MARKERS="spyre and cb and not multi"

Collaborator

@tjohnson31415 tjohnson31415 left a comment

LGTM. Yeah let's chat about the vLLM pin and the fms-mo release

@joerunde
Collaborator Author

joerunde commented Dec 8, 2025

bot:test
MARKERS="spyre and chunked_prefill and not multi"

Collaborator

@maxdebayser maxdebayser left a comment

LGTM. We can merge once the tests pass.

@joerunde
Collaborator Author

joerunde commented Dec 9, 2025

bot:test
MARKERS="spyre and cb and not multi"

@joerunde
Collaborator Author

joerunde commented Dec 9, 2025

bot:test
MARKERS="spyre and cb and not multi"

1 similar comment
@joerunde
Collaborator Author

joerunde commented Dec 9, 2025

bot:test
MARKERS="spyre and cb and not multi"

@joerunde
Collaborator Author

joerunde commented Dec 9, 2025

The continuous batching tests passed on our bot test, and I was able to get the chunked prefill tests working on a dev pod. (The graph comparison tests still fail on chunked prefill with an old version of aftu on bot:test runs 😢 )

Test results
$ pytest tests -m "spyre and not quantized and not multi and chunked_prefill" -x -v --forked -k "not aftu"
===================================================================== test session starts =====================================================================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.6.0 -- /home/senuser/repo/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/senuser/repo
configfile: pyproject.toml
plugins: forked-1.6.0, timeout-2.3.1, mock-3.15.1, asyncio-1.2.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 811 items / 784 deselected / 27 selected                                                                                                            

tests/e2e/test_logits_processors.py::test_custom_logits_processor[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp] PASSED [  3%]
tests/e2e/test_spyre_async_llm.py::test_abort[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-max_num_batched_tokens(128)-cp-RequestOutputKind.DELTA] PASSED [  7%]
tests/e2e/test_spyre_async_llm.py::test_abort[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-max_num_batched_tokens(128)-cp-RequestOutputKind.FINAL_ONLY] PASSED [ 11%]
tests/e2e/test_spyre_basic.py::test_max_model_len_override[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-cp] PASSED     [ 14%]
tests/e2e/test_spyre_cp_scheduler_steps.py::test_prefill_tkv_too_big[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-None-128-128-2] PASSED            [ 18%]
tests/e2e/test_spyre_cp_scheduler_steps.py::test_requests_exceed_batch_tkv_limit[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-None-128-128-2] PASSED [ 22%]
tests/e2e/test_spyre_cp_scheduler_steps.py::test_prefill_use_more_than_available_blocks[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-2-128-128-2] PASSED [ 25%]
tests/e2e/test_spyre_cp_scheduler_steps.py::test_single_cp_prefill[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-None-128-514-2] SKIPPED (sendnn...) [ 29%]
tests/e2e/test_spyre_cp_scheduler_steps.py::test_single_cp_prefill[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-None-128-1024-2] PASSED             [ 33%]
tests/e2e/test_spyre_cp_scheduler_steps.py::test_cp_prefill_interleave1[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-None-128-2048-2] PASSED        [ 37%]
tests/e2e/test_spyre_cp_scheduler_steps.py::test_cp_prefill_no_interleave[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-None-128-2048-2] PASSED      [ 40%]
tests/e2e/test_spyre_cp_scheduler_steps.py::test_cp_prefill_interleave2[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-None-128-2048-2] PASSED        [ 44%]
tests/e2e/test_chunked_prefill.py::test_chunked_prefill_correctness[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-max_model_len(512)-max_num_seqs(4)-case_Ia] PASSED [ 48%]
tests/e2e/test_chunked_prefill.py::test_chunked_prefill_correctness[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-max_model_len(512)-max_num_seqs(4)-case_Ib] PASSED [ 51%]
tests/e2e/test_chunked_prefill.py::test_chunked_prefill_correctness[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-max_model_len(512)-max_num_seqs(4)-case_II] PASSED [ 55%]
tests/e2e/test_chunked_prefill.py::test_chunked_prefill_correctness[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-max_model_len(512)-max_num_seqs(4)-case_III] PASSED [ 59%]
tests/e2e/test_sampling_params.py::test_spyre_batch1_logit_bias[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp] PASSED [ 62%]
tests/e2e/test_sampling_params.py::test_spyre_batch1_min_tokens[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp] PASSED [ 66%]
tests/e2e/test_sampling_params.py::test_spyre_batch1_min_p[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp] PASSED [ 70%]
tests/e2e/test_spyre_basic.py::test_output[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp-TP(1)] PASSED [ 74%]
tests/e2e/test_spyre_basic.py::test_batch_handling[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp] PASSED [ 77%]
tests/e2e/test_spyre_max_new_tokens.py::test_output[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp-True] PASSED [ 81%]
tests/e2e/test_spyre_max_new_tokens.py::test_output[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp-False] PASSED [ 85%]
tests/e2e/test_spyre_seed.py::test_seed[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp-42-0.1] XFAIL [ 88%]
tests/e2e/test_spyre_seed.py::test_seed[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp-42-1.0] XFAIL [ 92%]
tests/e2e/test_spyre_stagger_basic.py::test_stagger_output[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp-TP(1)] PASSED [ 96%]
tests/e2e/test_spyre_online.py::test_openai_serving[ibm-ai-platform/micro-g3.3-8b-instruct-1b-sendnn-warmup_shapes([(64, 20, 4)])-max_model_len(512)-max_num_seqs(4)-cp-TP(1)] PASSED [100%]

===================================== 24 passed, 1 skipped, 784 deselected, 2 xfailed, 60 warnings in 1214.60s (0:20:14) ======================================

@joerunde joerunde merged commit e834cc7 into main Dec 9, 2025
22 checks passed
@joerunde joerunde deleted the 0.11.1-support branch December 9, 2025 18:42
Comment on lines +366 to +370

print("\n\n\n\n\t\tNUM BLOCKS:", num_blocks)
print("\t\tBLOCK SIZE:", self.kv_cache_specs['block_size'])
print("\t\tNUM KV HEADS:", self.kv_cache_specs['num_kv_heads'])
print("\t\tHEAD DIM:", self.kv_cache_specs['head_dim'])
Collaborator

Suggested change (remove the leftover debug prints):
-print("\n\n\n\n\t\tNUM BLOCKS:", num_blocks)
-print("\t\tBLOCK SIZE:", self.kv_cache_specs['block_size'])
-print("\t\tNUM KV HEADS:", self.kv_cache_specs['num_kv_heads'])
-print("\t\tHEAD DIM:", self.kv_cache_specs['head_dim'])

Collaborator Author

rip lol
