Add prefix caching #586
Conversation
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
This commit replaces the management of the req_ids2blocks mapping and the direct management of the block pool with vLLM's single-type KV cache manager. Signed-off-by: Max de Bayser <[email protected]>
This implementation uses the KV Cache Manager to find matching prefixes that satisfy our chunking constraints. Since in the initial version we're going for scheduler-agnostic caching, the execute_model() function just executes dummy steps if the chunk was loaded from cache. This is required because the scheduler controls num_computed_tokens. Signed-off-by: Max de Bayser <[email protected]>
Now all tests pass with prefix caching disabled. Signed-off-by: Max de Bayser <[email protected]>
- Disable prefix caching during warmup
- Don't try to cache chunks while loading from cache is still incomplete
- Set left_padding_mask in dummy output

Signed-off-by: Max de Bayser <[email protected]>
This PR is now in a state where it is running correctly in a very simple test case where I send the same long prompt twice (on CPU).
The next steps are:
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
yannicks1
left a comment
Looks very good! I put some comments. (I did not run any of the code yet.)
```python
vllm_config: VllmConfig,
is_driver_worker: bool,
rank: int,
enable_prefix_caching: bool = False,
```
It seems a bit odd to introduce the enable_prefix_caching arg in the ContinuousBatchingSpyreModelRunner rather than only in the ChunkedPrefillModelRunner, since CB cannot do prefix caching. But I understand you currently need it in _set_blocks().
A possible solution could be to have an additional arg in _set_blocks e.g. _set_blocks(..., enable_caching=False) which defaults to False. You can then introduce enable_prefix_caching only for the ChunkedPrefillModelRunner and let the default value handle CB.
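The suggestion above could be sketched roughly as follows. The class and method names come from the discussion, but the bodies and signatures are illustrative assumptions, not the PR's actual code:

```python
# Hedged sketch: _set_blocks gains a default-False enable_caching arg, so the
# continuous batching runner needs no constructor flag; only the
# chunked-prefill runner forwards its own config flag.

class ContinuousBatchingSpyreModelRunner:
    def __init__(self):
        self.cached_requests = []

    def _set_blocks(self, request_id, enable_caching=False):
        # Default False: continuous batching never caches prefixes.
        if enable_caching:
            self.cached_requests.append(request_id)


class ChunkedPrefillModelRunner(ContinuousBatchingSpyreModelRunner):
    def __init__(self, enable_prefix_caching=False):
        super().__init__()
        self.enable_prefix_caching = enable_prefix_caching

    def _set_blocks(self, request_id, enable_caching=False):
        # Only the chunked-prefill runner forwards its config flag.
        super()._set_blocks(request_id,
                            enable_caching=self.enable_prefix_caching)
```

This way the default value handles CB, and enable_prefix_caching never appears in the base runner's constructor.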
I like this idea @yannicks1
But actually, _set_blocks is called only in ContinuousBatchingSpyreModelRunner.
Considering that this PR also disables prefix caching by default, and that the only real impact of enabling it with the continuous batching model runner is that the block pool will do a bit of [extra work](https://github.com/vllm-project/vllm/blob/e858bfe05167a3bbb064e283da5a1a7709dee24e/vllm/v1/core/block_pool.py#L326) when chunks are freed, I've removed the odd constructor argument.
```python
computed_blocks = computed_blocks[:usable_blocks]
num_cached_tokens = usable_blocks * self.block_size

self.block_pool.touch((computed_blocks, ))
```
Is this also updating vLLM's internal counter for the cache hit rate stats?
If not, we should also keep these stats up to date so that benchmark outputs can report complete cache hit rates.
The BlockPool keeps track of KV cache events if enabled. I didn't do it in the first pass of this PR, but we should definitely reuse this.
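To illustrate the stats discussion above, here is a minimal sketch of hit-rate bookkeeping around the prefix lookup. The counter fields mirror the shape of vLLM's upstream prefix-cache stats, but wiring them in at this point, and the helper name, are assumptions:

```python
# Hedged sketch: track per-request prefix-cache queries and hits so that
# benchmark outputs can report a hit rate. Not the PR's actual code.
from dataclasses import dataclass


@dataclass
class PrefixCacheStats:
    requests: int = 0
    queries: int = 0   # prompt tokens looked up in the cache
    hits: int = 0      # tokens served from cached blocks

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0


def record_prefix_lookup(stats, num_prompt_tokens, num_cached_tokens):
    # Hypothetical helper, called once per request after computed_blocks
    # (and hence num_cached_tokens) are resolved.
    stats.requests += 1
    stats.queries += num_prompt_tokens
    stats.hits += num_cached_tokens
```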
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
…e into sched_agnostic_pc Signed-off-by: Max de Bayser <[email protected]>
bot:test
Signed-off-by: Max de Bayser <[email protected]>
Caveat: in this current PR we also cache blocks that become full during decoding. But there could be significant numerical differences between blocks generated during prefill and decode.
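One way to address this caveat would be to restrict caching to blocks filled entirely from the prompt. The helper below is an illustrative assumption, not code from the PR:

```python
# Hedged sketch: only blocks whose tokens all come from the prompt (prefill)
# are offered to the prefix cache; blocks containing decoded tokens may
# differ numerically between prefill and decode and are skipped.

def blocks_to_cache(full_blocks, num_prompt_tokens, block_size):
    # Number of blocks completely covered by prompt tokens.
    num_prefill_blocks = num_prompt_tokens // block_size
    return full_blocks[:num_prefill_blocks]
```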
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
It looks great @maxdebayser, especially that everything reuses the upstream KV cache manager. I'll just give it a run after dinner before approving.
```python
self.kv_cache_manager.save_new_computed_blocks(
    scheduler_request.request_id, computed_blocks)
```
I believe we can subtract the number of reserved blocks here using computed_blocks.
Just did an execution, and I noticed that allocate_new_blocks, called after _maybe_load_prefix_from_cache, takes into account the number of blocks assigned during save_new_computed_blocks.
But we probably want to do this in another PR and add step tests for it.
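The subtraction being proposed could look roughly like this. The function and variable names are assumptions for illustration, not the PR's actual code:

```python
# Hedged sketch: blocks already found in the prefix cache (computed_blocks)
# need no new reservation, so they are subtracted from the total.
import math


def blocks_to_reserve(num_prompt_tokens, block_size, computed_blocks):
    # Total blocks the prompt needs, rounded up to whole blocks.
    total_blocks = math.ceil(num_prompt_tokens / block_size)
    return total_blocks - len(computed_blocks)
```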
Yes, I think you're right, it makes sense to subtract that. But actually @yannicks1 was also proposing to remove the reserved-blocks tracking altogether, because the volumetric constraint is a tighter constraint anyway.
True, I'm okay with that
Signed-off-by: Max de Bayser <[email protected]>
sducouedic
left a comment
LGTM!!
Description
Note: this PR is stacked on #585
This implementation uses the KV Cache Manager to find matching prefixes that satisfy our chunking constraints. Since in the initial version we're going for scheduler-agnostic caching, the execute_model() function just executes dummy steps if the chunk was loaded from cache. This is required because the scheduler controls num_computed_tokens.
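The dummy-step idea described above can be sketched as follows. The function shape and dict fields are illustrative assumptions; the real runner operates on vLLM's request and model-input structures:

```python
# Hedged sketch of scheduler-agnostic prefix caching: when a chunk was
# already loaded from the cache, execute_model() performs a dummy step
# instead of running the model, because the scheduler (not the worker)
# owns num_computed_tokens and will advance it either way.

def execute_model(chunk, model_fn):
    if chunk.get("loaded_from_cache"):
        # Dummy step: skip the forward pass; the scheduler still counts
        # these tokens as computed.
        return {"dummy": True, "tokens_processed": chunk["num_tokens"]}
    # Normal path: actually run the model on this chunk.
    return {"dummy": False, "tokens_processed": model_fn(chunk)}
```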