Skip to content

Conversation

@maxdebayser
Collaborator

Description

Note: this PR is stacked on #585

This implementation uses the KV Cache Manager to find matching prefixes that satisfy our chunking constraints. Since in the initial version we're going for scheduler-agnostic caching, the execute_model() function just executes dummy steps if the chunk was loaded from cache. This is required because the scheduler controls num_computed_tokens.
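
For illustration, a minimal sketch of that flow; the names CachedChunkInfo, run_model and dummy_step_output are hypothetical and not part of this PR:

# Hypothetical sketch of the scheduler-agnostic idea above: if a chunk's KV
# blocks were loaded from the prefix cache, we still have to report a step
# result (the scheduler owns num_computed_tokens), but we skip the forward pass.
from dataclasses import dataclass


@dataclass
class CachedChunkInfo:
    request_id: str
    loaded_from_cache: bool  # True if this chunk's blocks came from the cache


def execute_model_step(chunk: CachedChunkInfo, run_model, dummy_step_output):
    if chunk.loaded_from_cache:
        # Dummy step: report a result without running the model.
        return dummy_step_output(chunk.request_id)
    return run_model(chunk.request_id)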

This commit replaces the management of the req_ids2blocks mapping
and the direct management of the block pool with vLLM's
single-type KV cache manager.

Signed-off-by: Max de Bayser <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@maxdebayser changed the title Add prefix caching → [WIP]: Add prefix caching Nov 30, 2025
Now all tests pass with prefix caching disabled.

Signed-off-by: Max de Bayser <[email protected]>
- Disable prefix caching during warmup
- Don't try to cache chunks while loading from cache
  is still incomplete
- Set left_padding_mask in dummy output

Signed-off-by: Max de Bayser <[email protected]>
@maxdebayser
Collaborator Author

This PR is now in a state where it is running correctly in a very simple test case where I send the same long prompt twice (on CPU).

@maxdebayser
Collaborator Author

The next steps are:

  • test on the Spyre device
  • figure out what to do about the enable-prefix-caching setting, which is on by default in vLLM
  • write comprehensive tests

Collaborator

@yannicks1 left a comment


Looks very good! I put some comments. (I did not run any of the code yet.)

vllm_config: VllmConfig,
is_driver_worker: bool,
rank: int,
enable_prefix_caching: bool = False,
Collaborator


Seems a bit odd to introduce the enable_prefix_caching arg in the ContinuousBatchingSpyreModelRunner and not only in the ChunkedPrefillModelRunner, as CB cannot do prefix caching. But I understand you need it currently in _set_blocks().

A possible solution could be to have an additional arg in _set_blocks, e.g. _set_blocks(..., enable_caching=False), which defaults to False. You can then introduce enable_prefix_caching only for the ChunkedPrefillModelRunner and let the default value handle CB.
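
A rough sketch of that suggestion, with a simplified signature; the real _set_blocks() takes different arguments and _maybe_cache_full_blocks is a hypothetical helper:

# Illustrative only: the point is the enable_caching keyword defaulting to
# False, so the continuous batching runner never opts into caching while the
# chunked prefill runner can.
def _set_blocks(self, request_id: str, new_blocks: list[int],
                enable_caching: bool = False) -> None:
    # Record the blocks assigned to this request (actual bookkeeping elided).
    self.req_ids2blocks[request_id] = new_blocks
    if enable_caching:
        # Only the ChunkedPrefillModelRunner would pass enable_caching=True;
        # the CB runner relies on the False default. Hypothetical helper below.
        self._maybe_cache_full_blocks(request_id, new_blocks)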

Collaborator


I like this idea @yannicks1

Collaborator


But actually _set_blocks is called only in the ContinuousBatchingSpyreModelRunner.

Collaborator Author


Considering that this PR also disables prefix caching by default, and that the only real impact of enabling it while running with the continuous batching model runner is that the block pool will do a bit of [extra work](https://github.com/vllm-project/vllm/blob/e858bfe05167a3bbb064e283da5a1a7709dee24e/vllm/v1/core/block_pool.py#L326) when chunks are freed, I've removed the weird constructor argument.

computed_blocks = computed_blocks[:usable_blocks]
num_cached_tokens = usable_blocks * self.block_size

self.block_pool.touch((computed_blocks, ))
Collaborator


Is this also updating vllm's internal counter for the cache hit rate stats?

Collaborator


If not, we should also keep these stats up to date so that benchmark outputs can report complete cache hit rates.

Collaborator Author


The BlockPool keeps track of kv cache events if enabled. I didn't do it in the first pass in this PR, but we should definitely reuse this.
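
For illustration, a minimal sketch of the hit-rate bookkeeping being discussed; the PrefixCacheStats shape below mirrors vLLM's upstream counters, but the exact integration point is an assumption, not something this PR implements:

# Sketch only: counters mirroring vLLM's prefix-cache stats (queries/hits).
# Where exactly these get updated, here or by reusing the KV Cache Manager's
# own accounting, is still an open question at this point in the thread.
from dataclasses import dataclass


@dataclass
class PrefixCacheStats:
    queries: int = 0  # prompt tokens looked up in the prefix cache
    hits: int = 0     # prompt tokens served from the prefix cache


def record_prefix_lookup(stats: PrefixCacheStats, num_prompt_tokens: int,
                         num_cached_tokens: int) -> None:
    stats.queries += num_prompt_tokens
    stats.hits += num_cached_tokens


def hit_rate(stats: PrefixCacheStats) -> float:
    return stats.hits / stats.queries if stats.queries else 0.0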

yannicks1 and others added 5 commits December 8, 2025 15:30
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@maxdebayser
Collaborator Author

bot:test

Signed-off-by: Max de Bayser <[email protected]>
@maxdebayser
Collaborator Author

Caveat: in the current PR we also cache blocks that become full during decoding, but there could be significant numerical differences between blocks generated during prefill and blocks generated during decode.
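
If we wanted to restrict caching to prefill-produced blocks, a sketch could look like the following; the produced_during_prefill flag is hypothetical and not something this PR tracks:

# Hypothetical sketch: only hand full, prefill-produced blocks to the block
# pool for caching, so KV entries written during decode are never reused.
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    is_full: bool
    produced_during_prefill: bool  # hypothetical flag, not in this PR


def cacheable_blocks(blocks: list[Block]) -> list[Block]:
    return [b for b in blocks if b.is_full and b.produced_during_prefill]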

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@sducouedic
Collaborator

It looks great @maxdebayser, especially that everything reuses the upstream KV cache manager. I'll just give it a run after dinner before approving.

Comment on lines +2169 to +2170
self.kv_cache_manager.save_new_computed_blocks(
scheduler_request.request_id, computed_blocks)
Collaborator


I believe we can subtract the number of reserved blocks here using computed_blocks.
Just did an execution, and I noticed that allocate_new_blocks, called after _maybe_load_prefix_from_cache, takes into account the number of blocks assigned during save_new_computed_blocks.

But probably we want to do this in another PR and add step tests for that
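
A rough sketch of the adjustment being proposed, under the assumption that the runner keeps a per-request reserved-block count; the names here are illustrative, not the PR's exact fields:

# Sketch only: reduce a request's block reservation by the blocks already
# satisfied from the prefix cache. reserved_blocks is an assumed bookkeeping
# dict, not necessarily how the model runner stores it.
def adjust_reservation(reserved_blocks: dict[str, int], request_id: str,
                       num_computed_blocks: int) -> None:
    reserved_blocks[request_id] = max(
        0, reserved_blocks[request_id] - num_computed_blocks)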

Collaborator Author


Yes, I think you're right, it makes sense to subtract that. But actually @yannicks1 was also proposing to remove the reserved blocks tracking altogether because the volumetric constraint is a tighter constraint anyway.

Collaborator


True, I'm okay with that

Collaborator

@sducouedic left a comment


LGTM!!

@maxdebayser merged commit c56acd9 into main Dec 10, 2025
21 checks passed
@maxdebayser deleted the sched_agnostic_pc branch December 10, 2025 14:03
@maxdebayser changed the title [WIP]: Add prefix caching → Add prefix caching Dec 18, 2025