Add prefix caching #586
Conversation
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
This commit replaces the management of the req_ids2blocks mapping and the direct management of the block pool with vLLM's single-type KV cache manager. Signed-off-by: Max de Bayser <[email protected]>
This implementation uses the KV Cache Manager to find matching prefixes that satisfy our chunking constraints. Since in the initial version we're going for scheduler-agnostic caching, the execute_model() function just executes dummy steps if the chunk was loaded from cache. This is required because the scheduler controls num_computed_tokens. Signed-off-by: Max de Bayser <[email protected]>
Now all tests pass with prefix caching disabled. Signed-off-by: Max de Bayser <[email protected]>
- Disable prefix caching during warmup
- Don't try to cache chunks while loading from cache is still incomplete
- Set left_padding_mask in dummy output

Signed-off-by: Max de Bayser <[email protected]>
This PR is now in a state where it is running correctly in a very simple test case where I send the same long prompt twice (on CPU).
The next steps are:
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
yannicks1
left a comment
Looks very good! I put some comments. (I did not run any of the code yet.)
```python
vllm_config: VllmConfig,
is_driver_worker: bool,
rank: int,
enable_prefix_caching: bool = False,
```
It seems a bit odd to introduce the enable_prefix_caching arg in the ContinuousBatchingSpyreModelRunner rather than only in the ChunkedPrefillModelRunner, since CB cannot do prefix caching. But I understand you currently need it in _set_blocks().
A possible solution could be to have an additional arg in _set_blocks e.g. _set_blocks(..., enable_caching=False) which defaults to False. You can then introduce enable_prefix_caching only for the ChunkedPrefillModelRunner and let the default value handle CB.
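The suggestion above could be sketched roughly as follows. The class and method names come from the discussion, but the bodies and signatures are illustrative assumptions, not the PR's actual code:

```python
# Hedged sketch: _set_blocks gains a default-False enable_caching arg, so the
# continuous batching runner needs no constructor flag; only the
# chunked-prefill runner forwards its own config flag.

class ContinuousBatchingSpyreModelRunner:
    def __init__(self):
        self.cached_requests = []

    def _set_blocks(self, request_id, enable_caching=False):
        # Default False: continuous batching never caches prefixes.
        if enable_caching:
            self.cached_requests.append(request_id)


class ChunkedPrefillModelRunner(ContinuousBatchingSpyreModelRunner):
    def __init__(self, enable_prefix_caching=False):
        super().__init__()
        self.enable_prefix_caching = enable_prefix_caching

    def _set_blocks(self, request_id, enable_caching=False):
        # Only the chunked-prefill runner forwards its config flag.
        super()._set_blocks(request_id,
                            enable_caching=self.enable_prefix_caching)
```

This way the default value handles CB, and enable_prefix_caching never appears in the base runner's constructor.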
I like this idea @yannicks1
But actually, _set_blocks is called only in ContinuousBatchingSpyreModelRunner.
Considering that this PR also disables prefix caching by default, and that the only real impact of enabling it with the continuous batching model runner is that the block pool will do a bit of [extra work](https://github.com/vllm-project/vllm/blob/e858bfe05167a3bbb064e283da5a1a7709dee24e/vllm/v1/core/block_pool.py#L326) when chunks are freed, I've removed the odd constructor argument.
```python
computed_blocks = computed_blocks[:usable_blocks]
num_cached_tokens = usable_blocks * self.block_size

self.block_pool.touch((computed_blocks, ))
```
Is this also updating vLLM's internal counter for the cache hit rate stats?
If not, we should also keep these stats up to date so that benchmark outputs can report complete cache hit rates.
The BlockPool keeps track of KV cache events if enabled. I didn't do it in the first pass of this PR, but we should definitely reuse this.
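To illustrate the stats discussion above, here is a minimal sketch of hit-rate bookkeeping around the prefix lookup. The counter fields mirror the shape of vLLM's upstream prefix-cache stats, but wiring them in at this point, and the helper name, are assumptions:

```python
# Hedged sketch: track per-request prefix-cache queries and hits so that
# benchmark outputs can report a hit rate. Not the PR's actual code.
from dataclasses import dataclass


@dataclass
class PrefixCacheStats:
    requests: int = 0
    queries: int = 0   # prompt tokens looked up in the cache
    hits: int = 0      # tokens served from cached blocks

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0


def record_prefix_lookup(stats, num_prompt_tokens, num_cached_tokens):
    # Hypothetical helper, called once per request after computed_blocks
    # (and hence num_cached_tokens) are resolved.
    stats.requests += 1
    stats.queries += num_prompt_tokens
    stats.hits += num_cached_tokens
```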
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
…e into sched_agnostic_pc Signed-off-by: Max de Bayser <[email protected]>
bot:test
Signed-off-by: Max de Bayser <[email protected]>
Caveat: in this current PR we also cache blocks that become full during decoding. But there could be significant numerical differences between blocks generated during prefill and decode.
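One way to address this caveat would be to restrict caching to blocks filled entirely from the prompt. The helper below is an illustrative assumption, not code from the PR:

```python
# Hedged sketch: only blocks whose tokens all come from the prompt (prefill)
# are offered to the prefix cache; blocks containing decoded tokens may
# differ numerically between prefill and decode and are skipped.

def blocks_to_cache(full_blocks, num_prompt_tokens, block_size):
    # Number of blocks completely covered by prompt tokens.
    num_prefill_blocks = num_prompt_tokens // block_size
    return full_blocks[:num_prefill_blocks]
```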
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
It looks great @maxdebayser, especially that everything reuses the upstream KV cache manager. I'll just give it a run after dinner before approving.
```python
self.kv_cache_manager.save_new_computed_blocks(
    scheduler_request.request_id, computed_blocks)
```
I believe we can subtract the number of reserved blocks here using computed_blocks.
Just did an execution, and I noticed that allocate_new_blocks, called after _maybe_load_prefix_from_cache, takes into account the number of blocks assigned during save_new_computed_blocks.
But we probably want to do this in another PR and add step tests for it.
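The subtraction being proposed could look roughly like this. The function and variable names are assumptions for illustration, not the PR's actual code:

```python
# Hedged sketch: blocks already found in the prefix cache (computed_blocks)
# need no new reservation, so they are subtracted from the total.
import math


def blocks_to_reserve(num_prompt_tokens, block_size, computed_blocks):
    # Total blocks the prompt needs, rounded up to whole blocks.
    total_blocks = math.ceil(num_prompt_tokens / block_size)
    return total_blocks - len(computed_blocks)
```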
Yes, I think you're right, it makes sense to subtract that. But actually @yannicks1 was also proposing to remove the reserved-blocks tracking altogether, because the volumetric constraint is a tighter constraint anyway.
True, I'm okay with that
Signed-off-by: Max de Bayser <[email protected]>
sducouedic
left a comment
LGTM!!
Description
Note: this PR is stacked on #585
This implementation uses the KV Cache Manager to find matching prefixes that satisfy our chunking constraints. Since in the initial version we're going for scheduler-agnostic caching, the execute_model() function just executes dummy steps if the chunk was loaded from cache. This is required because the scheduler controls num_computed_tokens.
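The dummy-step idea described above can be sketched as follows. The function shape and dict fields are illustrative assumptions; the real runner operates on vLLM's request and model-input structures:

```python
# Hedged sketch of scheduler-agnostic prefix caching: when a chunk was
# already loaded from the cache, execute_model() performs a dummy step
# instead of running the model, because the scheduler (not the worker)
# owns num_computed_tokens and will advance it either way.

def execute_model(chunk, model_fn):
    if chunk.get("loaded_from_cache"):
        # Dummy step: skip the forward pass; the scheduler still counts
        # these tokens as computed.
        return {"dummy": True, "tokens_processed": chunk["num_tokens"]}
    # Normal path: actually run the model on this chunk.
    return {"dummy": False, "tokens_processed": model_fn(chunk)}
```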