Proposal to improve performance
The current execution flow with prefix caching is as follows (a sketch follows the list):

1. Scheduler takes the next prefill sequence:
   a. Calculate how many blocks it needs.
   b. Check whether the block manager has a sufficient number of free blocks.
   c. If so, determine the number of tokens to prefill in this batch (equal to the prompt length without chunked prefill, or at most the chunk size with it).
   d. Update the batch token budget by subtracting the tokens to be prefilled.
   e. Allocate all blocks, regardless of how many tokens will be prefilled in this batch.
   f. Match the allocated block IDs against the prefix cache, and list the hits in `computed_block_nums`.
2. Prepare input:
   a. Get the number of tokens to prefill for this sequence in this batch.
   b. Set up input token IDs and positions.
   c. If `computed_block_nums` is not None, remove the cached tokens from the input tokens, and adjust the input positions, query length, and context length accordingly.
3. Execute the model.
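For concreteness, here is a minimal, self-contained sketch of that flow. Only `can_allocate`, `allocate`, and `computed_block_nums` are names taken from the description above; the data structures and helpers are illustrative stand-ins, not vLLM's actual scheduler code:

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block (assumed for illustration)

def blocks_needed(seq) -> int:
    return -(-seq.prompt_len // BLOCK_SIZE)  # ceiling division

@dataclass
class Sequence:
    prompt_len: int
    computed_block_nums: list = field(default_factory=list)

@dataclass
class BlockManager:
    free_blocks: int
    cached_block_ids: set = field(default_factory=set)

    def can_allocate(self, seq) -> bool:
        # Step 1.b today: demands room for ALL prompt blocks,
        # including the ones already in the prefix cache.
        return self.free_blocks >= blocks_needed(seq)

    def allocate(self, seq) -> list:
        # Step 1.e: allocate every block up front.
        n = blocks_needed(seq)
        self.free_blocks -= n
        return list(range(n))  # stand-in block IDs

def schedule_prefill(seq, bm, token_budget, chunk_size=None):
    if not bm.can_allocate(seq):                     # Step 1.b
        return None, token_budget
    new_tokens = seq.prompt_len                      # Step 1.c
    if chunk_size is not None:
        new_tokens = min(new_tokens, chunk_size)     # chunked prefill cap
    token_budget -= new_tokens                       # Step 1.d (cache ignored)
    block_ids = bm.allocate(seq)                     # Step 1.e
    seq.computed_block_nums = [b for b in block_ids  # Step 1.f
                               if b in bm.cached_block_ids]
    return new_tokens, token_budget

# A 16-block prompt with 3 blocks already cached: the budget is still
# charged for all 256 tokens, and all 16 blocks must be free.
bm = BlockManager(free_blocks=16, cached_block_ids={0, 1, 2})
seq = Sequence(prompt_len=256)
new_tokens, budget = schedule_prefill(seq, bm, token_budget=2048)
print(new_tokens, budget, seq.computed_block_nums)  # 256 1792 [0, 1, 2]
```

Note how the cache lookup (Step 1.f) only happens after both the block check and the budget charge, which is exactly where the inefficiencies below come from.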
The inefficiencies are then (a worked calculation follows this list):

- In Step 1.b, prefix caching is not considered. Take a sequence whose prompt spans 16 blocks as an example: the block manager currently must have 16 free blocks for it to be scheduled. However, if 12 of the 16 blocks are already cached, we actually only need 4 free blocks to schedule this sequence.
- In Step 1.d, prefix caching is not considered either. Suppose the batched-token budget is set to 2048, and we schedule 2 sequences with 1024 tokens each. If the first 512 prefix tokens of each sequence are already cached, then the actual batch size is 1024 instead of 2048.
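Putting numbers on both bullets (a small illustrative calculation, not vLLM code):

```python
# Inefficiency 1: the free-block check ignores the prefix cache.
prompt_blocks, cached_blocks = 16, 12
blocks_required_today = prompt_blocks                # 16 free blocks demanded
blocks_truly_needed = prompt_blocks - cached_blocks  # only 4 new blocks are written

# Inefficiency 2: the token budget ignores the prefix cache.
token_budget, num_seqs, prompt_tokens, cached_tokens = 2048, 2, 1024, 512
tokens_charged_today = num_seqs * prompt_tokens      # 2048: budget fully consumed
tokens_actually_computed = num_seqs * (prompt_tokens - cached_tokens)  # 1024: half the budget sits idle
```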
The above inefficiencies come from the fact that we only know which blocks are cached starting from Step 1.f. Thus, we propose the following changes (sketched after this list):

- Improve `can_allocate` in the block manager at Step 1.b to consider prefix caching. For example, although a sequence needs 16 blocks for its prompt, `can_allocate` could still return True when the block manager has only 4 free blocks left, because 12 of the 16 blocks are already cached.
- Improve Step 1.c in the scheduler to consider prefix caching. Specifically, this step should guarantee that the new tokens to prefill are not already cached. If the entire prompt of a sequence is cached, we should compute only the last token.
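A minimal sketch of what the two changes could look like, assuming a block size of 16 tokens; the function signatures and parameter names here are hypothetical, not vLLM's actual API:

```python
import math

BLOCK_SIZE = 16  # tokens per block (assumed for illustration)

def can_allocate(free_blocks: int, prompt_len: int, num_cached_blocks: int) -> bool:
    # Proposed Step 1.b: only blocks NOT already in the prefix cache
    # need to be free in the block manager.
    blocks_needed = math.ceil(prompt_len / BLOCK_SIZE) - num_cached_blocks
    return free_blocks >= blocks_needed

def num_new_prefill_tokens(prompt_len: int, num_cached_tokens: int,
                           chunk_size: int | None = None) -> int:
    # Proposed Step 1.c: schedule (and charge the budget for) only
    # the uncached tokens.
    new_tokens = prompt_len - num_cached_tokens
    if new_tokens == 0:
        # Entire prompt cached: still compute the last prompt token so the
        # model produces logits for the first generated token.
        new_tokens = 1
    if chunk_size is not None:
        new_tokens = min(new_tokens, chunk_size)
    return new_tokens

# 12 of 16 prompt blocks cached: 4 free blocks now suffice.
assert can_allocate(free_blocks=4, prompt_len=256, num_cached_blocks=12)
# 512 of 1024 prompt tokens cached: only 512 tokens hit the batch budget.
assert num_new_prefill_tokens(prompt_len=1024, num_cached_tokens=512) == 512
```

The `new_tokens == 0` branch covers the fully-cached-prompt case from the second bullet: at least the last prompt token must be computed so the model emits output for the sequence.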
cc @rkooo567 @sighingnow @Juelianqvq @zhuohan123
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.