Proposal to improve performance
The current execution flow with prefix caching is as follows (a sketch follows the list):

1. Scheduler takes the next prefill sequence:
   a. Calculate how many blocks it needs.
   b. Check whether the block manager has a sufficient number of free blocks.
   c. If so, determine the number of tokens to prefill in this batch (equal to the prompt length without chunked prefill, or at most the chunk size with it).
   d. Update the batch token budget by subtracting the tokens to be prefilled.
   e. Allocate all blocks, regardless of how many tokens will be prefilled in this batch.
   f. Match the allocated block IDs against the prefix cache, and list the hits in `computed_block_nums`.
2. Prepare input:
   a. Get the number of tokens to prefill for this sequence in this batch.
   b. Set up input token IDs and positions.
   c. If `computed_block_nums` is not None, remove the cached tokens from the input tokens, and adjust the input positions, query length, and context length accordingly.
3. Execute the model.
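For concreteness, here is a minimal, self-contained sketch of that flow. Only `can_allocate`, `allocate`, and `computed_block_nums` are names taken from the description above; the data structures and helpers are illustrative stand-ins, not vLLM's actual scheduler code:

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block (assumed for illustration)

def blocks_needed(seq) -> int:
    return -(-seq.prompt_len // BLOCK_SIZE)  # ceiling division

@dataclass
class Sequence:
    prompt_len: int
    computed_block_nums: list = field(default_factory=list)

@dataclass
class BlockManager:
    free_blocks: int
    cached_block_ids: set = field(default_factory=set)

    def can_allocate(self, seq) -> bool:
        # Step 1.b today: demands room for ALL prompt blocks,
        # including the ones already in the prefix cache.
        return self.free_blocks >= blocks_needed(seq)

    def allocate(self, seq) -> list:
        # Step 1.e: allocate every block up front.
        n = blocks_needed(seq)
        self.free_blocks -= n
        return list(range(n))  # stand-in block IDs

def schedule_prefill(seq, bm, token_budget, chunk_size=None):
    if not bm.can_allocate(seq):                     # Step 1.b
        return None, token_budget
    new_tokens = seq.prompt_len                      # Step 1.c
    if chunk_size is not None:
        new_tokens = min(new_tokens, chunk_size)     # chunked prefill cap
    token_budget -= new_tokens                       # Step 1.d (cache ignored)
    block_ids = bm.allocate(seq)                     # Step 1.e
    seq.computed_block_nums = [b for b in block_ids  # Step 1.f
                               if b in bm.cached_block_ids]
    return new_tokens, token_budget

# A 16-block prompt with 3 blocks already cached: the budget is still
# charged for all 256 tokens, and all 16 blocks must be free.
bm = BlockManager(free_blocks=16, cached_block_ids={0, 1, 2})
seq = Sequence(prompt_len=256)
new_tokens, budget = schedule_prefill(seq, bm, token_budget=2048)
print(new_tokens, budget, seq.computed_block_nums)  # 256 1792 [0, 1, 2]
```

Note how the cache lookup (Step 1.f) only happens after both the block check and the budget charge, which is exactly where the inefficiencies below come from.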
The inefficiencies are then (a worked calculation follows this list):

- In Step 1.b, prefix caching is not considered. Take a sequence whose prompt spans 16 blocks as an example: the block manager currently must have 16 free blocks for it to be scheduled. However, if 12 of the 16 blocks are already cached, we actually only need 4 free blocks to schedule this sequence.
- In Step 1.d, prefix caching is not considered either. Suppose the batched-token budget is set to 2048, and we schedule 2 sequences with 1024 tokens each. If the first 512 prefix tokens of each sequence are already cached, then the actual batch size is 1024 instead of 2048.
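Putting numbers on both bullets (a small illustrative calculation, not vLLM code):

```python
# Inefficiency 1: the free-block check ignores the prefix cache.
prompt_blocks, cached_blocks = 16, 12
blocks_required_today = prompt_blocks                # 16 free blocks demanded
blocks_truly_needed = prompt_blocks - cached_blocks  # only 4 new blocks are written

# Inefficiency 2: the token budget ignores the prefix cache.
token_budget, num_seqs, prompt_tokens, cached_tokens = 2048, 2, 1024, 512
tokens_charged_today = num_seqs * prompt_tokens      # 2048: budget fully consumed
tokens_actually_computed = num_seqs * (prompt_tokens - cached_tokens)  # 1024: half the budget sits idle
```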
The above inefficiencies come from the fact that we only know which blocks are cached starting from Step 1.f. Thus, we propose the following changes (sketched after this list):

- Improve `can_allocate` in the block manager at Step 1.b to consider prefix caching. For example, although a sequence needs 16 blocks for its prompt, `can_allocate` could still return True when the block manager has only 4 free blocks left, because 12 of the 16 blocks are already cached.
- Improve Step 1.c in the scheduler to consider prefix caching. Specifically, this step should guarantee that the new tokens to prefill are not already cached. If the entire prompt of a sequence is cached, we should compute only the last token.
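A minimal sketch of what the two changes could look like, assuming a block size of 16 tokens; the function signatures and parameter names here are hypothetical, not vLLM's actual API:

```python
import math

BLOCK_SIZE = 16  # tokens per block (assumed for illustration)

def can_allocate(free_blocks: int, prompt_len: int, num_cached_blocks: int) -> bool:
    # Proposed Step 1.b: only blocks NOT already in the prefix cache
    # need to be free in the block manager.
    blocks_needed = math.ceil(prompt_len / BLOCK_SIZE) - num_cached_blocks
    return free_blocks >= blocks_needed

def num_new_prefill_tokens(prompt_len: int, num_cached_tokens: int,
                           chunk_size: int | None = None) -> int:
    # Proposed Step 1.c: schedule (and charge the budget for) only
    # the uncached tokens.
    new_tokens = prompt_len - num_cached_tokens
    if new_tokens == 0:
        # Entire prompt cached: still compute the last prompt token so the
        # model produces logits for the first generated token.
        new_tokens = 1
    if chunk_size is not None:
        new_tokens = min(new_tokens, chunk_size)
    return new_tokens

# 12 of 16 prompt blocks cached: 4 free blocks now suffice.
assert can_allocate(free_blocks=4, prompt_len=256, num_cached_blocks=12)
# 512 of 1024 prompt tokens cached: only 512 tokens hit the batch budget.
assert num_new_prefill_tokens(prompt_len=1024, num_cached_tokens=512) == 512
```

The `new_tokens == 0` branch covers the fully-cached-prompt case from the second bullet: at least the last prompt token must be computed so the model emits output for the sequence.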
cc @rkooo567 @sighingnow @Juelianqvq @zhuohan123
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.