
[Performance]: Prefix-caching aware scheduling #7883

@comaniac

Description

Proposal to improve performance

The current execution flow with prefix caching is as follows (sketched in code after the list):

  1. Scheduler takes the next prefill sequence:
    a. Calculate how many blocks it needs.
    b. Check whether the block manager has a sufficient number of free blocks.
    c. If so, determine the number of tokens to prefill in this batch (the full prompt length without chunked prefill, or at most the chunk size with it).
    d. Update the batch token budget by subtracting the tokens to be prefilled.
    e. Allocate all blocks for the prompt, regardless of how many tokens will be prefilled in this batch.
    f. Match the allocated block IDs against the prefix cache, and record the hits in computed_block_nums.
  2. Prepare input:
    a. Get the number of tokens to prefill for this sequence in this batch.
    b. Set up input token IDs and positions.
    c. If computed_block_nums is not None, remove the cached tokens from the input tokens, and adjust the input positions, query length, and context length accordingly.
  3. Execute the model.
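
For concreteness, here is a minimal sketch of steps 1 and 2, assuming a toy model where `free_blocks`, `cached_blocks`, and `token_budget` are plain integers; none of these names are vLLM's actual API.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

def schedule_prefill_current(prompt_len, free_blocks, cached_blocks,
                             token_budget, chunk_size=None):
    """Mirror of steps 1.a-1.f; returns None if the sequence cannot
    be scheduled, else (num_new_tokens, remaining_budget, computed)."""
    # 1.a: blocks needed for the full prompt (ceiling division)
    num_blocks = -(-prompt_len // BLOCK_SIZE)

    # 1.b: demands ALL blocks to be free -- cache hits are not known yet
    if free_blocks < num_blocks:
        return None

    # 1.c: tokens to prefill in this batch
    if chunk_size is None:
        num_new_tokens = prompt_len
    else:
        num_new_tokens = min(prompt_len, chunk_size)

    # 1.d: the budget is charged even for tokens that turn out cached
    token_budget -= num_new_tokens

    # 1.e + 1.f: allocate everything, then discover the cache hits
    computed_block_nums = list(range(cached_blocks))
    return num_new_tokens, token_budget, computed_block_nums

def prepare_input_current(prompt_token_ids, computed_block_nums):
    # 2.a-2.c: drop tokens covered by cached blocks from the input,
    # and shift positions and lengths accordingly
    num_cached_tokens = len(computed_block_nums) * BLOCK_SIZE
    input_ids = prompt_token_ids[num_cached_tokens:]
    input_positions = list(range(num_cached_tokens, len(prompt_token_ids)))
    query_len = len(input_ids)      # uncached tokens actually computed
    context_len = num_cached_tokens # tokens served from the prefix cache
    return input_ids, input_positions, query_len, context_len
```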

The inefficiencies are then:

  1. Step 1.b does not consider prefix caching. Take a sequence whose prompt spans 16 blocks as an example: the block manager must currently have 16 free blocks for it to be scheduled. However, if 12 of the 16 blocks are already cached, we actually only need 4 free blocks to schedule this sequence.
  2. Step 1.d does not consider prefix caching either. Suppose the batched-token budget is 2048 and we schedule 2 sequences with 1024 tokens each. If the first 512 prefix tokens of each sequence are already cached, the actual batch size is 1024 tokens instead of 2048. Both cases are worked out in the snippet below.
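
Working through both examples in code, with the numbers taken directly from the text above:

```python
# Inefficiency 1: a 16-block prompt with 12 blocks already cached.
blocks_required_today = 16       # Step 1.b insists on 16 free blocks
blocks_actually_needed = 16 - 12 # only the uncached tail needs free blocks
assert blocks_actually_needed == 4

# Inefficiency 2: a 2048-token budget, two 1024-token prompts,
# each with a 512-token cached prefix.
tokens_charged_today = 2 * 1024              # what Step 1.d subtracts
tokens_actually_computed = 2 * (1024 - 512)  # what the GPU really prefills
assert tokens_charged_today == 2048
assert tokens_actually_computed == 1024
```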

The above inefficiencies come from the fact that we only learn which blocks are cached at Step 1.f. Thus, we propose the following changes:

  1. Improve can_allocate in the block manager at Step 1.b to consider prefix caching. For example, although a sequence needs 16 blocks for its prompt, can_allocate could still return True when the block manager has only 4 free blocks left, as long as 12 of the 16 blocks are already cached.
  2. Improve Step 1.c in the scheduler to consider prefix caching. Specifically, this step should guarantee that the new tokens scheduled for prefill are not already cached. If the entire prompt of a sequence is cached, we should only compute the last token. A sketch of both changes follows this list.
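
A minimal sketch of both proposed changes, under the same toy model as above; `cached_blocks` and `cached_tokens` stand in for whatever cache lookup the block manager would expose, and are not vLLM's real interface.

```python
BLOCK_SIZE = 16

def can_allocate(prompt_len, free_blocks, cached_blocks):
    # Change 1: only the uncached tail of the prompt needs free blocks.
    num_blocks = -(-prompt_len // BLOCK_SIZE)
    return free_blocks >= num_blocks - cached_blocks

def num_new_tokens(prompt_len, cached_tokens, chunk_size=None):
    # Change 2: charge the token budget only for uncached tokens.
    new_tokens = prompt_len - cached_tokens
    if new_tokens == 0:
        # Fully cached prompt: still compute the last token so the
        # model can produce its first output.
        new_tokens = 1
    if chunk_size is not None:
        new_tokens = min(new_tokens, chunk_size)
    return new_tokens

# With the numbers from the examples above: 4 free blocks now suffice,
# and the batch charges 1024 tokens against the budget instead of 2048.
assert can_allocate(16 * BLOCK_SIZE, free_blocks=4, cached_blocks=12)
assert 2 * num_new_tokens(1024, cached_tokens=512) == 1024
```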

cc @rkooo567 @sighingnow @Juelianqvq @zhuohan123

