Motivation.
As large language models (LLMs) support increasingly longer contexts, vLLM currently relies on the chunked-prefill mechanism to handle long sequences: a long sequence is split into individual chunks, and ring_attention is called to compute the attention results.
However, the chunked-prefill implementation has two issues:
(1) Each chunk is computed serially, so a single request cannot fully exploit the concurrency available in a multi-GPU environment;
(2) Every GPU within the tensor-parallel (TP) domain stores an identical copy of the KV cache, which wastes memory.
To address these issues, we propose introducing Context Parallel (CP) support and enhancing the existing Sequence Parallel (SP) capability.
As described in the paper https://arxiv.org/pdf/2411.01783, the advantages of Context Parallel are:
(1) Compute parallelization: CP distributes computation across multiple GPUs to reduce latency, in contrast with pipeline parallelism (PP), which improves throughput but not latency.
(2) Communication message size reduction: compared to tensor parallelism (TP), CP demands less communication bandwidth in multi-host environments, keeping the communication volume orders of magnitude smaller than TP, especially for inter-node communication.
(3) KV cache distribution: key and value (KV) embeddings grow linearly with context length. CP distributes the storage of KV embeddings across multiple GPUs, enabling larger batch sizes as more CP ranks are added.
For Context Parallel, the request is split into CP-size parts; each CP rank handles one part of the request and stores the corresponding KV cache within its own CP group.
For Sequence Parallel, vLLM already supports it through a compilation pass; what we add is storing the KV cache separately in each SP group.
Proposed Change.
Suppose there are 4 cards, with CP=2 and SP=2 (i.e., TP=2 with sequence parallelism enabled), batch size=1, seqlen=511, and block_size=128.
CP communication domains: [0, 2], [1, 3]
SP communication domains: [0, 1], [2, 3]
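For illustration only, the sketch below shows one way these CP and SP communication domains could be derived. The rank layout rank = cp_index * sp_size + sp_index and the helper build_groups are assumptions for the example, not vLLM's actual group-creation code.

```python
# Illustrative sketch (not vLLM internals): deriving the CP and SP communication
# domains for 4 GPUs with CP=2 and SP/TP=2, assuming rank = cp_index * sp_size + sp_index.

def build_groups(world_size: int, cp_size: int, sp_size: int):
    assert world_size == cp_size * sp_size
    # SP (TP) groups: blocks of consecutive ranks, which typically share fast intra-node links.
    sp_groups = [list(range(c * sp_size, (c + 1) * sp_size)) for c in range(cp_size)]
    # CP groups: ranks that share the same SP index across the SP groups.
    cp_groups = [list(range(s, world_size, sp_size)) for s in range(sp_size)]
    return cp_groups, sp_groups

cp_groups, sp_groups = build_groups(world_size=4, cp_size=2, sp_size=2)
print(cp_groups)  # [[0, 2], [1, 3]]
print(sp_groups)  # [[0, 1], [2, 3]]
```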
The splitting is performed along two dimensions: the KV cache management dimension and the computation dimension.
KV Cache Management Dimension Splitting:
This is primarily about allocating the block table. The sequence is split according to SP and CP:
Card 0 is assigned 128 tokens;
Card 1 is assigned 128 tokens;
Card 2 is assigned 128 tokens;
Card 3 is assigned 127 tokens.
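As a minimal sketch, assuming the per-card token counts listed above and block_size=128, each card then only needs to reserve blocks for its own segment (cdiv below is plain ceiling division, not a vLLM helper):

```python
# Illustrative sketch: per-card block-table size under the example split.

def cdiv(a: int, b: int) -> int:
    return -(-a // b)  # ceiling division

BLOCK_SIZE = 128
tokens_per_card = {0: 128, 1: 128, 2: 128, 3: 127}  # from the split above

for card, num_tokens in tokens_per_card.items():
    print(f"card {card}: {num_tokens} tokens -> {cdiv(num_tokens, BLOCK_SIZE)} block(s)")
# Each card reserves a single 128-token block, instead of the
# cdiv(511, 128) = 4 blocks that every GPU would need without the CP/SP split.
```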
In the prefill phase, each card performs its own attention computation and then invokes ring attention to obtain the final result. After the computation completes:
Card 0 stores the KV cache of tokens 0–127;
Card 1 stores the KV cache of tokens 128–255;
and so on, with each card saving only the KV cache of its assigned segment of the sequence.
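For reference, the sketch below shows the correction step that ring attention relies on when merging per-rank partial results: each CP rank returns its partial output together with the log-sum-exp (LSE) over the keys it owns, and the partials are rescaled and summed to recover the exact full-sequence attention. This is standalone PyTorch math for illustration, not vLLM's attention-backend code.

```python
import torch

def merge_partial_attention(outs: list[torch.Tensor],
                            lses: list[torch.Tensor]) -> torch.Tensor:
    """Combine per-CP-rank partial attention results into the exact full output.

    outs[i]: [num_heads, q_len, head_dim] -- attention computed over rank i's keys
    lses[i]: [num_heads, q_len]           -- log-sum-exp of rank i's attention scores
    """
    lse_stack = torch.stack(lses)                   # [cp, heads, q_len]
    global_lse = torch.logsumexp(lse_stack, dim=0)  # LSE over the full key set
    weights = torch.exp(lse_stack - global_lse)     # per-rank rescaling factors
    out_stack = torch.stack(outs)                   # [cp, heads, q_len, head_dim]
    return (weights.unsqueeze(-1) * out_stack).sum(dim=0)
```

Each rank's partial output is already softmax-weighted over its local keys, so rescaling it by exp(lse_i - global_lse) and summing across ranks reproduces the softmax over the full key set.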
Computation Dimension Splitting:
This mainly addresses load balancing. The sequence is evenly split across the CP groups, and within each CP group the corresponding computation is performed through SP (a sketch of this two-level split is shown below).
For Sequence Parallel:
For Context Parallel:
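As a rough illustration of that two-level split, here is a plain-Python sketch under the assumed rank layout rank = cp_index * sp_size + sp_index; the function name and layout are assumptions for this example, not vLLM code.

```python
# Illustrative sketch: split the sequence first across the CP groups, then split
# each CP part across the SP ranks inside that group.

def two_level_split(seq_len: int, cp_size: int, sp_size: int):
    assignment = {}
    cp_part = -(-seq_len // cp_size)              # ceil division: tokens per CP part
    for cp in range(cp_size):
        cp_start = cp * cp_part
        cp_len = min(cp_part, seq_len - cp_start)
        sp_part = -(-cp_len // sp_size)           # tokens per SP rank inside this CP part
        for sp in range(sp_size):
            start = cp_start + sp * sp_part
            length = max(0, min(sp_part, cp_len - sp * sp_part))
            assignment[cp * sp_size + sp] = (start, length)
    return assignment

print(two_level_split(seq_len=511, cp_size=2, sp_size=2))
# {0: (0, 128), 1: (128, 128), 2: (256, 128), 3: (384, 127)}
```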
The changes that need to be made are as follows:
(1) Create the Context Parallel communication domains (not reflected in the figure).
(2) Reuse the Tensor Parallel communication domains as the Sequence Parallel communication domains (not reflected in the figure).
(3) Modify the Scheduler to allocate block_tables.
(4) Modify prepare_inputs in the ModelRunner.
(5) Modify attention for CP/SP communication.

Feedback Period.
No response
CC List.
No response
Any Other Things.
Someone already submitted an RFC, but it was put on hold:
#7519
The idea is roughly the same, and we want to complete it.