
[RFC]: Context Parallelism && Sequence Parallelism #22693

@zhenwenqi2024

Description


Motivation.

As large language models (LLMs) support increasingly long contexts, vLLM currently relies on the chunked-prefill mechanism to handle long sequences: a long sequence is split into chunks, and ring_attention is called to compute the attention results.

However, the chunked-prefill implementation has two issues:
(1) The chunks are computed serially, so a single request cannot fully exploit the concurrency available in a multi-GPU environment;
(2) Every GPU within the tensor-parallel (TP) domain stores an identical copy of the KV cache, which wastes memory.

To address these issues, we propose introducing context parallelism (CP) and enhancing the existing sequence parallelism (SP) support.

As described in the paper https://arxiv.org/pdf/2411.01783, the advantages of context parallelism are:

(1) Compute parallelization: CP distributes computation across multiple GPUs to reduce latency, in contrast with pipeline parallelism (PP), which improves throughput but not latency.
(2) Communication message size reduction: compared to tensor parallelism (TP), CP demands less communication bandwidth in multi-host environments, maintaining a communication size that is orders of magnitude smaller than TP's, especially for inter-node communication.
(3) KV cache distribution: key and value (KV) embeddings grow linearly with context length. CP distributes the storage of KV embeddings across multiple GPUs, enabling larger batch sizes as more CP ranks are added.

For Context Parallel, the request is split across the CP ranks: each CP rank handles its part of the request and stores only the KV cache for that part.

[Figure]

For Sequence Parallel, vLLM already supports it via a compilation pass; what we add is that each SP rank saves only its own shard of the KV cache.

[Figure]

Proposed Change.

Suppose there are 4 cards, with CP=2, SP=2 (i.e., TP=2 with sequence parallelism enabled), batch size=1, seqlen=511, and block_size=128.
CP communication domains: [0, 2], [1, 3]
SP communication domains: [0, 1], [2, 3]
The splitting is performed along two dimensions: the KV cache management dimension and the computation dimension.
KV Cache Management Dimension Splitting:
This is primarily aimed at allocating the block table. The sequence is split based on SP and CP (a minimal sketch of this split follows the list below):
Card 0 is assigned 128 tokens;
Card 1 is assigned 128 tokens;
Card 2 is assigned 128 tokens;
Card 3 is assigned 127 tokens.
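As a rough illustration of this split, here is a minimal sketch that computes the per-card token counts and the number of KV-cache blocks each card would need in its block table. The helper name split_tokens_for_kv and its signature are hypothetical, not vLLM APIs.

```python
import math

def split_tokens_for_kv(seqlen: int, num_cards: int, block_size: int):
    # Contiguous split of the sequence across all CP*SP cards, plus how many
    # KV-cache blocks each card must request in its block table.
    per_card = math.ceil(seqlen / num_cards)
    token_counts = [max(0, min(per_card, seqlen - i * per_card)) for i in range(num_cards)]
    block_counts = [math.ceil(c / block_size) for c in token_counts]
    return token_counts, block_counts

print(split_tokens_for_kv(511, 4, 128))
# -> ([128, 128, 128, 127], [1, 1, 1, 1])
```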
In the prefill phase, each card performs attention over its own segment and then invokes ring-attention to obtain the final result (a sketch of how the partial results can be merged follows the list below). After the computation is completed:
Card 0 stores the KV cache of the first 128 tokens (tokens 1–128);
Card 1 stores the KV cache of tokens 129–256;
And so on, with each card saving only the KV cache corresponding to its assigned segment of the sequence.
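For reference, the per-card partial results can be merged with the standard log-sum-exp (LSE) correction that ring-attention kernels rely on. The sketch below is illustrative only: the function name, tensor shapes, and the assumption that each rank exposes its attention output together with the per-query LSE are ours, not vLLM's actual interface.

```python
import torch

def merge_partial_attention(outs: list[torch.Tensor], lses: list[torch.Tensor]) -> torch.Tensor:
    # outs[i]: [num_tokens, num_heads, head_dim] -- attention against rank i's KV shard
    # lses[i]: [num_tokens, num_heads]           -- log-sum-exp of rank i's attention scores
    lse = torch.stack(lses)                       # [cp, tokens, heads]
    lse_total = torch.logsumexp(lse, dim=0)       # global softmax normalizer
    weights = torch.exp(lse - lse_total)          # rescale each partial result
    out = torch.stack(outs)                       # [cp, tokens, heads, dim]
    return (weights.unsqueeze(-1) * out).sum(dim=0)
```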
Computation Dimension Splitting:
This mainly focuses on load balancing. The sequence is split evenly across the CP groups, and within each CP group the computation is further partitioned through SP.
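A minimal sketch of how such an even CP-then-SP split could be computed follows; the helper and its return format are hypothetical, and real implementations often reorder chunks (e.g. zig-zag assignment) to balance causal-attention work, which is omitted here.

```python
import math

def split_for_compute(seqlen: int, cp_size: int, sp_size: int):
    # Assign a contiguous [start, end) token range to every (cp_rank, sp_rank) pair:
    # the sequence is first split evenly across CP, then across SP within each CP chunk.
    ranges = {}
    per_cp = math.ceil(seqlen / cp_size)
    for cp in range(cp_size):
        cp_start, cp_end = cp * per_cp, min((cp + 1) * per_cp, seqlen)
        per_sp = math.ceil((cp_end - cp_start) / sp_size)
        for sp in range(sp_size):
            start = min(cp_start + sp * per_sp, cp_end)
            end = min(start + per_sp, cp_end)
            ranges[(cp, sp)] = (start, end)
    return ranges

print(split_for_compute(511, 2, 2))
# -> {(0, 0): (0, 128), (0, 1): (128, 256), (1, 0): (256, 384), (1, 1): (384, 511)}
```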

For Sequence Parallel:

[Figure]

For Context Parallel:

[Figure]

The changes that need to be made are as follows:
(1) Create the Context Parallel communication domains (not reflected in the figure; a minimal sketch follows the figure below).
(2) Sequence Parallel communication domains reuse the Tensor Parallel communication domains (not reflected in the figure).
(3) Modify the Scheduler to allocate block_tables per card.
(4) Modify prepare_inputs in the ModelRunner.
(5) Modify attention for the CP/SP communication.
[Figure]
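A rough sketch of how changes (1) and (2) could set up the communication domains using raw torch.distributed; in vLLM this would go through the existing parallel-state utilities, so the function below and its name are only illustrative.

```python
import torch.distributed as dist

def init_cp_sp_groups(world_size: int, cp_size: int, sp_size: int):
    # Build CP communication domains and reuse the TP layout for SP.
    # With world_size=4, cp_size=2, sp_size=2 this gives CP domains [0, 2], [1, 3]
    # and SP domains [0, 1], [2, 3], matching the example above.
    assert world_size == cp_size * sp_size
    cp_group_of, sp_group_of = {}, {}
    for i in range(sp_size):                                 # CP: same position within each SP group
        ranks = list(range(i, world_size, sp_size))          # [0, 2] and [1, 3]
        g = dist.new_group(ranks)
        for r in ranks:
            cp_group_of[r] = g
    for i in range(cp_size):                                 # SP: reuse TP's consecutive-rank layout
        ranks = list(range(i * sp_size, (i + 1) * sp_size))  # [0, 1] and [2, 3]
        g = dist.new_group(ranks)
        for r in ranks:
            sp_group_of[r] = g
    return cp_group_of, sp_group_of
```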

Feedback Period.

No response

CC List.

No response

Any Other Things.

A similar RFC was submitted previously but was put on hold:
#7519
The idea is roughly the same, and we want to complete it.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
