
[Feature] Implement Decode Context Parallel in SGLang #12196

@staugust

Description


Checklist

Motivation

Decoding very long sequences can overwhelm device memory due to KV cache growth, and tensor parallelism (TP) alone often isn't sufficient. I propose adding Decode Context Parallel (Decode CP) to SGLang, inspired by vLLM's approach and supported by the paper "Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding". Decode CP partitions the context along the sequence dimension and processes it across multiple devices (or compute units on a single device) during the decode phase, enabling efficient long-context handling and avoiding KV cache redundancy. Using MQA as an example:

  • With TP2: each rank keeps a full copy of the sequence's KV cache (MQA has a single KV head, so TP cannot shard it), so the maximum sequence length is still bounded by the memory of a single device.
  • With DP2: each device processes a different sequence; for any given sequence, its KV cache resides entirely on one device, and the maximum length is again limited by that device's memory.
  • With CP2: increasing cp_size partitions the KV cache across devices, eliminating redundancy and enabling much longer sequences, with the effective memory budget scaling roughly linearly with the number of CP shards.

The same applies to MLA and GQA when the TP size exceeds the number of KV heads, since the KV cache must then be replicated across TP ranks.
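
To put rough numbers on this, here is a back-of-the-envelope sketch in plain Python (no SGLang APIs; the model shape and the accounting are illustrative assumptions) comparing the per-device KV cache footprint of an MQA model under TP and under Decode CP:

```python
def kv_cache_bytes_per_device(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,      # 1 for MQA
    head_dim: int,
    dtype_bytes: int = 2,   # fp16/bf16
    tp_size: int = 1,
    cp_size: int = 1,
) -> int:
    """Approximate per-device KV cache size for one sequence.

    Assumptions (illustrative, not SGLang's actual accounting):
    - TP shards KV heads; with MQA (a single KV head) the cache is replicated,
      so TP does not reduce the per-device footprint.
    - CP shards tokens along the sequence dimension, so the footprint shrinks
      roughly linearly with cp_size.
    """
    kv_heads_per_rank = max(1, num_kv_heads // tp_size)   # replication floor for MQA
    tokens_per_rank = (seq_len + cp_size - 1) // cp_size  # sequence-dim shard
    # Factor 2 accounts for storing both K and V.
    return 2 * num_layers * kv_heads_per_rank * head_dim * dtype_bytes * tokens_per_rank


# Example: a hypothetical 32-layer MQA model decoding a 256k-token sequence.
cfg = dict(seq_len=256_000, num_layers=32, num_kv_heads=1, head_dim=128)
for tp, cp in [(1, 1), (2, 1), (1, 2), (1, 4)]:
    gib = kv_cache_bytes_per_device(**cfg, tp_size=tp, cp_size=cp) / 2**30
    print(f"TP{tp} CP{cp}: ~{gib:.2f} GiB of KV cache per device")
```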

Goals

  • Support much longer context lengths (e.g., 64k/128k/256k) by reducing the per-device KV cache footprint.

High-Level Proposal

  • KV cache partitioning and management:
    • Split the KV cache along the sequence dimension or block-wise; ensure compatibility with paged attention and efficient allocation/eviction.
  • Cross-shard attention during decode:
    • Implement communication patterns such as ring attention or reduce-scatter/all-gather so each decode step can access the required context across shards (see the attention-merge sketch after this list).
  • Scheduling and topology:
    • Introduce a context_parallel_size option in the scheduler; route requests based on sequence length and resource availability; co-plan with TP/DP (see the routing sketch after this list).
  • Compute-communication overlap:
    • Overlap attention computation with inter-device communication to reduce synchronization overhead.
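
To make the cross-shard attention step concrete, here is a minimal single-process sketch in PyTorch that simulates cp_size shards of the KV cache for one decode-step query. It illustrates the standard log-sum-exp (LSE) merge of partial attention outputs and is not SGLang's or vLLM's actual implementation; in a real Decode CP setup, the per-shard output and LSE would be the quantities exchanged via all-gather or a ring pattern.

```python
import torch

torch.manual_seed(0)
head_dim, seq_len, cp_size = 64, 1024, 4
scale = head_dim ** -0.5

q = torch.randn(head_dim)           # decode-step query for one head
k = torch.randn(seq_len, head_dim)  # full KV cache for that head
v = torch.randn(seq_len, head_dim)


def partial_attention(q, k_shard, v_shard):
    """Attention over one CP shard's slice of the KV cache.

    Returns the shard-local softmax output and its log-sum-exp; these two
    quantities are all that needs to cross shard boundaries.
    """
    scores = (k_shard @ q) * scale        # [tokens_in_shard]
    lse = torch.logsumexp(scores, dim=0)  # scalar
    probs = torch.softmax(scores, dim=0)
    return probs @ v_shard, lse           # [head_dim], scalar


# Each simulated "rank" holds a contiguous slice of the sequence.
outs, lses = zip(*(
    partial_attention(q, k_chunk, v_chunk)
    for k_chunk, v_chunk in zip(k.chunk(cp_size), v.chunk(cp_size))
))
outs, lses = torch.stack(list(outs)), torch.stack(list(lses))

# LSE-weighted merge: softmax over the per-shard LSEs recovers exact attention.
weights = torch.softmax(lses, dim=0)      # [cp_size]
merged = (weights[:, None] * outs).sum(dim=0)

# Reference: attention over the full, un-sharded KV cache.
reference = torch.softmax((k @ q) * scale, dim=0) @ v
print("matches full attention:", torch.allclose(merged, reference, atol=1e-5))
print("max abs error:", (merged - reference).abs().max().item())
```

The merge is exact, which is why sequence-dimension sharding composes cleanly with paged attention: each shard only needs its local blocks plus one small exchange per decode step.

For the scheduling item, routing could start from a simple length-based heuristic. The sketch below is purely hypothetical: choose_cp_size, the byte figures, and max_context_parallel_size are illustrative placeholders, not an existing SGLang interface.

```python
def choose_cp_size(
    prompt_len: int,
    max_new_tokens: int,
    kv_bytes_per_token: int,
    per_device_kv_budget_bytes: int,
    max_context_parallel_size: int = 8,
) -> int:
    """Pick the smallest power-of-two CP degree whose sharded KV cache fits.

    Hypothetical heuristic: short requests stay at cp_size=1 (no extra
    communication); long requests get just enough CP shards to fit the budget.
    """
    total_kv = (prompt_len + max_new_tokens) * kv_bytes_per_token
    cp_size = 1
    while cp_size < max_context_parallel_size and total_kv / cp_size > per_device_kv_budget_bytes:
        cp_size *= 2
    return cp_size


# Example: ~16 KiB of KV per token, 4 GiB of KV budget per device.
for prompt_len in (8_000, 128_000, 512_000, 1_000_000):
    cp = choose_cp_size(prompt_len, 4_096, 16 * 1024, 4 * 2**30)
    print(f"prompt {prompt_len:>9,} tokens -> context_parallel_size={cp}")
```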


Related resources

No response
