(1/n) Support context parallel with deepseekv3.2-DSA #12065
Summary of Changes (Gemini Code Assist): This pull request introduces foundational support for context parallelism (CP) within the DeepSeekV3.2-DSA model architecture. The primary goal is to reduce the time to first token (prefill-ttft) for long input sequences, enabled via a new environment variable.
Code Review
This pull request introduces context parallelism for deepseek-v3.2-DSA models to reduce the time-to-first-token for long sequences. The changes are controlled by a new environment variable and involve modifications across the attention, communication, and model-specific layers. The implementation is quite extensive and seems to correctly follow the context parallelism pattern. However, it currently has some limitations, such as only supporting single-batch prefill and being tied to a specific 8-GPU configuration. My review includes suggestions for improving code clarity, fixing a critical typo, and cleaning up some leftover development code. Overall, this is a valuable performance enhancement.
@lixiaolx Please fix lint following the instructions here: https://docs.sglang.ai/developer_guide/contribution_guide.html#format-code-with-pre-commit
Does the current context parallel only support a single machine (tp_size == 8)?
Yes, the multi-machine approach is still in testing and verification; we plan to submit a separate PR later.
It also does not support the P/D disaggregation case; I have just tested this.
Yes, future updates will support these cases.
Can you please test the accuracy of GPQA with this PR:

```bash
python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3
```

The result should be about 0.80. Or another long-context benchmark, since this PR is an optimization for long context.
```python
weights_prev, weights_next = torch.split(
    weights, (weights.shape[0] + 1) // 2, dim=0
)
topk_result_prev = self._get_topk_ragged_with_cp(
```
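For reference, a minimal standalone sketch of the split semantics above (the example tensor is made up): with a split size of `(n + 1) // 2`, the first chunk takes the ceiling half of the rows and the second takes the remainder, so odd-length inputs need no padding.

```python
import torch

# Made-up example: 7 "token" rows split into a prev chunk of ceil(7/2) = 4
# rows and a next chunk of the remaining 3 rows.
weights = torch.randn(7, 4)
weights_prev, weights_next = torch.split(
    weights, (weights.shape[0] + 1) // 2, dim=0
)
assert weights_prev.shape[0] == 4 and weights_next.shape[0] == 3
```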
How much performance benefit comes from the ragged treatment, which splits hidden_states into prev and next parts?
sglang-bot left a comment:
Can you also add a test case (to show the launch command)? It does not need to run in per-commit CI.
@lixiaolx Nice work! CP is highly effective in reducing TTFT for long sequences, as it shards the input sequence across multiple devices. IIUC, this PR is designed for deepseekv3.2-DSA. It looks promising — have you thought about making it more extensible to benefit other models as well?
FYI, our team is working on CP support in vLLM, with current efforts centered on supporting GQA-based models (vllm-project/vllm#26864). We'd love to collaborate or help align the designs if helpful!
Our cp_size reuses atten_tp_size. Adjusting DP_size should meet your needs.
The split-function migration is underway and will be submitted soon.
Does this imply that tensor parallelism (TP) and context parallelism (CP) cannot coexist?
```python
self,
hidden_states,
gemm_output_zero_allocator: BumpAllocator = None,
forward_batch: ForwardBatch = None,
```
We'd better specify the exact info we need rather than passing an entire forward_batch.
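For illustration only, a hedged sketch of the kind of refactor being suggested; the type, field, and function names below are hypothetical, not the actual sglang API:

```python
from dataclasses import dataclass

import torch


@dataclass
class CpPrefillInfo:
    """Hypothetical container for just the fields the CP path reads."""

    seq_lens: torch.Tensor  # per-request sequence lengths
    cp_rank: int            # this rank's index within the CP group
    cp_size: int            # number of context-parallel ranks


def moe_forward(hidden_states: torch.Tensor, cp_info: CpPrefillInfo) -> torch.Tensor:
    # The layer consumes only the explicit fields it needs instead of an
    # entire ForwardBatch object.
    return hidden_states
```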
OK, this issue will be fixed in the next PR soon.
Motivation
Currently, with DeepSeekV3.2-DSA, the prefill TTFT for long input sequences is high. Introducing context parallelism reduces TTFT.

Main design ideas:
Taking TP=EP=4 and DP=2 as an example (CP_SIZE == ATTEN_TP_SIZE):
- Each DP rank accepts an independent request.
- Within each DP rank, after embedding, the hidden states of shape (batch * seq_len, H) are split into context-parallel segments, so each ATTEN_TP rank holds a (batch * seq_len / CP_SIZE, H) shard. The MoE part uses DeepEP, and its input is likewise (batch * seq_len / CP_SIZE, H). This repeats for layer_nums layers. Finally, an all-gather over the results ensures every atten_tp rank receives the complete hidden states (see the sketch below).
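A hedged sketch of this data flow, not the actual sglang implementation: it assumes the token count divides evenly by CP_SIZE, whereas the real ragged path also handles uneven splits.

```python
import torch
import torch.distributed as dist


def shard_tokens(hidden_states: torch.Tensor, cp_rank: int, cp_size: int) -> torch.Tensor:
    # (batch * seq_len, H) -> (batch * seq_len / cp_size, H); each CP rank
    # keeps one contiguous shard of the token dimension through all layers.
    return hidden_states.chunk(cp_size, dim=0)[cp_rank]


def gather_tokens(local_hidden: torch.Tensor, cp_group) -> torch.Tensor:
    # After the last layer, all-gather the shards so every rank holds the
    # full (batch * seq_len, H) hidden states again.
    shards = [torch.empty_like(local_hidden) for _ in range(dist.get_world_size(cp_group))]
    dist.all_gather(shards, local_hidden, group=cp_group)
    return torch.cat(shards, dim=0)
```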
Explanation of main changes:
(figure: diagram of the main changes)
Current status:
- Only single-batch, single-machine prefill is supported.
- Feature switch: --enable-nsa-prefill-context-parallel (default: off); see the gating sketch below.
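A hypothetical sketch of how such a feature gate might be checked before taking the CP prefill path; the function and attribute names are assumptions, not sglang's actual ones:

```python
def use_cp_prefill(server_args, batch_size: int, tp_size: int) -> bool:
    # Take the context-parallel prefill path only when the switch is on and
    # the current single-batch, single-machine (tp_size == 8) limits hold.
    return (
        getattr(server_args, "enable_nsa_prefill_context_parallel", False)
        and batch_size == 1
        and tp_size == 8
    )
```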
Test command
Launch:

```bash
python3 -m sglang.launch_server --model-path $MODEL_PATH --dp 8 --nnodes 1 --enable-dp-attention --node-rank 0 --trust-remote-code \
  --dist-init-addr 0.0.0.0:6432 --port 8000 --host 0.0.0.0 --attention-backend nsa --nsa-prefill flashmla_sparse --nsa-decode flashmla_sparse \
  --max-total-tokens 128000 --enable-metrics --mem-fraction-static 0.8 --max-running-requests 8 --enable-cache-report --page-size 64 \
  --tp-size 8 --ep-size 8 --skip-server-warmup --disable-overlap-schedule --decode-log-interval 1 --moe-a2a-backend deepep \
  --speculative-algorithm EAGLE --speculative-draft-model-path $DRAFT_MODEL_PATH \
  --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
```
curl:

```bash
curl http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "ds32-model", "prompt": "Write an ad copy for a new product, a digital photo frame that connects to your social media accounts and displays your photos. Respond with at most 150 words.", "max_tokens": 300, "temperature": 0, "stream": false}'
```

With enable-nsa-prefill-context-parallel ON:
The ad copy should be targeted at young adults and should highlight the product's unique features.\n\nCapture your life's best moments, not just on your phone, but in your space. Introducing the SocialFrame, the digital photo frame that brings your social media to life.\n\nIt automatically syncs with your Instagram and Facebook, creating a living gallery of your adventures, friends, and family. No more tedious uploading! Watch as new posts from your favorite people appear, keeping you connected to their lives in a beautiful, tangible way.\n\nThe sleek, modern design fits any decor, and the high-resolution display makes every memory shine. Give your photos the spotlight they deserve.\n\nTurn your feed into your frame. Get the SocialFrame today

With enable-nsa-prefill-context-parallel OFF:
The ad copy should be targeted at young adults and should highlight the product's unique features.\n\nCapture your life's best moments, not just on your phone, but in your home. Introducing the SocialFrame, the digital photo frame that brings your social media to life.\n\nIt automatically syncs with your Instagram and Facebook albums, creating a constantly evolving gallery of your favorite memories. No more manual uploads! See your latest adventures, group shots, and everyday joys displayed in stunning HD.\n\nPerfect for your desk, your nightstand, or your living room, the SocialFrame is more than a frame—it's a live stream of your story. Share smiles, relive laughs, and keep your cherished connections close.\n\nDon't just store your photos. Celebrate them. Get your SocialFrame today

Accuracy

Benchmark used:

```bash
python3 benchmark/gsm8k/bench_sglang.py --host http://127.0.0.1 --port 8000 --num-questions 200
```