
Conversation

@ispobock (Collaborator) commented Nov 9, 2024

Motivation

Support data parallelism on MLA for the DeepSeek model to reduce the replicated KV cache.

python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tp 8 --dp 8 --enable-dp-attention

Modifications

  • Add the --enable-dp-attention option. When it is turned on, DP and TP share the same workers.
  • Add an IDLE forward mode for workers that have no sequences to forward but still need to stay in sync with the other TP workers.
  • Implement the model forward pass with DP attention + TP MoE (a rough sketch follows below).
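
A rough conceptual sketch of the per-layer data flow (an illustration only, not the actual implementation; attn_forward, moe_forward, the fixed-size padding, and the sizes are placeholders):

# Conceptual sketch of DP attention + TP MoE (illustration only, not SGLang code)
import torch
import torch.distributed as dist

HIDDEN = 4096            # hypothetical hidden size
MAX_LOCAL_TOKENS = 256   # hypothetical per-rank padding length

def dp_attn_tp_moe_layer(local_hidden, num_local_tokens, dp_group,
                         attn_forward, moe_forward):
    """local_hidden: [MAX_LOCAL_TOKENS, HIDDEN], the padded batch of one DP rank."""
    if num_local_tokens == 0:
        # IDLE forward mode: this worker has no sequences to run, but it still
        # contributes a dummy (zero) tensor so the collective below stays in sync.
        attn_out = torch.zeros(MAX_LOCAL_TOKENS, HIDDEN, device=local_hidden.device)
    else:
        # Attention (MLA) runs data-parallel: each rank attends only to its own
        # requests, so the KV cache is not replicated across ranks.
        attn_out = attn_forward(local_hidden)

    # Gather hidden states from all DP ranks so the TP-sharded MoE layers can
    # run over the combined batch on every worker.
    world = dist.get_world_size(group=dp_group)
    gathered = [torch.empty_like(attn_out) for _ in range(world)]
    dist.all_gather(gathered, attn_out, group=dp_group)
    global_hidden = torch.cat(gathered, dim=0)

    moe_out = moe_forward(global_hidden)   # tensor-parallel MoE forward

    # Slice this rank's tokens back out of the combined batch.
    rank = dist.get_rank(group=dp_group)
    start = rank * MAX_LOCAL_TOKENS
    return moe_out[start:start + num_local_tokens]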

Performance

Compared to the main branch, this PR improves prefill throughput by 20% and decode throughput by 67% for the DeepSeek-V2 model on 8×H100.

DP+TP (this PR):

  • prefill: 21658.78 toks/s
  • decode: 11174.62 toks/s

TP (main branch):

  • prefill: 17941.92 toks/s
  • decode: 6656.75 toks/s

Reproduce:

# DP+TP
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8 --dp 8 --enable-dp-attention
# TP
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8

# bench prefill
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 512 --random-output 1 --random-range-ratio 1 --num-prompts 10000
# bench decode
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 10000

TODO

  • Compatible with cuda graph
  • Compatible with overlap mode

@merrymercy merrymercy self-assigned this Nov 9, 2024
@fengyang95

@ispobock How much performance improvement is expected? Is it mainly in throughput or latency?

@ispobock (Collaborator, Author)

How much performance improvement is expected? Is it mainly in throughput or latency?

@fengyang95 There is an issue with dp 8. I will test the performance once the issue is fixed. It's mainly for throughput.

@zhyncs zhyncs self-assigned this Nov 12, 2024
@ispobock ispobock changed the title from "[WIP] Support DP MLA" to "Support DP MLA" Nov 12, 2024
@fengyang95

  • Compatible with cuda graph
  • Compatible with overlap mode

@ispobock Hi, when is cuda graph support planned? It is critical for latency improvement.

@ispobock (Collaborator, Author)

when is cuda graph support planned?

I will support it soon. The code is almost done and needs some tests.

@fengyang95

when is cuda graph support planned?

I will support it soon. The code is almost done and needs some tests.

@ispobock How much additional VRAM would this require approximately?

@ispobock (Collaborator, Author)

How much additional VRAM would this require approximately?

@fengyang95 For the 236B V2 model, if DP attention is used, the total weights take ~570 GB, so it is preferable to use the FP8 quantized model for better performance.
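
Rough back-of-the-envelope for where that figure comes from (my own estimate; the per-rank replicated share is a hypothetical placeholder, not a measured value):

# Rough estimate of total weight memory with DP attention (BF16).
# Assumption: the MoE/FFN weights stay TP-sharded, while the MLA attention
# (and embedding) weights are replicated on every DP rank.
total_params = 236e9        # DeepSeek-V2 parameter count
replicated_params = 7e9     # hypothetical attention + embedding params per rank
dp_size = 8
bytes_per_param = 2         # BF16

fully_sharded = total_params * bytes_per_param                          # ~472 GB
replication_cost = (dp_size - 1) * replicated_params * bytes_per_param  # ~98 GB
print((fully_sharded + replication_cost) / 1e9)                         # ~570 GB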

@merrymercy (Contributor) left a comment

Great work!

@merrymercy merrymercy enabled auto-merge (squash) November 16, 2024 08:57
@merrymercy merrymercy merged commit 976bc30 into sgl-project:main Nov 16, 2024
13 checks passed
@merrymercy merrymercy mentioned this pull request Nov 24, 2024
@fengyang95

@ispobock Does this support W4A16? My VRAM is very limited, and even with FP8 it is not enough.

@ispobock ispobock mentioned this pull request Jan 7, 2025
@ispobock (Collaborator, Author) commented Jan 7, 2025

Does this support W4A16?

@fengyang95 AWQ is supported in #2364.

@zhaochenyang20 zhaochenyang20 mentioned this pull request Mar 3, 2025
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
@faradawn (Contributor)

Hi, for --tp 8 --dp 8 --enable-dp-attention: 1) is it that we replicate the attention layer (the weights W^Q, W^KV) 8 times? 2) Is it that we pass different data to each TP worker? 3) Without dp attention, would TP=8 and DP=8 require 64 GPUs?

@chenhao-stick-to

Hi, for --tp 8 --dp 8 --enable-dp-attention: 1) is it that we replicate the attention layer (the weights W^Q, W^KV) 8 times? 2) Is it that we pass different data to each TP worker? 3) Without dp attention, would TP=8 and DP=8 require 64 GPUs?

Given that the real purpose of dp attention is to reduce the duplicated KV cache of the MLA model, and that underneath it actually reuses a single TP group, here is my take (happy to discuss if anything is wrong):
1) No. The full attention layer is still split into 8 shards (as in tp8); after the computation, the results are all-gathered back to the corresponding logical DP rank. There are clearly 8 logical DP ranks here, so it is really just a gather to the matching DP rank.
2) As far as I know, yes, that many GPUs would be required.
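
To make the GPU-count point concrete, a tiny sketch (my reading of the flags; the helper function is hypothetical, not an SGLang API):

# How many GPUs the two configurations occupy (illustrative helper only).
def num_gpus(tp_size, dp_size, enable_dp_attention):
    if enable_dp_attention:
        # DP and TP share the same workers: each TP rank also acts as a
        # logical DP rank, so the worker count does not multiply.
        return tp_size
    # Plain data parallelism launches dp_size independent TP replicas.
    return tp_size * dp_size

print(num_gpus(8, 8, enable_dp_attention=True))   # 8 GPUs  (--tp 8 --dp 8 --enable-dp-attention)
print(num_gpus(8, 8, enable_dp_attention=False))  # 64 GPUs (separate DP replicas)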
