
Conversation

@ispobock (Collaborator) commented Nov 9, 2024

Motivation

Support data parallelism on MLA for the DeepSeek model to reduce the replicated KV cache.

python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tp 8 --dp 8 --enable-dp-attention

Modifications

  • Add the --enable-dp-attention option. When it is turned on, DP and TP share the same workers.
  • Add an IDLE forward mode for workers that have no sequences to forward but still need to stay in sync with the other TP workers.
  • Implement the model forward pass with DP attention + TP MoE (a rough sketch follows below).
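
A rough conceptual sketch of the per-layer data flow (an illustration only, not the actual implementation; attn_forward, moe_forward, the fixed-size padding, and the sizes are placeholders):

# Conceptual sketch of DP attention + TP MoE (illustration only, not SGLang code)
import torch
import torch.distributed as dist

HIDDEN = 4096            # hypothetical hidden size
MAX_LOCAL_TOKENS = 256   # hypothetical per-rank padding length

def dp_attn_tp_moe_layer(local_hidden, num_local_tokens, dp_group,
                         attn_forward, moe_forward):
    """local_hidden: [MAX_LOCAL_TOKENS, HIDDEN], the padded batch of one DP rank."""
    if num_local_tokens == 0:
        # IDLE forward mode: this worker has no sequences to run, but it still
        # contributes a dummy (zero) tensor so the collective below stays in sync.
        attn_out = torch.zeros(MAX_LOCAL_TOKENS, HIDDEN, device=local_hidden.device)
    else:
        # Attention (MLA) runs data-parallel: each rank attends only to its own
        # requests, so the KV cache is not replicated across ranks.
        attn_out = attn_forward(local_hidden)

    # Gather hidden states from all DP ranks so the TP-sharded MoE layers can
    # run over the combined batch on every worker.
    world = dist.get_world_size(group=dp_group)
    gathered = [torch.empty_like(attn_out) for _ in range(world)]
    dist.all_gather(gathered, attn_out, group=dp_group)
    global_hidden = torch.cat(gathered, dim=0)

    moe_out = moe_forward(global_hidden)   # tensor-parallel MoE forward

    # Slice this rank's tokens back out of the combined batch.
    rank = dist.get_rank(group=dp_group)
    start = rank * MAX_LOCAL_TOKENS
    return moe_out[start:start + num_local_tokens]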

Performance

Compared to the main branch, this PR improves prefill throughput by 20% and decode throughput by 67% for the DeepSeek-V2 model on 8×H100.

DP+TP (this PR):

  • prefill: 21658.78 toks/s
  • decode: 11174.62 toks/s

TP (main branch):

  • prefill: 17941.92 toks/s
  • decode: 6656.75 toks/s

Reproduce:

# DP+TP
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8 --dp 8 --enable-dp-attention
# TP
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8

# bench prefill
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 512 --random-output 1 --random-range-ratio 1 --num-prompts 10000
# bench decode
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 10000

TODO

  • Compatible with cuda graph
  • Compatible with overlap mode

@merrymercy merrymercy self-assigned this Nov 9, 2024
@fengyang95

@ispobock How much performance improvement is expected? Is it mainly in throughput or latency?

@ispobock (Collaborator, Author)

How much performance improvement is expected? Is it mainly in throughput or latency?

@fengyang95 There is an issue with dp 8. I will test the performance once the issue is fixed. It's mainly for throughput.

@zhyncs zhyncs self-assigned this Nov 12, 2024
@ispobock ispobock changed the title from "[WIP] Support DP MLA" to "Support DP MLA" Nov 12, 2024
@fengyang95

  • Compatible with cuda graph
  • Compatible with overlap mode

@ispobock Hi, when is cuda graph support planned? It is critical for latency improvement.

@ispobock (Collaborator, Author)

when is cuda graph support planned?

I will support it soon. The code is almost done and needs some tests.

@fengyang95

when is cuda graph support planned?

I will support it soon. The code is almost done and needs some tests.

@ispobock How much additional VRAM would this require approximately?

@ispobock (Collaborator, Author)

How much additional VRAM would this require approximately?

@fengyang95 For the 236B V2 model, if DP attention is used, the total weights take ~570 GB, so it is preferable to use the FP8 quantized model for better performance.
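
Rough back-of-the-envelope for where that figure comes from (my own estimate; the per-rank replicated share is a hypothetical placeholder, not a measured value):

# Rough estimate of total weight memory with DP attention (BF16).
# Assumption: the MoE/FFN weights stay TP-sharded, while the MLA attention
# (and embedding) weights are replicated on every DP rank.
total_params = 236e9        # DeepSeek-V2 parameter count
replicated_params = 7e9     # hypothetical attention + embedding params per rank
dp_size = 8
bytes_per_param = 2         # BF16

fully_sharded = total_params * bytes_per_param                          # ~472 GB
replication_cost = (dp_size - 1) * replicated_params * bytes_per_param  # ~98 GB
print((fully_sharded + replication_cost) / 1e9)                         # ~570 GB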

@merrymercy (Contributor) left a comment

Great work!

@merrymercy merrymercy enabled auto-merge (squash) November 16, 2024 08:57
@merrymercy merrymercy merged commit 976bc30 into sgl-project:main Nov 16, 2024
13 checks passed
@merrymercy merrymercy mentioned this pull request Nov 24, 2024
@fengyang95

@ispobock Does this support W4A16? My VRAM is very limited, and even with FP8 it is not enough.

@ispobock ispobock mentioned this pull request Jan 7, 2025
@ispobock (Collaborator, Author) commented Jan 7, 2025

Does this support W4A16?

@fengyang95 AWQ is supported in #2364.

@zhaochenyang20 zhaochenyang20 mentioned this pull request Mar 3, 2025
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
@faradawn (Contributor)

Hi, for --tp 8 --dp 8 --enable-dp-attention: 1) is it that we replicate the attention layer (the weights W^Q, W^KV) 8 times? 2) Is it that we pass different data to each TP worker? 3) Without dp attention, would TP=8 and DP=8 require 64 GPUs?

@chenhao-stick-to

Hi, for --tp 8 --dp 8 --enable-dp-attention: 1) is it that we replicate the attention layer (the weights W^Q, W^KV) 8 times? 2) Is it that we pass different data to each TP worker? 3) Without dp attention, would TP=8 and DP=8 require 64 GPUs?

Given that the real purpose of dp attention is to reduce the duplicated KV cache of the MLA model, and that underneath it actually reuses a single TP group, here is my take (happy to discuss if anything is wrong):
1) No. The full attention layer is still split into 8 shards (as in tp8); after the computation, the results are all-gathered back to the corresponding logical DP rank. There are clearly 8 logical DP ranks here, so it is really just a gather to the matching DP rank.
2) As far as I know, yes, that many GPUs would be required.
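
To make the GPU-count point concrete, a tiny sketch (my reading of the flags; the helper function is hypothetical, not an SGLang API):

# How many GPUs the two configurations occupy (illustrative helper only).
def num_gpus(tp_size, dp_size, enable_dp_attention):
    if enable_dp_attention:
        # DP and TP share the same workers: each TP rank also acts as a
        # logical DP rank, so the worker count does not multiply.
        return tp_size
    # Plain data parallelism launches dp_size independent TP replicas.
    return tp_size * dp_size

print(num_gpus(8, 8, enable_dp_attention=True))   # 8 GPUs  (--tp 8 --dp 8 --enable-dp-attention)
print(num_gpus(8, 8, enable_dp_attention=False))  # 64 GPUs (separate DP replicas)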
