feat: mtp support dp-attention (#6081)
Conversation
After testing, the error is as follows: When
@lambert0312 The latest commit (#5256543) has fixed this bug. Thanks!
@u4lr451 Great, it has been verified to work properly, but the speed is much slower than when dp-attention is not enabled. Why is this?
@lambert0312 The choice between pure TP and DP-attention depends on multiple factors, such as GPU model, request batch size/concurrency, DP parameters, business SLA requirements, etc.
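For readers weighing the two modes, a launch sketch may help (flag names follow SGLang's server arguments; the model path and parallel sizes below are placeholders, not a recommendation):

```shell
# Pure TP: all 8 GPUs form a single tensor-parallel group.
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8

# DP attention: attention runs data-parallel across replicas while the
# rest of the model stays tensor-parallel; tends to help at high
# concurrency, and may hurt at low batch sizes.
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 --dp 8 --enable-dp-attention
```

Benchmarking both configurations under your own request mix is the only reliable way to decide.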
@ch-wan @fzyzcjy @merrymercy @zhyncs hi, would someone mind checking if this is ready to merge? Thanks!
With DP attention, MTP, and CUDA graph enabled, we found that performance dropped sharply; analysis showed the acceptance rate fell significantly, which in turn lowered throughput. With CUDA graph enabled:
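How sensitive MTP throughput is to the acceptance rate can be shown with a back-of-the-envelope model (an illustrative sketch; the rates and draft lengths are made-up examples, and the independence assumption is a simplification of real speculative decoding):

```python
def expected_tokens_per_step(accept_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass, assuming each
    draft token is accepted with probability accept_rate and acceptance
    stops at the first rejection (so k drafts survive with prob ~p^k)."""
    # 1 bonus token from the target model, plus the expected accepted drafts.
    return 1.0 + sum(accept_rate ** k for k in range(1, num_draft_tokens + 1))

# A drop in acceptance rate directly shrinks the MTP speedup:
high = expected_tokens_per_step(0.9, 2)  # healthy acceptance -> 2.71 tokens/step
low = expected_tokens_per_step(0.5, 2)   # degraded acceptance -> 1.75 tokens/step
print(high, low)
```

Since the target model's forward pass dominates latency, a fall from 2.71 to 1.75 tokens per step is roughly a 35% throughput loss, consistent with the large drop observed here.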
ch-wan left a comment:
Thank you for this excellent contribution. It represents a major optimization for boosting the throughput of DeepSeek-V3/R1, with its correctness and effectiveness verified by many contributors and users from the community. The current implementation looks solid to me.
For future PRs, consider these remaining optimizations:
- Enabling CUDA graphs for idle batches during `verify` or `draft_after_decode`. This was previously implemented but reverted by me to unblock merging this PR.
- Migrating DP attention support to #6995. The current setup requires capturing 3 CUDA graphs and creating 3 `gathered_buffers`, which consumes unnecessary memory.
- Reducing scheduling overhead. The current approach may invoke `all_gather_into_tensor` twice to check for idle batches, potentially lowering end-to-end throughput in some scenarios.
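The double `all_gather_into_tensor` mentioned in the last bullet could in principle be collapsed into one collective by packing both per-rank flags into a single gathered message. A sketch of the idea, simulating the gather with plain Python lists rather than the actual scheduler code (the flag names `has_work`/`is_extend` are hypothetical):

```python
def gather_once(per_rank_flags):
    """Simulate one all_gather of a packed (has_work, is_extend) flag pair
    per DP rank, instead of two separate collectives.

    per_rank_flags: list of (has_work, is_extend) tuples, one per rank.
    Returns (any_idle, any_extend) as every rank would compute them.
    """
    # One "collective": every rank sees the full flattened flag table.
    gathered = [flag for pair in per_rank_flags for flag in pair]
    has_work = gathered[0::2]
    is_extend = gathered[1::2]
    any_idle = not all(has_work)
    any_extend = any(is_extend)
    return any_idle, any_extend

# Rank 1 is idle; rank 2 has an extend batch.
print(gather_once([(True, False), (False, False), (True, True)]))  # (True, True)
```

In a real implementation the pair would be a small int tensor and the gather a single `all_gather_into_tensor`, halving the number of blocking collectives per scheduling step.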
When I use the following args for the DeepSeek-R1 model, without a separate draft model: `--speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, it raises an error. Do you know why and how to fix it? The complete logs are:
@Xuweijia-buaa see this: #7506

Motivation
Add MTP (multi-token prediction) support for DP attention.
Checklist
Accuracy
Measured with the MMLU benchmark:
- Average accuracy: 0.887
- Average accuracy: 0.887