[Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather #26456
youkaichao merged 7 commits into vllm-project:main
Signed-off-by: Yongye Zhu <[email protected]>
Code Review
I've reviewed the changes to integrate the CUDA cp_gather_indexer_k_quant_cache operator. The replacement of the Python-based implementation with the custom CUDA operator is a great performance enhancement. The adjustments at the call site, including the updated allocation for k_scale and the removal of the redundant num_reqs parameter, are correctly handled. The use of .view(torch.float32) on the k_scale tensor before passing it to fp8_mqa_logits is also appropriate given the change in its underlying data type. Overall, this is a solid and well-executed optimization. Great work!
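For readers unfamiliar with this kind of op, here is a minimal pure-Python sketch of what a paged cache gather does conceptually. It is not the vLLM kernel or its signature: the function names, argument names, and the list-based cache layout are all hypothetical, chosen only for illustration. It also shows why a separate num_reqs argument can be redundant (it is derivable from the cumulative sequence lengths) and what reinterpreting raw bytes as float32, in the spirit of .view(torch.float32), amounts to.

```python
import struct


def gather_paged_rows(cache, block_table, cu_seq_lens, block_size):
    """Gather per-token rows from a paged cache into one contiguous list.

    Hypothetical layout, for illustration only:
      cache[b][s]       row stored in slot s of physical block b
      block_table[r][j] physical block holding logical block j of request r
      cu_seq_lens       cumulative token counts; request r owns output rows
                        cu_seq_lens[r] .. cu_seq_lens[r + 1] - 1
    """
    # The number of requests follows from cu_seq_lens, so it does not
    # need to be passed separately.
    num_reqs = len(cu_seq_lens) - 1
    out = [None] * cu_seq_lens[-1]
    for req in range(num_reqs):
        start, end = cu_seq_lens[req], cu_seq_lens[req + 1]
        for pos in range(end - start):
            block = block_table[req][pos // block_size]
            out[start + pos] = cache[block][pos % block_size]
    return out


def bytes_as_float32(raw):
    """Reinterpret a little-endian byte buffer as float32 values (no conversion)."""
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Two requests (3 tokens and 1 token), block size 2, three physical blocks.
cache = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
block_table = [[2, 0], [1]]
cu_seq_lens = [0, 3, 4]
gathered = gather_paged_rows(cache, block_table, cu_seq_lens, block_size=2)

# Scales stored as raw bytes, then viewed as float32.
scales = bytes_as_float32(struct.pack("<2f", 0.5, 2.0))
```

A CUDA version of the same idea would typically assign one thread (or warp) per output row instead of looping, which is where a speedup over a Python-level implementation would come from.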
How’s the perf?
Perf is not looking very good on TP8. I ran it with 1000:1000 x 256 and the result. I am not sure if DEP8 is fixed by now.
It's better than the result here. Maybe you can try benchmarking the perf before / after this PR.
heheda12345 left a comment
Approve to unblock. But prefer to wait for more benchmark result.
Without the custom CUDA kernel, 1000:1000x256. The CUDA kernel offers about a 5% throughput increase.
…roject#26456) Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: bbartels <[email protected]>
…roject#26456) Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: bogdan01m <[email protected]>
…roject#26456) Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: Alberto Perdomo <[email protected]>
…roject#26456) Signed-off-by: Yongye Zhu <[email protected]>
…roject#26456) Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: 0xrushi <[email protected]>
Replace the torch cp_gather_indexer_k_quant_cache with a CUDA op. Follow-up for #25931.
gsm8k 20 shots