[Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather#26456

Merged
youkaichao merged 7 commits into vllm-project:main from zyongye:cuda_indexer_k_gather
Oct 15, 2025

Conversation

@zyongye
Member

@zyongye zyongye commented Oct 9, 2025

Replace the torch-based cp_gather_indexer_k_quant_cache with a CUDA op.
Follow-up to #25931

gsm8k (20-shot):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9378|±  |0.0067|
|     |       |strict-match    |    20|exact_match|↑  |0.9386|±  |0.0066|
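For context, the op being replaced is essentially a paged-cache gather: quantized K entries are copied from scattered cache blocks into a contiguous per-request buffer. A minimal numpy sketch of the gather pattern (shapes, names, and layout here are illustrative assumptions, not vLLM's actual kernel interface):

```python
import numpy as np

# Hypothetical paged cache layout: num_blocks blocks of block_size tokens,
# each token a quantized head_dim-byte key vector (uint8 standing in for fp8 storage).
num_blocks, block_size, head_dim = 8, 4, 16
rng = np.random.default_rng(0)
k_cache = rng.integers(0, 256, size=(num_blocks, block_size, head_dim), dtype=np.uint8)

def gather_k(block_table: np.ndarray, seq_len: int) -> np.ndarray:
    """Gather one request's keys from its cache blocks into a contiguous buffer."""
    out = np.empty((seq_len, head_dim), dtype=np.uint8)
    for tok in range(seq_len):
        blk = block_table[tok // block_size]  # which physical block holds this token
        off = tok % block_size                # offset within that block
        out[tok] = k_cache[blk, off]
    return out

# A request whose 6 tokens live in physical blocks 5 and 2.
gathered = gather_k(np.array([5, 2]), seq_len=6)
assert (gathered[0] == k_cache[5, 0]).all()
assert (gathered[4] == k_cache[2, 0]).all()
```

The CUDA kernel performs this copy (plus the per-token scale handling) in parallel on-device, avoiding the Python-level indexing the torch implementation needed.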

@mergify mergify bot added the deepseek Related to DeepSeek models label Oct 9, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

I've reviewed the changes to integrate the CUDA cp_gather_indexer_k_quant_cache operator. The replacement of the Python-based implementation with the custom CUDA operator is a great performance enhancement. The adjustments at the call site, including the updated allocation for k_scale and the removal of the redundant num_reqs parameter, are correctly handled. The use of .view(torch.float32) on the k_scale tensor before passing it to fp8_mqa_logits is also appropriate given the change in its underlying data type. Overall, this is a solid and well-executed optimization. Great work!
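The `.view(torch.float32)` call mentioned above reinterprets the scale tensor's raw bytes rather than converting values, since k_scale is now stored as bytes alongside the quantized cache. numpy's `ndarray.view` has the same dtype-reinterpretation semantics, which this small illustration uses (the torch call behaves analogously):

```python
import numpy as np

# Four float32 scales laid out as raw bytes, as they might sit in a byte-typed cache.
scales = np.array([0.5, 1.0, 2.0, 0.25], dtype=np.float32)
raw = scales.view(np.uint8)  # 16 bytes: same memory, viewed as uint8

# Reinterpret the bytes back as float32: no value conversion, just a dtype view.
recovered = raw.view(np.float32)
assert (recovered == scales).all()
```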

@simon-mo
Collaborator

simon-mo commented Oct 9, 2025

How's the perf?

@zyongye
Member Author

zyongye commented Oct 9, 2025

Perf is not looking very good on TP8. I ran it with 1000 input / 1000 output tokens × 256 requests; results below.

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  121.73    
Total input tokens:                      255744    
Total generated tokens:                  256000    
Request throughput (req/s):              2.10      
Output token throughput (tok/s):         2103.09   
Peak output token throughput (tok/s):    2816.00   
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          4204.07   
---------------Time to First Token----------------
Mean TTFT (ms):                          15810.20  
Median TTFT (ms):                        15601.77  
P99 TTFT (ms):                           30674.02  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          104.92    
Median TPOT (ms):                        105.18    
P99 TPOT (ms):                           118.25    
---------------Inter-token Latency----------------
Mean ITL (ms):                           104.92    
Median ITL (ms):                         91.65     
P99 ITL (ms):                            960.69    
==================================================

I am not sure if DEP8 is fixed by now.

@heheda12345
Collaborator

It's better than the result here.
https://vllm-dev.slack.com/archives/C09JJDVPZH6/p1759219412446959

Maybe you can try benchmarking the perf before / after this PR.

Collaborator

@heheda12345 heheda12345 left a comment

Approving to unblock, but I'd prefer to wait for more benchmark results.

@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025
@zyongye zyongye changed the title Integrate cuda indexer k cache gather [Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather Oct 14, 2025
@zyongye
Member Author

zyongye commented Oct 14, 2025

Without the custom CUDA kernel, 1000:1000 × 256. The CUDA kernel offers about a 5% throughput increase.

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  126.09    
Total input tokens:                      255744    
Total generated tokens:                  256000    
Request throughput (req/s):              2.03      
Output token throughput (tok/s):         2030.32   
Peak output token throughput (tok/s):    2816.00   
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          4058.60   
---------------Time to First Token----------------
Mean TTFT (ms):                          18072.04  
Median TTFT (ms):                        17847.94  
P99 TTFT (ms):                           35035.26  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          107.02    
Median TPOT (ms):                        107.29    
P99 TPOT (ms):                           122.44    
---------------Inter-token Latency----------------
Mean ITL (ms):                           107.02    
Median ITL (ms):                         91.53     
P99 ITL (ms):                            1096.93   
==================================================
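Comparing the two reported runs directly (numbers copied from the benchmark tables above):

```python
# Total token throughput (tok/s): with CUDA kernel vs. torch baseline.
with_cuda, baseline = 4204.07, 4058.60
throughput_gain = (with_cuda - baseline) / baseline * 100  # ~3.6% more throughput

# Mean TTFT (ms): lower is better.
ttft_cuda, ttft_base = 15810.20, 18072.04
ttft_reduction = (ttft_base - ttft_cuda) / ttft_base * 100  # ~12.5% faster first token

# Mean TPOT (ms): lower is better.
tpot_cuda, tpot_base = 104.92, 107.02
tpot_reduction = (tpot_base - tpot_cuda) / tpot_base * 100  # ~2% faster per output token
```

By total token throughput the gain works out to roughly 3.6%, with the larger improvement showing up in TTFT, which is where a faster prefill-side gather would be expected to help most.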

@youkaichao youkaichao merged commit f5ed68e into vllm-project:main Oct 15, 2025
55 checks passed