[Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather#26456

Merged
youkaichao merged 7 commits into vllm-project:main from zyongye:cuda_indexer_k_gather
Oct 15, 2025

Conversation

@zyongye
Member

@zyongye zyongye commented Oct 9, 2025

Replace the torch-based cp_gather_indexer_k_quant_cache with a CUDA op.
Follow-up to #25931

gsm8k (20-shot):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9378|±  |0.0067|
|     |       |strict-match    |    20|exact_match|↑  |0.9386|±  |0.0066|
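For context, the op being replaced is essentially a paged-cache gather: quantized K entries are copied from scattered cache blocks into a contiguous per-request buffer. A minimal numpy sketch of the gather pattern (shapes, names, and layout here are illustrative assumptions, not vLLM's actual kernel interface):

```python
import numpy as np

# Hypothetical paged cache layout: num_blocks blocks of block_size tokens,
# each token a quantized head_dim-byte key vector (uint8 standing in for fp8 storage).
num_blocks, block_size, head_dim = 8, 4, 16
rng = np.random.default_rng(0)
k_cache = rng.integers(0, 256, size=(num_blocks, block_size, head_dim), dtype=np.uint8)

def gather_k(block_table: np.ndarray, seq_len: int) -> np.ndarray:
    """Gather one request's keys from its cache blocks into a contiguous buffer."""
    out = np.empty((seq_len, head_dim), dtype=np.uint8)
    for tok in range(seq_len):
        blk = block_table[tok // block_size]  # which physical block holds this token
        off = tok % block_size                # offset within that block
        out[tok] = k_cache[blk, off]
    return out

# A request whose 6 tokens live in physical blocks 5 and 2.
gathered = gather_k(np.array([5, 2]), seq_len=6)
assert (gathered[0] == k_cache[5, 0]).all()
assert (gathered[4] == k_cache[2, 0]).all()
```

The CUDA kernel performs this copy (plus the per-token scale handling) in parallel on-device, avoiding the Python-level indexing the torch implementation needed.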

@mergify mergify bot added the deepseek Related to DeepSeek models label Oct 9, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

I've reviewed the changes to integrate the CUDA cp_gather_indexer_k_quant_cache operator. The replacement of the Python-based implementation with the custom CUDA operator is a great performance enhancement. The adjustments at the call site, including the updated allocation for k_scale and the removal of the redundant num_reqs parameter, are correctly handled. The use of .view(torch.float32) on the k_scale tensor before passing it to fp8_mqa_logits is also appropriate given the change in its underlying data type. Overall, this is a solid and well-executed optimization. Great work!
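The `.view(torch.float32)` call mentioned above reinterprets the scale tensor's raw bytes rather than converting values, since k_scale is now stored as bytes alongside the quantized cache. numpy's `ndarray.view` has the same dtype-reinterpretation semantics, which this small illustration uses (the torch call behaves analogously):

```python
import numpy as np

# Four float32 scales laid out as raw bytes, as they might sit in a byte-typed cache.
scales = np.array([0.5, 1.0, 2.0, 0.25], dtype=np.float32)
raw = scales.view(np.uint8)  # 16 bytes: same memory, viewed as uint8

# Reinterpret the bytes back as float32: no value conversion, just a dtype view.
recovered = raw.view(np.float32)
assert (recovered == scales).all()
```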

@simon-mo
Collaborator

simon-mo commented Oct 9, 2025

How's the perf?

@zyongye
Member Author

zyongye commented Oct 9, 2025

Perf is not looking very good on TP8. I ran it with 1000 input / 1000 output tokens × 256 requests; results below.

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  121.73    
Total input tokens:                      255744    
Total generated tokens:                  256000    
Request throughput (req/s):              2.10      
Output token throughput (tok/s):         2103.09   
Peak output token throughput (tok/s):    2816.00   
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          4204.07   
---------------Time to First Token----------------
Mean TTFT (ms):                          15810.20  
Median TTFT (ms):                        15601.77  
P99 TTFT (ms):                           30674.02  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          104.92    
Median TPOT (ms):                        105.18    
P99 TPOT (ms):                           118.25    
---------------Inter-token Latency----------------
Mean ITL (ms):                           104.92    
Median ITL (ms):                         91.65     
P99 ITL (ms):                            960.69    
==================================================

I am not sure if DEP8 is fixed by now.

@heheda12345
Collaborator

It's better than the result here.
https://vllm-dev.slack.com/archives/C09JJDVPZH6/p1759219412446959

Maybe you can try benchmarking the perf before / after this PR.

Collaborator

@heheda12345 heheda12345 left a comment

Approving to unblock, but I'd prefer to wait for more benchmark results.

@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025
@zyongye zyongye changed the title Integrate cuda indexer k cache gather [Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather Oct 14, 2025
@zyongye
Member Author

zyongye commented Oct 14, 2025

Without the custom CUDA kernel, 1000:1000 × 256. The CUDA kernel offers about a 5% throughput increase.

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  126.09    
Total input tokens:                      255744    
Total generated tokens:                  256000    
Request throughput (req/s):              2.03      
Output token throughput (tok/s):         2030.32   
Peak output token throughput (tok/s):    2816.00   
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          4058.60   
---------------Time to First Token----------------
Mean TTFT (ms):                          18072.04  
Median TTFT (ms):                        17847.94  
P99 TTFT (ms):                           35035.26  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          107.02    
Median TPOT (ms):                        107.29    
P99 TPOT (ms):                           122.44    
---------------Inter-token Latency----------------
Mean ITL (ms):                           107.02    
Median ITL (ms):                         91.53     
P99 ITL (ms):                            1096.93   
==================================================
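Comparing the two reported runs directly (numbers copied from the benchmark tables above):

```python
# Total token throughput (tok/s): with CUDA kernel vs. torch baseline.
with_cuda, baseline = 4204.07, 4058.60
throughput_gain = (with_cuda - baseline) / baseline * 100  # ~3.6% more throughput

# Mean TTFT (ms): lower is better.
ttft_cuda, ttft_base = 15810.20, 18072.04
ttft_reduction = (ttft_base - ttft_cuda) / ttft_base * 100  # ~12.5% faster first token

# Mean TPOT (ms): lower is better.
tpot_cuda, tpot_base = 104.92, 107.02
tpot_reduction = (tpot_base - tpot_cuda) / tpot_base * 100  # ~2% faster per output token
```

By total token throughput the gain works out to roughly 3.6%, with the larger improvement showing up in TTFT, which is where a faster prefill-side gather would be expected to help most.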

@youkaichao youkaichao merged commit f5ed68e into vllm-project:main Oct 15, 2025
55 checks passed