Vulkan: flash attention for DeepSeek models #584
Merged
Conversation
* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes
This PR is a cherry-pick of PR 14509 in mainline `llama.cpp` with minor adaptations, and adds FA for the DeepSeek models to the Vulkan back-end.

Caveats:
* When running `perplexity` with default parameters, where the context is set to 512 tokens while the batch size is 2048 tokens, one gets NaNs after the first context chunk. I have spent the better part of the day trying to understand the reason, and just don't see it. Almost prepared to give a bounty to the person who finds the bug.
* The KV cache must be `fp16`, as I have not implemented the various additions required to make a quantized cache work with DeepSeek models in the Vulkan back-end (a quantized KV cache can of course be used with models that do not use MLA); see the sketch after this list.
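To make the second caveat concrete, here is a minimal sketch of the kind of check it implies; the function name and parameters are hypothetical and are not taken from the actual patch:

```cpp
// Hypothetical sketch: when the K and V head sizes differ (the DeepSeek/MLA
// case, e.g. 192/128 or 576/512), only the f16 KV-cache path exists in the
// Vulkan back-end for now, so FA has to be declined for quantized caches.
static bool vk_fa_cache_supported(int head_size_k, int head_size_v,
                                  bool k_is_f16, bool v_is_f16) {
    if (head_size_k != head_size_v) {
        return k_is_f16 && v_is_f16; // mixed head sizes: f16 cache only
    }
    return true; // equal head sizes: a quantized KV cache is also fine
}
```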
I have tested with DeepSeek-V2-Lite on an RTX-4080 GPU with coopmat2 enabled. We are starting to see more significant performance gains compared to mainline `llama.cpp`, as illustrated in the following two graphs. The first graph shows PP-2048 performance as a function of the number of tokens in the KV cache, `N_KV`. Surprisingly, we don't see significant performance gains from `mla = 3` compared to `mla = 1` as we do with CUDA (see below). Nevertheless, at 32k tokens `ik_llama.cpp` is about 40% faster than `llama.cpp`.
The next graph compares TG performance as a function of `N_KV`. Here the performance gains compared to mainline are even greater, with `ik_llama.cpp` nearly 2X faster than `llama.cpp` for a context of 32k tokens.

Before you get too excited about these results, a reminder that the Vulkan back-end does not yet implement the fused MoE `ffn_up`+`ffn_gate` op, so it is still far behind CUDA. The next two graphs compare PP and TG performance as a function of `N_KV` on the same RTX-4080 GPU.
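For readers unfamiliar with the fused op mentioned above, the idea is roughly this: instead of evaluating `ffn_up` and `ffn_gate` as two separate matrix-multiplication ops, a single pass computes both projections and applies the gating activation. The sketch below is purely conceptual (scalar CPU code, hypothetical names), not the actual CUDA implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation used for the gate in LLaMA-style FFNs.
static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Conceptual fused ffn_up + ffn_gate: one pass over the input computes both
// projections and combines them, instead of two separate matmul ops.
// x: [n_in], w_up and w_gate: [n_out][n_in], result: [n_out].
std::vector<float> fused_up_gate(const std::vector<float> & x,
                                 const std::vector<std::vector<float>> & w_up,
                                 const std::vector<std::vector<float>> & w_gate) {
    std::vector<float> out(w_up.size());
    for (size_t i = 0; i < w_up.size(); ++i) {
        float up = 0.0f, gate = 0.0f;
        for (size_t j = 0; j < x.size(); ++j) {
            up   += w_up[i][j]   * x[j];   // ffn_up projection
            gate += w_gate[i][j] * x[j];   // ffn_gate projection
        }
        out[i] = silu(gate) * up;          // gated activation, fused into the same pass
    }
    return out;
}
```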