Vulkan: flash attention for DeepSeek models #584
Merged
Conversation
* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes
This PR is a cherry-pick of PR 14509 in mainline `llama.cpp` with minor adaptations, and adds FA for the DeepSeek models to the Vulkan back-end.

Caveats:
* When running `perplexity` with default parameters, where the context is set to 512 tokens while the batch size is 2048 tokens, one gets NaNs after the first context chunk. I have spent the better part of the day trying to understand the reason, and just don't see it. Almost prepared to give a bounty to the person who finds the bug.
* The KV cache must be `fp16`, as I have not implemented the various additions required to make a quantized cache work with DeepSeek models in the Vulkan back-end (a quantized KV cache can of course be used with models that do not use MLA); see the sketch after this list.
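To make the second caveat concrete, here is a minimal sketch of the kind of check it implies; the function name and parameters are hypothetical and are not taken from the actual patch:

```cpp
// Hypothetical sketch: when the K and V head sizes differ (the DeepSeek/MLA
// case, e.g. 192/128 or 576/512), only the f16 KV-cache path exists in the
// Vulkan back-end for now, so FA has to be declined for quantized caches.
static bool vk_fa_cache_supported(int head_size_k, int head_size_v,
                                  bool k_is_f16, bool v_is_f16) {
    if (head_size_k != head_size_v) {
        return k_is_f16 && v_is_f16; // mixed head sizes: f16 cache only
    }
    return true; // equal head sizes: a quantized KV cache is also fine
}
```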
I have tested with DeepSeek-V2-Lite on an RTX-4080 GPU with coopmat2 enabled. We are starting to see more significant performance gains compared to mainline `llama.cpp`, as illustrated in the following two graphs. The first graph shows PP-2048 performance as a function of the number of tokens in the KV cache, `N_KV`. Surprisingly, we don't see significant performance gains from `mla = 3` compared to `mla = 1` as we do with CUDA (see below). Nevertheless, at 32k tokens `ik_llama.cpp` is about 40% faster than `llama.cpp`.
The next graph compares TG performance as a function of `N_KV`. Here the performance gains compared to mainline are even greater, with `ik_llama.cpp` nearly 2X faster than `llama.cpp` for a context of 32k tokens.

Before you get too excited about these results, a reminder that the Vulkan back-end does not yet implement the fused MoE `ffn_up`+`ffn_gate` op, so it is still far behind CUDA. The next two graphs compare PP and TG performance as a function of `N_KV` on the same RTX-4080 GPU.
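For readers unfamiliar with the fused op mentioned above, the idea is roughly this: instead of evaluating `ffn_up` and `ffn_gate` as two separate matrix-multiplication ops, a single pass computes both projections and applies the gating activation. The sketch below is purely conceptual (scalar CPU code, hypothetical names), not the actual CUDA implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation used for the gate in LLaMA-style FFNs.
static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Conceptual fused ffn_up + ffn_gate: one pass over the input computes both
// projections and combines them, instead of two separate matmul ops.
// x: [n_in], w_up and w_gate: [n_out][n_in], result: [n_out].
std::vector<float> fused_up_gate(const std::vector<float> & x,
                                 const std::vector<std::vector<float>> & w_up,
                                 const std::vector<std::vector<float>> & w_gate) {
    std::vector<float> out(w_up.size());
    for (size_t i = 0; i < w_up.size(); ++i) {
        float up = 0.0f, gate = 0.0f;
        for (size_t j = 0; j < x.size(); ++j) {
            up   += w_up[i][j]   * x[j];   // ffn_up projection
            gate += w_gate[i][j] * x[j];   // ffn_gate projection
        }
        out[i] = silu(gate) * up;          // gated activation, fused into the same pass
    }
    return out;
}
```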