Merged
added 7 commits on January 28, 2025
Similar to pg, but it only looks at TG speed with a given prompt length.
They still need to be divisible by 32.
... on Zen4. Also fix q8_0 K-cache for head sizes that are not multiple of 128.
ikawrakow pushed a commit that referenced this pull request on Jan 29, 2025
ikawrakow added a commit that referenced this pull request on Jan 30, 2025

* Slightly faster AVX2 implementation for q4_k_r4

* Even better AVX2 implementation for q4_k_r4

  We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 291 t/s when I last measured on 3c5f872. With FA and Q8_0 K-cache we get to 339.5 t/s.

* Fix llama-bench labels that I broke with #181

* Faster AVX2 implementation for q5_k_r4

  We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 273 t/s.

* Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4

  After the changes I made to AVX2, it ends up being slightly faster compared to what I had for Zen4.

* Minor tweak

* Cleanup

Co-authored-by: Iwan Kawrakow <[email protected]>
This PR started with me adding the `-gp` option to `llama-bench`, as per ggml-org/llama.cpp#11126, because I wanted to test TG performance after a long prompt, to be able to compare against the MLA attention implementation in ggml-org/llama.cpp#11446.

But then I noticed that the repacked `Q8_0` and `Q4_0` quants do not work for row tensor sizes that are not a multiple of 128 (4 x block size of 32), which is the case for some of the tensors in Deepseek2-Lite that I used for testing, so I fixed that.

And then, while comparing performance after the fix on `Llama-3.2-1B`, I noticed that FA with `Q8_0` K-cache does not work. `Llama-3.2-1B` has a head size of 64, and there was a comment in the code that `Q8_0` does not work for head sizes less than 128, so I fixed that as well.