llama : rotate activations for better quantization #21038
Conversation
|
This PR does seem to improve PPL compared to the tip of master. I ran everything with the CPU-only backend so that I could also compare https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0 from @Aaryan-Kapoor and see how that specific implementation fares. For reference:
Data Results: mainline llama.cpp PPL on wiki.test.raw
👈 Details
Experiment: measure perplexity against wiki.test.raw while varying KV-cache quantization. Test Quant / Test Rig:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_VULKAN=OFF
cmake --build build --config Release -j $(nproc) |
|
I've run a pretty big permutation matrix for the Qwen3.5-9B model against the master branch and this PR branch. The following images are the ratios vs F16/F16. Here's the table of data comparing master vs this PR:
And attached here is the full raw run data for every invocation: |
Interesting results, which indicate that Q4_0, even before this PR, is superior to TQ3_0. However, as a word of caution, this is likely due to the very experimental and early implementation in that specific fork, and not indicative of the actual performance of TurboQuant, which is supposed to be effectively lossless compared to an fp16 KV cache. That is not the case here. |
|
I think we can actually rotate the V tensor using smaller matrices (64 x 64), which should result in better quality of the V cache. We cannot do that for Q and K because it would not preserve the dot product. Pushed a change to do that in 832e326, and just from a quick PPL sanity check it looks slightly better. Note: using 64 instead of 32 because of a requirement of the Metal matrix multiplication kernels. Edit:
On second thought, I think it does preserve it. Something to try too. Here is the patch (more rotations for Q and K):
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 8dfc92b71..84bcf26be 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -2096,8 +2096,9 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
const bool can_rot =
!hparams.is_n_embd_k_gqa_variable() &&
!hparams.is_n_embd_v_gqa_variable() &&
- ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+ //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
//ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+ hparams.n_embd_head_k() % 64 == 0 &&
hparams.n_embd_head_v() % 64 == 0 &&
hparams.n_embd_head_k() >= 64 &&
hparams.n_embd_head_v() >= 64 &&
@@ -2105,7 +2106,7 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
ggml_is_quantized(mctx_cur->type_v());
if (can_rot) {
- const auto nk = hparams.n_embd_head_k();
+ const auto nk = 64;
const auto nv = 64;
inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
@@ -2453,8 +2454,9 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
const bool can_rot =
!hparams.is_n_embd_k_gqa_variable() &&
!hparams.is_n_embd_v_gqa_variable() &&
- ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+ //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
//ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+ hparams.n_embd_head_k() % 64 == 0 &&
hparams.n_embd_head_v() % 64 == 0 &&
hparams.n_embd_head_k() >= 64 &&
hparams.n_embd_head_v() >= 64 &&
@@ -2462,7 +2464,7 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
ggml_is_quantized(mctx_cur->get_base()->type_v());
if (can_rot) {
- const auto nk = hparams.n_embd_head_k();
+ const auto nk = 64;
const auto nv = 64;
    inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
|
It can't approach turboquant's performance because it's missing the QJL part of turboquant. (I think?) |
|
I re-ran the suite against the updated commit; looks like the KLD has generally improved a little bit across the board! Heatmap results:
Details data table
Full run archives: |
What about higher context, for example 100K? I believe that's where KV cache quantization actually harms the model, due to accumulated multiplication errors. |
|
@AesSedai Thanks for the results. It seems important to track the KLD rather than PPL (maybe more significant for Qwen3.5). I did some additional tests with randomized Hadamard matrices (as suggested by @sashkboos in #6444 (comment)). Will follow-up in next PRs. |
|
|
auto * data = (float *) tensor->data;

data[0*n + 0] = 1.0 / sqrtf(n);
This always gets me; at least this one makes aesthetic sense. :)
|
There is a slowdown, which is expected; however, should we perhaps have a flag to opt out?
|
PolarQuant is just a rotation by a random matrix to spread the energy of any outliers across all dimensions prior to quantization. A normalized Hadamard matrix may or may not produce similar results. QJL is extremely important, however: PolarQuant (and your Hadamard) give biased attention scores (despite the good-seeming MSE), which results in rank flips. QJL uses the "unreasonable effectiveness of random projections" to give unbiased results. The theory (more accurately, the lemma) is that randomly projecting a vector to lower dimensions preserves pairwise distances (and therefore dot products) in expectation. TurboQuant quantizes the error residual all the way to 1-bit (the sign), which somehow works well enough. The best technical yet intuitive explanation I've seen on the subject so far is by an AI researcher from Amazon: https://darshanfofadiya.com/research-papers/turboquant/ |
I don't think we should; the only reason to quantize the KV cache is to save memory, and if that comes at the cost of speed, so be it. An option to reduce quality does not make sense (unless you think this is not a general quality improvement). |
|
My opinion is that any new breaking change should have an option to turn it off, in case of unforeseen issues. This change alters the outputs of the LLM (even if they are materially better), so, just as an example, someone with downstream tests would see those tests break.
I think when using |
You could say we are in the business of making breaking changes. If anything, I guess adding an env var for some grace period would be sufficient.
Well, until you pointed out the below, I would have said that no such tests were likely to exist. :)
|
I disagree. You could use a quantized KV cache to save memory in order to put more layers on the GPU. In that case, its purpose would be to increase speed, and this change would be counterproductive to that goal. |
You'd most likely gain much more than you lose, so calling it counterproductive is perhaps a bit far-fetched. |
Actually, I'm not sure it would be worth having on by default for q8_0, since the benefits are much smaller compared to q4 and the performance loss is significant. Details
|
Granted, it's probably pointless for q8_0, so perhaps add a check. |
|
It seems that evaluating AIME25 could be another sensitivity test for confirming the improvement from rotating the activations. I did a few runs today. Here is the table, with the hyperlinks preserved for each score:
Used the new […]. It's interesting that with […]. Regarding […]. For reference, the expected score of this eval per the […].
|
Yeah, based on the score there is a clear benefit from rotating activations for q8_0, and I think it makes sense to have it on by default: the purpose of q8_0 is to save KV cache memory at the same quality, and with attn-rot it is noticeably closer to that goal. Furthermore, I'm not sure whether AIME25 tests at long context as well; quality differences might become even more noticeable at longer context. Performance is a concern, however. In the end, I think the best course of action would be to make attn-rot an opt-in/opt-out option with a simple flag. |
Add MXFP (Microscaling Floating Point) KV cache types for flash attention:
- MXFP4 (E2M1, 4-bit), MXFP6 (E2M3, 6-bit), MXFP8 (E4M3, 8-bit)
- SoA layout for flash attention: [qs_blocks | e8m0_scales] per head
- Element converters with round-to-nearest for E8M0 and mantissa
- FP6 pack/unpack (4 elements in 3 bytes)

Integrate with upstream's graph-level rotation infrastructure (ggml-org#21038):
- MXFP K types use 32x32 rotation matrix (block-aligned with MXFP groups)
- Matrix contains D*H: Davis-Jedwab zigzag sign diagonal (0x3C5A6600) composed with Walsh-Hadamard, generated at init
- ggml_mul_mat_aux reshapes to blocks of 32 -> applies D*H per MXFP group
- V rotation disabled for MXFP (hurts well-conditioned models)
- No rotation code in set_rows or FA kernel — graph handles it

CPU flash attention path:
- SoA set_rows: quantizes pre-rotated K/V to SoA MXFP format
- FA kernel: dequantizes SoA K/V to F32 for dot product/accumulation
- Q is already D*H-rotated by graph — used directly in F32
|
I'm excited to try this out, but at the same time I'm slightly concerned that - unless I missed it, in which case I apologize - throughout this whole PR there were no tests done on long contexts, and mainly just speed-related testing on anything larger than 9B. I admit I'm relatively new to this field, but shouldn't larger models (>70B) and longer contexts (>128K) be tested more before merging? |
I did mention this but I got put in the naughty corner. |
Applies Walsh-Hadamard rotation to Q, K, V activations before KV cache quantization, dramatically improving quality for all quantized KV types. Ported from ggml-org/llama.cpp#21038 (merged 2026-04-01) with fixes for PrismML's API differences (n_embd_head_k as member vs function).

Key changes:
- Pre-computed Hadamard matrices stored in llama_kv_cache
- Q/K rotated with largest power-of-2 matrix dividing head_dim
- V rotated with fixed 64x64 matrix (empirically better)
- Attention output inverse-rotated after computation
- RoPE shift gets rotation sandwich (inverse before, forward after)
- LLAMA_ATTN_ROT_DISABLE=1 env var to opt out
- Auto-enabled when KV type is quantized and head_dim % 64 == 0
- TBQ4_0 also gets rotation (double rotation with internal FWHT)

Results with q4_0 KV: 48.6 t/s gen (vs 34.4 t/s TBQ4_0), coherent output.
|
YOLO |
|
@nawoa FWIW, I just ran a quick (by quick I mean it took an hour) 7 cases from InfiniteBench on Qwen3.5 9B with Q8 quants and got 7/7 (InfiniteBench tests needle-in-a-haystack with contexts over 128k). |
|
If someone wants to give me a command line with whatever arguments necessary, I can run a more comprehensive test overnight on my RTX 6000 Pro 96GB with the model of your choice. I'll be going to bed in 3-4 hours. |
|
@nawoa it's all there: https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/infinite_bench/ To use a local model, you have to: […] |
|
Not showing up in the releases because of a failing test in CI: https://github.com/ggml-org/llama.cpp/actions/runs/23837492532/job/69484676982 |
Get b8624 or later instead, there was an intermittent failure for a few releases. |
|
OK, so I tried @pwilkin's suggested benchmark. I ran the first 100 tests of infinite_bench_longbook_choice_eng using Qwen3.5 35B in instruct mode on a Blackwell 5000 (48GB VRAM). Unfortunately, it turns out this test is not sensitive enough to cache quantization: all quants yielded the same results (inside the error bars). |
|
@erazortt you can parametrize the test to make it harder, for example, add more needles. |
Merges ggml-org/llama.cpp upstream (d23355a..7992aa7) including:
- Gemma 4 model support (PR ggml-org#21309)
- KV cache rotation for better quantization (ggml-org#21038)
- Auto GPU memory fitting (llama_params_fit)
- Many new model architectures (Qwen3.5, Kimi K2, LFM2, etc.)

C++14/CUDA 7.5 compatibility fixes applied to merged code:
- Replaced if constexpr with runtime if across CUDA files
- Replaced constexpr __device__ functions with macros
- Replaced structured bindings with .first/.second access
- Replaced std::string_view/std::optional with std::string
- Template specializations for ggml_cuda_cast (convert.cuh)
- BF16 flash attention guarded behind CUDART_VERSION >= 11000
- Eager CUDA context init restored for accurate VRAM on non-VMM GPUs
- Jinja C++17 structured bindings fixed (caused Qwen 3.5 segfault)

Build system updates:
- Added hf-cache-stub.cpp, server-tools-stub.cpp for C++14 compat
- Added mtmd-image.cpp, httplib.cpp to build
- convert_hf_to_gguf.py patched for PyTorch 1.13 compatibility
- gguf vocab.py fallback for old tokenizers library

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rg#21038) Upstream master now applies Walsh-Hadamard rotation to K/V/Q before KV cache storage (commit 744c0c7). This is the same rotation TBQ was doing independently, causing double rotation after the merge.

TBQ types are now pure codebook quantizers:
- SET_ROWS: normalize + codebook quantize + pack (no FWHT)
- FA dequant: codebook lookup + scale (no Q pre-rotation, no V inverse rotation)
- Standalone dequant: codebook lookup + scale (no inverse FWHT)

Removes ~200 lines of rotation code from CUDA and CPU paths. Fixes garbage output caused by double WHT rotation after upstream merge.
- Register GGML_TYPE_TURBO3_0 and GGML_TYPE_TURBO4_0 in kv_cache_types so --cache-type-k turbo3 / --cache-type-v turbo3 are recognized
- Fix double V un-rotation: upstream PR ggml-org#21038 (attn_rot_v) already handles Hadamard rotation for quantized KV cache types including turbo. Make TurboQuant WHT fallback only when upstream rotation is not active (else if instead of sequential if blocks)












Overview
In anticipation of the incoming flood of vibe-generated PRs implementing TurboQuant, I'm raising the baseline a bit with a very simple interpretation of the idea of using a Hadamard transform to reduce outliers in the attention and improve the quantization quality:
This works because rotating by an orthonormal matrix spreads the energy of outliers across all dimensions before quantization, so the per-block scales are no longer dominated by a few large values, and the inverse rotation after the attention recovers the original result.
The implementation is very simple and backend-agnostic: use a Hadamard matrix of size n x n, normalized by 1/sqrt(n) so that it is orthonormal and can be used both for the forward and the backward rotation. Technically any rotation matrix (and its inverse) should work; I just think this is what is commonly used due to its simplicity. The implementation does not introduce new types and is compatible with all existing quantizations. It adds 4 matrix multiplication operators in the attention, though I think some of them can be fused into the attention weights (similar to QuaRot).

I don't know what the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.) is. They could be important and could potentially improve further on top of this. In any case, having a better baseline at almost zero cost won't hurt. Based only on the PPL data below, I think this should never be worse than before, though it needs a bit more evaluation.
Note: MLA is not supported
Additional information
Here are some PPL results before and after using base models. Ideally, there should be KLD data too, but leaving it for people to play with it and see if it looks good.
Model: Qwen3 0.6B BF16
https://huggingface.co/Qwen/Qwen3-0.6B-Base
Baseline F16 cache: PPL = 13.6711 +/- 0.21422
Model: Qwen3 8B BF16
https://huggingface.co/Qwen/Qwen3-8B-Base
Baseline F16 cache: PPL = 7.3203 +/- 0.09901
Model: Gemma3 4B Q8_0
https://huggingface.co/google/gemma-3-4b-pt
Baseline F16 cache: PPL = 7.6905 +/- 0.10483
Model: Qwen3.5 4B F16
https://huggingface.co/Qwen/Qwen3.5-4B-Base
Baseline F16 cache: PPL = 8.3266 +/- 0.11623
TODOs
(LLAMA_ATTN_ROT_DISABLE)
Next PRs
Requirements