llama : rotate activations for better quantization #21038

Merged: ggerganov merged 12 commits into master from gg/attn-rot on Apr 1, 2026
Conversation

@ggerganov
Member

@ggerganov ggerganov commented Mar 26, 2026

Overview

In anticipation of the incoming flood of vibe-generated PRs implementing TurboQuant, I'm raising the baseline a bit with a very simple application of the idea of using a Hadamard transform to reduce outliers in the attention activations and improve quantization quality:

  • Rotate input Q, K, V and store in cache
  • Perform attention with the rotated Q, K, V
  • Rotate back the output vector of the attention

This works because:

  • Rotation does not affect the dot product of 2 vectors
  • The output vector is a linear combination of rotated vectors

The implementation is very simple and backend-agnostic - use a Hadamard matrix of size n x n, normalized by 1/sqrt(n) so that it is orthonormal and can be used for both the forward and backward rotation. Technically any rotation matrix (and its inverse) should work - I just think this one is commonly used due to its simplicity. The implementation does not introduce new types and is compatible with all existing quantizations. It adds 4 matrix multiplication operators in the attention, though I think some of them can be fused into the attention weights (similar to QuaRot).
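The two properties above can be checked with a small standalone sketch (an illustration only, not the llama.cpp code): build the normalized Hadamard matrix via the Sylvester construction, rotate two vectors, and confirm that the dot product is unchanged while the outlier energy gets spread across dimensions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sylvester construction: H_{2m} = [[H_m, H_m], [H_m, -H_m]], then
// normalize by 1/sqrt(n) so the matrix is orthonormal (H * H^T = I).
// n must be a power of 2.
std::vector<float> hadamard(int n) {
    std::vector<float> h(n * n, 0.0f);
    h[0] = 1.0f;
    for (int m = 1; m < n; m *= 2) {
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < m; ++j) {
                const float v = h[i * n + j];
                h[i * n + (j + m)]       =  v;
                h[(i + m) * n + j]       =  v;
                h[(i + m) * n + (j + m)] = -v;
            }
        }
    }
    const float s = 1.0f / std::sqrt((float) n);
    for (auto & v : h) v *= s;
    return h;
}

// y = H * x
std::vector<float> rotate(const std::vector<float> & h, const std::vector<float> & x, int n) {
    std::vector<float> y(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            y[i] += h[i * n + j] * x[j];
        }
    }
    return y;
}

float dot(const std::vector<float> & a, const std::vector<float> & b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
```

Because the rotation is orthonormal, `dot(H*q, H*k) == dot(q, k)` up to float rounding, which is why attention over rotated Q/K/V gives the same scores, while a large outlier coordinate in `q` is smeared across all components before quantization.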

I don't know what the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.) is. They could be important and can potentially improve further on top of this. In any case, having a better baseline at almost 0 cost won't hurt. Based only on the PPL data below, I think this should never be worse than before, though it needs a bit more evaluation.

Note: MLA is not supported

Additional information

Here are some PPL results before and after, using base models. Ideally, there should be KLD data too, but I'm leaving that for people to play with and see if it looks good.

./bin/llama-perplexity -m model.gguf -f wiki.test.raw --chunks 128 -ctk f16 -ctv f16

Model: Qwen3 0.6B BF16

https://huggingface.co/Qwen/Qwen3-0.6B-Base

Baseline F16 cache: PPL = 13.6711 +/- 0.21422

| type | master | PR |
|------|---------|---------|
| q8_0 | 13.9115 | 13.6713 |
| q5_1 | 61.6992 | 14.1452 |
| q5_0 | 17.2805 | 14.2171 |
| q4_1 | 212.479 | 22.2816 |
| q4_0 | 62.0161 | 46.2503 |

Model: Qwen3 8B BF16

https://huggingface.co/Qwen/Qwen3-8B-Base

Baseline F16 cache: PPL = 7.3203 +/- 0.09901

| type | master | PR |
|------|--------|--------|
| q8_0 | 7.3172 | 7.3195 |
| q5_0 | 7.3793 | 7.3323 |
| q4_0 | 7.6451 | 7.5012 |

Model: Gemma3 4B Q8_0

https://huggingface.co/google/gemma-3-4b-pt

Baseline F16 cache: PPL = 7.6905 +/- 0.10483

| type | master | PR |
|------|--------|--------|
| q8_0 | 7.6928 | 7.6914 |
| q5_1 | 7.7133 | 7.7052 |
| q5_0 | 7.7544 | 7.7182 |
| q4_1 | 7.8095 | 7.7531 |
| q4_0 | 7.8535 | 7.7928 |

Model: Qwen3.5 4B F16

https://huggingface.co/Qwen/Qwen3.5-4B-Base

Baseline F16 cache: PPL = 8.3266 +/- 0.11623

| type | master | PR |
|------|--------|--------|
| q8_0 | 8.3272 | 8.3261 |
| q5_0 | 8.3359 | 8.3339 |
| q4_0 | 8.3475 | 8.3448 |

TODOs

  • Cache shift support ff76c67
  • Disable rotations with env variable (LLAMA_ATTN_ROT_DISABLE)

Next PRs

  • Add randomized Hadamard matrices (f0fea26)

Requirements

@ubergarm
Contributor

ubergarm commented Mar 27, 2026

This PR does seem to improve PPL as compared to tip of master using -ctk q4_0 -ctv q4_0.

I ran everything with the CPU-only backend, so as to also compare https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0 from @Aaryan-Kapoor and see how that specific tq3_0 implementation fares.

For reference:

  • f16 = 16 bpw
  • q4_0 = 4.5 bpw
  • tq3_0 = 3.5 bpw

Data results: mainline llama.cpp PPL on wiki.test.raw

| k-cache | v-cache | baseline master@6861f6509 | PR gg/attn-rot@e5aa067d6 | Aaryan-Kapoor/turboquant-tq3_0@1fb1fb3ab |
|---|---|---|---|---|
| f16 | f16 | 6.5788 +/- 0.04196, 688.81 tok/sec | 6.5788 +/- 0.04196, 721.01 tok/sec | |
| q4_0 | q4_0 | 6.6148 +/- 0.04216, 703.91 tok/sec | 6.5962 +/- 0.04208, 540.63 tok/sec | |
| tq3_0 | tq3_0 | | | 6.6911 +/- 0.04273, 481 tok/sec |

Experiment

Measure perplexity against wiki.test.raw varying kv-cache quantization.

Test Quant

Test Rig

  • AMD EPYC 9975 128-Core w/ 12x64GiB DDR5@6400MT/s NPS1 Single Socket (~538 GB/s via mlc)
  • Compiled CPU-only backend
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_VULKAN=OFF
cmake --build build --config Release -j $(nproc)

Command

# seed does nothing as no sampling here, but is fun
model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
    #-ctk f16 -ctv f16 \
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    -ctk q4_0 -ctv q4_0 \
    --seed 1337 \
    --ctx-size 512 \
    -ub 512 -b 2048 \
    --no-mmap \
    --numa numactl \
    --threads 96 \
    --threads-batch 128

@AesSedai
Contributor

I've run a pretty big matrix of kv-cache type permutations for the Qwen3.5-9B model against both the master branch and this PR branch.

The following images are the ratios vs F16/F16:

Master branch PPL and KLD

and this PR PPL and KLD

Here's the table of data comparing master vs this PR
| k-cache | v-cache | master PPL | master mean KLD | attn-rot PPL | attn-rot KLD |
|---|---|---|---|---|---|
| f16 | f16 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 |
| f16 | q4_0 | 8.208920 ± 0.055630 | 0.003747 ± 0.000182 | 8.208920 ± 0.055630 | 0.003747 ± 0.000182 |
| f16 | q4_1 | 8.212052 ± 0.055711 | 0.002985 ± 0.000140 | 8.212052 ± 0.055711 | 0.002985 ± 0.000140 |
| f16 | q5_0 | 8.206032 ± 0.055617 | 0.001752 ± 0.000111 | 8.206032 ± 0.055617 | 0.001752 ± 0.000111 |
| f16 | q5_1 | 8.195687 ± 0.055524 | 0.001393 ± 0.000111 | 8.195687 ± 0.055524 | 0.001393 ± 0.000111 |
| f16 | q8_0 | 8.196533 ± 0.055536 | 0.000804 ± 0.000037 | 8.196533 ± 0.055536 | 0.000804 ± 0.000037 |
| q4_0 | f16 | 8.212304 ± 0.055635 | 0.005426 ± 0.000271 | 8.212304 ± 0.055635 | 0.005426 ± 0.000271 |
| q4_0 | q4_0 | 8.227048 ± 0.055766 | 0.007979 ± 0.000352 | 8.211705 ± 0.055631 | 0.005059 ± 0.000224 |
| q4_0 | q4_1 | 8.223944 ± 0.055773 | 0.007163 ± 0.000298 | 8.209285 ± 0.055635 | 0.004664 ± 0.000196 |
| q4_0 | q5_0 | 8.217618 ± 0.055691 | 0.006165 ± 0.000307 | 8.201787 ± 0.055561 | 0.003708 ± 0.000187 |
| q4_0 | q5_1 | 8.210686 ± 0.055628 | 0.006011 ± 0.000332 | 8.201859 ± 0.055564 | 0.003656 ± 0.000203 |
| q4_0 | q8_0 | 8.209199 ± 0.055617 | 0.005336 ± 0.000272 | 8.198971 ± 0.055537 | 0.003334 ± 0.000160 |
| q4_1 | f16 | 8.219674 ± 0.055730 | 0.004174 ± 0.000239 | 8.219674 ± 0.055730 | 0.004174 ± 0.000239 |
| q4_1 | q4_0 | 8.232653 ± 0.055833 | 0.006612 ± 0.000263 | 8.213936 ± 0.055656 | 0.004702 ± 0.000223 |
| q4_1 | q4_1 | 8.235108 ± 0.055910 | 0.006014 ± 0.000266 | 8.210796 ± 0.055641 | 0.004252 ± 0.000183 |
| q4_1 | q5_0 | 8.229521 ± 0.055827 | 0.004827 ± 0.000253 | 8.204419 ± 0.055595 | 0.003456 ± 0.000191 |
| q4_1 | q5_1 | 8.221232 ± 0.055750 | 0.004739 ± 0.000252 | 8.203722 ± 0.055583 | 0.003297 ± 0.000171 |
| q4_1 | q8_0 | 8.219816 ± 0.055743 | 0.004005 ± 0.000209 | 8.201606 ± 0.055565 | 0.002968 ± 0.000156 |
| q5_0 | f16 | 8.202763 ± 0.055572 | 0.002223 ± 0.000157 | 8.202763 ± 0.055572 | 0.002223 ± 0.000157 |
| q5_0 | q4_0 | 8.213422 ± 0.055674 | 0.005025 ± 0.000243 | 8.211162 ± 0.055646 | 0.003654 ± 0.000244 |
| q5_0 | q4_1 | 8.217085 ± 0.055749 | 0.004181 ± 0.000185 | 8.206240 ± 0.055612 | 0.003055 ± 0.000165 |
| q5_0 | q5_0 | 8.214116 ± 0.055692 | 0.002849 ± 0.000143 | 8.200742 ± 0.055571 | 0.002062 ± 0.000124 |
| q5_0 | q5_1 | 8.202209 ± 0.055575 | 0.002657 ± 0.000170 | 8.200578 ± 0.055565 | 0.002028 ± 0.000141 |
| q5_0 | q8_0 | 8.201038 ± 0.055579 | 0.002310 ± 0.000168 | 8.197975 ± 0.055550 | 0.001553 ± 0.000095 |
| q5_1 | f16 | 8.195862 ± 0.055514 | 0.001843 ± 0.000165 | 8.195862 ± 0.055514 | 0.001843 ± 0.000165 |
| q5_1 | q4_0 | 8.208856 ± 0.055624 | 0.004611 ± 0.000216 | 8.210008 ± 0.055633 | 0.003729 ± 0.000258 |
| q5_1 | q4_1 | 8.210093 ± 0.055686 | 0.003773 ± 0.000176 | 8.206406 ± 0.055621 | 0.002889 ± 0.000152 |
| q5_1 | q5_0 | 8.207489 ± 0.055626 | 0.002754 ± 0.000220 | 8.200513 ± 0.055562 | 0.001825 ± 0.000103 |
| q5_1 | q5_1 | 8.195263 ± 0.055514 | 0.002626 ± 0.000226 | 8.200874 ± 0.055575 | 0.001821 ± 0.000121 |
| q5_1 | q8_0 | 8.194897 ± 0.055520 | 0.001957 ± 0.000178 | 8.198389 ± 0.055547 | 0.001750 ± 0.000182 |
| q8_0 | f16 | 8.197374 ± 0.055531 | 0.000923 ± 0.000075 | 8.197374 ± 0.055531 | 0.000923 ± 0.000075 |
| q8_0 | q4_0 | 8.209297 ± 0.055632 | 0.003642 ± 0.000159 | 8.208469 ± 0.055623 | 0.002935 ± 0.000204 |
| q8_0 | q4_1 | 8.210675 ± 0.055697 | 0.002920 ± 0.000116 | 8.202587 ± 0.055587 | 0.002519 ± 0.000149 |
| q8_0 | q5_0 | 8.207077 ± 0.055629 | 0.001715 ± 0.000095 | 8.197825 ± 0.055538 | 0.001263 ± 0.000053 |
| q8_0 | q5_1 | 8.196398 ± 0.055535 | 0.001484 ± 0.000113 | 8.200065 ± 0.055566 | 0.001187 ± 0.000055 |
| q8_0 | q8_0 | 8.196226 ± 0.055531 | 0.000866 ± 0.000048 | 8.196227 ± 0.055532 | 0.000757 ± 0.000039 |

And attached here is the full raw run data for every invocation:
analysis-results.zip

@Dampfinchen

Dampfinchen commented Mar 27, 2026

> This PR does seem to improve PPL as compared to tip of master using -ctk q4_0 -ctv q4_0. […]

Interesting results, which indicate that Q4_0, even before this PR, is superior to tq3_0. However, as a word of caution: this is likely due to a very experimental and early implementation in this specific fork and not indicative of the actual performance of TurboQuant, which is supposed to be effectively lossless compared to fp16 kv cache. That is not the case here.

@ggerganov
Member Author

ggerganov commented Mar 27, 2026

I think we can actually rotate the V tensor using smaller matrices (64 x 64) which should result in better quality of the V cache. We cannot do that for the Q and K because it would not preserve the dot product.

Pushed a change to do that 832e326 and just from a quick PPL sanity check it looks slightly better.

Note, using 64 instead of 32 because the Metal matrix multiplication kernels require ne00 to be at least 64.

Edit:

> We cannot do that for the Q and K because it would not preserve the dot product.

On second thought, I think it does preserve it. Something to try too. Here is the patch:

More rotations for Q and K
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 8dfc92b71..84bcf26be 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -2096,8 +2096,9 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
         const bool can_rot =
             !hparams.is_n_embd_k_gqa_variable() &&
             !hparams.is_n_embd_v_gqa_variable() &&
-            ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+          //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
           //ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+            hparams.n_embd_head_k() % 64 == 0 &&
             hparams.n_embd_head_v() % 64 == 0 &&
             hparams.n_embd_head_k() >= 64 &&
             hparams.n_embd_head_v() >= 64 &&
@@ -2105,7 +2106,7 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
             ggml_is_quantized(mctx_cur->type_v());
 
         if (can_rot) {
-            const auto nk = hparams.n_embd_head_k();
+            const auto nk = 64;
             const auto nv = 64;
 
             inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
@@ -2453,8 +2454,9 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
         const bool can_rot =
             !hparams.is_n_embd_k_gqa_variable() &&
             !hparams.is_n_embd_v_gqa_variable() &&
-            ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+          //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
           //ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+            hparams.n_embd_head_k() % 64 == 0 &&
             hparams.n_embd_head_v() % 64 == 0 &&
             hparams.n_embd_head_k() >= 64 &&
             hparams.n_embd_head_v() >= 64 &&
@@ -2462,7 +2464,7 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
             ggml_is_quantized(mctx_cur->get_base()->type_v());
 
         if (can_rot) {
-            const auto nk = hparams.n_embd_head_k();
+            const auto nk = 64;
             const auto nv = 64;
 
             inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
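The "on second thought" above can be sanity-checked numerically: a dot product decomposes into a sum over blocks, and each per-block rotation is orthonormal, so applying the same small rotation to every block of Q and K preserves the full dot product. A minimal standalone sketch (2x2 Hadamard blocks standing in for the 64x64 ones; the block size does not affect the argument):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Apply the normalized 2x2 Hadamard rotation independently to each
// consecutive block of 2 elements. Since q.k = sum over blocks of
// q_b.k_b and the per-block rotation is orthonormal, the full dot
// product is preserved. Assumes an even-length vector.
std::vector<float> rotate_blocks(const std::vector<float> & x) {
    const float s = 1.0f / std::sqrt(2.0f);
    std::vector<float> y(x.size());
    for (size_t b = 0; b < x.size(); b += 2) {
        y[b]     = s * (x[b] + x[b + 1]);
        y[b + 1] = s * (x[b] - x[b + 1]);
    }
    return y;
}
```

Per block, (a+b)(c+d)/2 + (a-b)(c-d)/2 = ac + bd, so the preservation holds exactly, block by block.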

@Rotatingxenomorph

> However as a word of caution, this is likely due to a very experimental and early implementation of this specific fork and not indicative of the actual performance of TurboQuant, as it is supposed to be effectively lossless compared to fp16 kv cache. That is not the case here.

It can't approach TurboQuant's performance because it's missing the QJL part of TurboQuant. (I think?)

@AesSedai
Contributor

I re-ran the suite against the updated commit; it looks like the KLD has generally improved a little across the board!

Heatmap results:

Details data table
| k-cache | v-cache | PPL | mean KLD |
|---|---|---|---|
| f16 | f16 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 |
| f16 | q4_0 | 8.205849 ± 0.055607 | 0.002518 ± 0.000094 |
| f16 | q4_1 | 8.206960 ± 0.055627 | 0.002308 ± 0.000088 |
| f16 | q5_0 | 8.197224 ± 0.055540 | 0.001275 ± 0.000052 |
| f16 | q5_1 | 8.198303 ± 0.055550 | 0.001126 ± 0.000043 |
| f16 | q8_0 | 8.196163 ± 0.055536 | 0.000838 ± 0.000058 |
| q4_0 | f16 | 8.202777 ± 0.055567 | 0.003475 ± 0.000182 |
| q4_0 | q4_0 | 8.208411 ± 0.055610 | 0.004782 ± 0.000194 |
| q4_0 | q4_1 | 8.210197 ± 0.055641 | 0.004592 ± 0.000184 |
| q4_0 | q5_0 | 8.199608 ± 0.055539 | 0.003880 ± 0.000205 |
| q4_0 | q5_1 | 8.200301 ± 0.055547 | 0.003644 ± 0.000190 |
| q4_0 | q8_0 | 8.199085 ± 0.055545 | 0.003346 ± 0.000183 |
| q4_1 | f16 | 8.202827 ± 0.055570 | 0.003155 ± 0.000182 |
| q4_1 | q4_0 | 8.214026 ± 0.055670 | 0.004670 ± 0.000214 |
| q4_1 | q4_1 | 8.213453 ± 0.055686 | 0.004174 ± 0.000173 |
| q4_1 | q5_0 | 8.200692 ± 0.055565 | 0.003446 ± 0.000189 |
| q4_1 | q5_1 | 8.199583 ± 0.055558 | 0.003457 ± 0.000206 |
| q4_1 | q8_0 | 8.200708 ± 0.055569 | 0.002958 ± 0.000168 |
| q5_0 | f16 | 8.199265 ± 0.055551 | 0.001553 ± 0.000099 |
| q5_0 | q4_0 | 8.207457 ± 0.055617 | 0.003247 ± 0.000161 |
| q5_0 | q4_1 | 8.208826 ± 0.055647 | 0.002947 ± 0.000144 |
| q5_0 | q5_0 | 8.200150 ± 0.055558 | 0.002054 ± 0.000158 |
| q5_0 | q5_1 | 8.199639 ± 0.055568 | 0.001954 ± 0.000120 |
| q5_0 | q8_0 | 8.197875 ± 0.055554 | 0.001521 ± 0.000093 |
| q5_1 | f16 | 8.200392 ± 0.055557 | 0.001585 ± 0.000137 |
| q5_1 | q4_0 | 8.210025 ± 0.055650 | 0.003052 ± 0.000151 |
| q5_1 | q4_1 | 8.207874 ± 0.055632 | 0.002765 ± 0.000128 |
| q5_1 | q5_0 | 8.200399 ± 0.055565 | 0.001849 ± 0.000126 |
| q5_1 | q5_1 | 8.199001 ± 0.055561 | 0.001839 ± 0.000146 |
| q5_1 | q8_0 | 8.199596 ± 0.055562 | 0.001544 ± 0.000125 |
| q8_0 | f16 | 8.196933 ± 0.055526 | 0.000837 ± 0.000047 |
| q8_0 | q4_0 | 8.205234 ± 0.055600 | 0.002568 ± 0.000141 |
| q8_0 | q4_1 | 8.206657 ± 0.055621 | 0.002230 ± 0.000084 |
| q8_0 | q5_0 | 8.197552 ± 0.055547 | 0.001229 ± 0.000046 |
| q8_0 | q5_1 | 8.197543 ± 0.055546 | 0.001300 ± 0.000117 |
| q8_0 | q8_0 | 8.196556 ± 0.055535 | 0.000799 ± 0.000042 |

Full run archives:
attn-rot-7711b3a36.zip

@Dampfinchen

> I re-ran the suite against the updated new commit, looks like the KLD has generally improved a little bit across the board! […]

What about higher context, like for example 100K? I believe that's where kv cache quantization actually harms the model from multiplications errors that accumulated.

@ggerganov
Member Author

@AesSedai Thanks for the results. It seems important to track the KLD rather than PPL (maybe more significant for Qwen3.5).

I did some additional tests with randomized Hadamard matrices (as suggested by @sashkboos in #6444 (comment)). Will follow up in next PRs.


auto * data = (float *) tensor->data;

data[0*n + 0] = 1.0 / sqrtf(n);
Member

This always gets me, at least this one makes esthetic sense. :)

@am17an
Contributor

am17an commented Mar 29, 2026

There is a slowdown, which is expected; however, we should probably have a flag to opt out?

GGML_CUDA=ON ./scripts/compare-commits.sh upstream/master gg/attn-rot llama-bench -ctk q4_0 -ctv q4_0 -m /opt/models/qwen_3-30b3a-q4_0.gguf -fa 1 -d 0,2048,4096,8192,16384,32768

| Model | Test | t/s 51a84ef | t/s gg/attn-rot | Speedup |
|---|---|---|---|---|
| qwen3moe 30B.A3B Q4_0 | pp512 | 7603.68 | 7409.35 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d2048 | 7085.71 | 6912.28 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d4096 | 6586.79 | 6449.26 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d8192 | 5844.15 | 5694.56 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d16384 | 4704.12 | 4612.89 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d32768 | 3398.99 | 3346.66 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 229.42 | 202.37 | 0.88 |
| qwen3moe 30B.A3B Q4_0 | tg128@d2048 | 210.24 | 188.03 | 0.89 |
| qwen3moe 30B.A3B Q4_0 | tg128@d4096 | 197.10 | 177.06 | 0.90 |
| qwen3moe 30B.A3B Q4_0 | tg128@d8192 | 175.60 | 159.45 | 0.91 |
| qwen3moe 30B.A3B Q4_0 | tg128@d16384 | 145.26 | 134.28 | 0.92 |
| qwen3moe 30B.A3B Q4_0 | tg128@d32768 | 101.33 | 94.13 | 0.93 |

@handpickencounter

handpickencounter commented Mar 29, 2026

> I don't know what is the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.). They could be important and can potentially improve further on top of this.

PolarQuant is just a rotation by a random matrix to spread the energy of any outliers across all dimensions prior to quantization. A normalized Hadamard matrix may or may not produce similar results.

QJL is extremely important, however: PolarQuant (and your Hadamard) gives biased attention scores (despite the good-seeming MSE), which results in rank flips. QJL uses the 'unreasonable effectiveness of random projections' to give unbiased results. The theory (more accurately, the lemma) is that randomly projecting a vector to lower dimensions preserves pairwise distances (and therefore dot products) in expectation. TurboQuant quantizes the error residual all the way to 1 bit (the sign), which somehow works well enough.
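The random-projection lemma mentioned above can be illustrated with a small standalone sketch (my own illustration, not the QJL or TurboQuant code, which additionally quantizes each projection down to its sign): for a random Gaussian vector g, E[(g.x)(g.y)] = <x, y>, so averaging over many projections gives an unbiased estimate of the dot product.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Estimate <x, y> by averaging (g.x)(g.y) over m random Gaussian
// projections g. Each term has expectation <x, y>, so the average is an
// unbiased estimator whose variance shrinks as 1/m.
float jl_dot_estimate(const std::vector<float> & x, const std::vector<float> & y,
                      int m, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    const size_t d = x.size();
    float acc = 0.0f;
    for (int r = 0; r < m; ++r) {
        float gx = 0.0f, gy = 0.0f;
        for (size_t i = 0; i < d; ++i) {
            const float g = gauss(rng);  // same projection applied to x and y
            gx += g * x[i];
            gy += g * y[i];
        }
        acc += gx * gy;
    }
    return acc / m;
}
```

The point of the unbiasedness is that ranking errors average out instead of systematically flipping, which is the failure mode attributed to biased schemes above.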

The best technical yet intuitive explanation I saw on the subject so far is by some AI Researcher from Amazon: https://darshanfofadiya.com/research-papers/turboquant/

@CISC
Member

CISC commented Mar 29, 2026

> There is a slowdown which is expected, however probably we should have a flag to opt out?

I don't think we should, the only reason to quantize kv-cache is to save memory, if that comes at the cost of speed, so be it, an option to reduce quality does not make sense (unless you think this is not a general quality improvement).

@am17an
Contributor

am17an commented Mar 29, 2026

My opinion is that any new breaking change should have an option to turn it off, in case of any unforeseen issues. This change alters the outputs of the LLM (even if they are materially better), so, as an example, someone with downstream tests would see them break.

> the only reason to quantize kv-cache is to save memory

I think when using -nkvo, it will be faster to transfer a quantized cache to the GPU when offloading, so it can also be used for performance reasons. Also, the CUDA fattn vec kernel uses dp4a for q8_0, which has 2x the throughput vs f16. So that statement is just wrong IMO.

@CISC
Member

CISC commented Mar 29, 2026

> My opinion is that any new breaking change should have an option to turn it off, in case of any unforeseen issues.

You could say we are in the business of making breaking changes and that git bisect is our option, but I digress. :)

If anything, I guess adding an env-var for some grace period would be sufficient.

> This change changes outputs of the LLM (even if they are materially better), so just as an example someone having tests downstream would see tests break.

Well, until you pointed out the below I would have said that no such test were likely to exist. :)

> I think when using -nkvo, it will be faster to transfer a quantized cache to the GPU when offloading, so it can be used for a performance reason also. Also the CUDA fattn vec kernel uses dp4a for q8_0 which is 2x the throughput vs f16. So that statement is just wrong IMO.

@Dampfinchen

> There is a slowdown which is expected, however probably we should have a flag to opt out?

> I don't think we should, the only reason to quantize kv-cache is to save memory, if that comes at the cost of speed, so be it, an option to reduce quality does not make sense (unless you think this is not a general quality improvement).

I disagree. You could use quantized KV Cache to save memory in order to put more layers on the GPU. In that case, the purpose of it would be to increase speed and this change would be counterproductive to that goal.

@CISC
Member

CISC commented Mar 29, 2026

> I disagree. You could use quantized KV Cache to save memory in order to put more layers on the GPU. In that case, the purpose of it would be to increase speed and this change would be counterproductive to that goal.

You'd most likely gain much more than you lose, so calling it counterproductive is perhaps a bit far fetched.

@am17an
Contributor

am17an commented Mar 29, 2026

> If anything, I guess adding an env-var for some grace period would be sufficient.

Actually, I'm not sure it would be worth it to be on by default for q8_0, since the benefits are much smaller compared to q4 and the performance loss is significant.

Details
| Model | Test | t/s 51a84ef | t/s gg/attn-rot | Speedup |
|---|---|---|---|---|
| qwen3moe 30B.A3B Q4_0 | pp512 | 7576.02 | 7372.24 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d2048 | 7045.27 | 6838.27 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d4096 | 6580.31 | 6435.85 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d8192 | 5792.41 | 5661.65 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d16384 | 4660.49 | 4584.87 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d32768 | 3349.27 | 3300.92 | 0.99 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 230.59 | 203.59 | 0.88 |
| qwen3moe 30B.A3B Q4_0 | tg128@d2048 | 212.31 | 189.41 | 0.89 |
| qwen3moe 30B.A3B Q4_0 | tg128@d4096 | 199.79 | 179.40 | 0.90 |
| qwen3moe 30B.A3B Q4_0 | tg128@d8192 | 179.48 | 162.80 | 0.91 |
| qwen3moe 30B.A3B Q4_0 | tg128@d16384 | 149.03 | 137.47 | 0.92 |
| qwen3moe 30B.A3B Q4_0 | tg128@d32768 | 105.50 | 98.55 | 0.93 |

@CISC
Member

CISC commented Mar 29, 2026

> If anything, I guess adding an env-var for some grace period would be sufficient.

> Actually I'm not sure it would be worth it to be on by default for q8_0 since the benefits are much lesser compared to q4 and the performance loss is significant

Granted, it's probably pointless for q8_0, so perhaps add a check.

@ggerganov
Member Author

ggerganov commented Mar 29, 2026

It seems that evaluating AIME25 could be another sensitivity test for confirming the improvement of rotating the activations. I did a few runs today with gpt-oss-20b on low reasoning (for faster evaluation):

Here is the table with the hyperlinks preserved for each score:

| eval | KV type | score (no rot) | score (rot) |
|---|---|---|---|
| AIME25 x8 | F16 | 37.9% | |
| AIME25 x8 | Q8_0 | 31.7% | 37.1% |
| AIME25 x8 | Q5_1 | 30.8% | 32.5% |
| AIME25 x8 | Q5_0 | 25.4% | 32.5% |
| AIME25 x8 | Q4_1 | 18.3% | 28.3% |
| AIME25 x8 | Q4_0 | 2.0% | 21.7% |

Used the new llama-eval script to perform these evals: #21152

It's interesting that with Q4_0 KV type without applying rotations, the model completely breaks down - it cannot solve even 1 problem from the 240 attempts. Note that it is still reasoning (i.e. it's not producing garbage outputs). The final answers and logic are just incorrect.

Example of reasoning trace with KV type `Q4_0` without rotations (screenshot omitted)

Regarding Q8_0 - I think there is some non-negligible quality benefit of applying the rotation here too (31.7% -> 37.1%). Though we need a bit more stats to confirm that. Atm, I am not convinced that disabling the rotations for Q8_0 is the better option, but I'm still considering it.

For reference, the expected score of this eval per the gpt-oss model card is 37.1%:


https://arxiv.org/pdf/2508.10925

@Dampfinchen

> It seems that evaluating AIME25 could be another sensitivity test for confirming the improvement of rotating the activations. […] Regarding Q8_0 - I think there is some non-negligible quality benefit of applying the rotation here too (31.7% -> 37.1%). […]

Yeah, based on the score there is a clear benefit from rotating activations for q8_0, and I think it makes sense to have them on by default: the purpose of Q8_0 is to save KV cache memory at the same quality, and with attn-rot it is noticeably closer to that goal. Furthermore, I'm not sure if AIME25 tests at long context as well; quality differences might become even more noticeable at longer context.

Performance is a concern, however. In the end, I think the best course of action would be to make attn-rot an opt-in/opt-out option with a simple flag.

@ggerganov ggerganov merged commit 744c0c7 into master Apr 1, 2026
46 of 47 checks passed
@ggerganov ggerganov deleted the gg/attn-rot branch April 1, 2026 13:58
timothyeburke added a commit to timlikesai/llama.cpp that referenced this pull request Apr 1, 2026
Add MXFP (Microscaling Floating Point) KV cache types for flash attention:
- MXFP4 (E2M1, 4-bit), MXFP6 (E2M3, 6-bit), MXFP8 (E4M3, 8-bit)
- SoA layout for flash attention: [qs_blocks | e8m0_scales] per head
- Element converters with round-to-nearest for E8M0 and mantissa
- FP6 pack/unpack (4 elements in 3 bytes)

Integrate with upstream's graph-level rotation infrastructure (ggml-org#21038):
- MXFP K types use 32x32 rotation matrix (block-aligned with MXFP groups)
- Matrix contains D*H: Davis-Jedwab zigzag sign diagonal (0x3C5A6600)
  composed with Walsh-Hadamard, generated at init
- ggml_mul_mat_aux reshapes to blocks of 32 -> applies D*H per MXFP group
- V rotation disabled for MXFP (hurts well-conditioned models)
- No rotation code in set_rows or FA kernel — graph handles it

CPU flash attention path:
- SoA set_rows: quantizes pre-rotated K/V to SoA MXFP format
- FA kernel: dequantizes SoA K/V to F32 for dot product/accumulation
- Q is already D*H-rotated by graph — used directly in F32
@nawoa

nawoa commented Apr 1, 2026

I'm excited to try this out but at the same time I'm slightly concerned that - unless I missed it, in which case I apologize - throughout this whole PR there were no tests done on long contexts and mainly just speed-related testing done on anything larger than 9B. I admit I'm relatively new to this field but shouldn't larger models (>70B) and longer contexts (>128K) be tested more before merging?

@Rotatingxenomorph

I'm excited to try this out but at the same time I'm slightly concerned that - unless I missed it, in which case I apologize - throughout this whole PR there were no tests done on long contexts and mainly just speed-related testing done on anything larger than 9B. I admit I'm relatively new to this field but shouldn't larger models (>70B) and longer contexts (>128K) be tested more before merging?

I did mention this but I got put in the naughty corner.

nisten added a commit to nisten/prism-ml-biturbo that referenced this pull request Apr 1, 2026
Applies Walsh-Hadamard rotation to Q, K, V activations before KV cache
quantization, dramatically improving quality for all quantized KV types.

Ported from ggml-org/llama.cpp#21038 (merged 2026-04-01) with fixes for
PrismML's API differences (n_embd_head_k as member vs function).

Key changes:
- Pre-computed Hadamard matrices stored in llama_kv_cache
- Q/K rotated with largest power-of-2 matrix dividing head_dim
- V rotated with fixed 64x64 matrix (empirically better)
- Attention output inverse-rotated after computation
- RoPE shift gets rotation sandwich (inverse before, forward after)
- LLAMA_ATTN_ROT_DISABLE=1 env var to opt out
- Auto-enabled when KV type is quantized and head_dim % 64 == 0
- TBQ4_0 also gets rotation (double rotation with internal FWHT)

Results with q4_0 KV: 48.6 t/s gen (vs 34.4 t/s TBQ4_0), coherent output.
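
The block-wise rotation scheme described in this commit message can be sketched as follows (hypothetical helpers mirroring the description, not the actual llama.cpp/PrismML API). Because the normalized Sylvester Hadamard is symmetric as well as orthonormal, the same matrix serves as its own inverse:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Normalized Walsh-Hadamard matrix: orthonormal and symmetric, hence self-inverse
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_head(x: np.ndarray) -> np.ndarray:
    # Rotate block-wise with the largest power of two dividing head_dim
    b = len(x) & -len(x)  # largest power-of-2 divisor, e.g. 96 -> 32
    return (x.reshape(-1, b) @ hadamard(b)).reshape(-1)

x = np.random.default_rng(1).standard_normal(96)  # head_dim = 96 uses 32x32 blocks
y = rotate_head(x)
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))  # rotation preserves norms
assert np.allclose(rotate_head(y), x)  # same op inverts: H is an involution
```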
@EAddario
Contributor

EAddario commented Apr 1, 2026

YOLO

@pwilkin
Member

pwilkin commented Apr 1, 2026

@nawoa FWIW I just ran a quick (by quick I mean it took an hour) test of 7 cases from InfiniteBench on Qwen3.5 9B with Q8 quants and got 7/7 (InfiniteBench tests needle-in-a-haystack with contexts over 128k).

@nawoa

nawoa commented Apr 1, 2026

If someone wants to give me a command line with whatever arguments necessary, I can run a more comprehensive test overnight on my RTX 6000 Pro 96GB with the model of your choice. I'll be going to bed in 3-4 hours.

@pwilkin
Member

pwilkin commented Apr 1, 2026

@nawoa all's there:

https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/infinite_bench/

To use a local model, you have to:
-> run the server, preferably with a model alias, such as -a local
-> set the environment variables LOCAL_API_KEY and LOCAL_BASE_URL
-> run with --model openai-api/local/local (or whatever alias you used)

The way inspect-ai works is that it maps whatever you put as the second part of the model identifier to the uppercase prefix before _BASE_URL and _API_KEY, so you can use openai-api/foobar/local with FOOBAR_BASE_URL for example.
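
A concrete setup following these steps might look like the sketch below (host, port, and model path are assumptions; the task name is the one used later in this thread):

```shell
# 1) Serve the model under an alias (run this in a separate terminal):
#    llama-server -m model.gguf -a local --port 8080
# 2) Point inspect-ai's generic OpenAI-compatible provider at the server:
export LOCAL_BASE_URL="http://localhost:8080/v1"
export LOCAL_API_KEY="none"
# 3) Run the eval against the aliased model:
#    inspect eval inspect_evals/infinite_bench_longbook_choice_eng --model openai-api/local/local
echo "$LOCAL_BASE_URL"
```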

@JeroenAdam

Not showing up in the releases because of a failing test in CI: https://github.com/ggml-org/llama.cpp/actions/runs/23837492532/job/69484676982

@CISC
Member

CISC commented Apr 2, 2026

Not showing up in the releases because of a failing test in CI: https://github.com/ggml-org/llama.cpp/actions/runs/23837492532/job/69484676982

Get b8624 or later instead; there was an intermittent failure for a few releases.

@erazortt

erazortt commented Apr 2, 2026

OK, so I tried @pwilkin's suggested benchmark. I ran the first 100 tests of infinite_bench_longbook_choice_eng using Qwen3.5 35B in instruct mode on a Blackwell 5000 (48 GB VRAM). Unfortunately, it turns out this test is not sensitive enough to cache quantization: all quants yielded the same results (within the error bars).

@pwilkin
Member

pwilkin commented Apr 2, 2026

@erazortt you can parametrize the test to make it harder, for example, add more needles.

Wowfunhappy added a commit to Wowfunhappy/llama.cpp that referenced this pull request Apr 3, 2026
Merges ggml-org/llama.cpp upstream (d23355a..7992aa7) including:
- Gemma 4 model support (PR ggml-org#21309)
- KV cache rotation for better quantization (ggml-org#21038)
- Auto GPU memory fitting (llama_params_fit)
- Many new model architectures (Qwen3.5, Kimi K2, LFM2, etc.)

C++14/CUDA 7.5 compatibility fixes applied to merged code:
- Replaced if constexpr with runtime if across CUDA files
- Replaced constexpr __device__ functions with macros
- Replaced structured bindings with .first/.second access
- Replaced std::string_view/std::optional with std::string
- Template specializations for ggml_cuda_cast (convert.cuh)
- BF16 flash attention guarded behind CUDART_VERSION >= 11000
- Eager CUDA context init restored for accurate VRAM on non-VMM GPUs
- Jinja C++17 structured bindings fixed (caused Qwen 3.5 segfault)

Build system updates:
- Added hf-cache-stub.cpp, server-tools-stub.cpp for C++14 compat
- Added mtmd-image.cpp, httplib.cpp to build
- convert_hf_to_gguf.py patched for PyTorch 1.13 compatibility
- gguf vocab.py fallback for old tokenizers library

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mihai-chiorean added a commit to mihai-chiorean/turbo3-cuda that referenced this pull request Apr 4, 2026
…rg#21038)

Upstream master now applies Walsh-Hadamard rotation to K/V/Q before KV
cache storage (commit 744c0c7). This is the same rotation TBQ was
doing independently, causing double rotation after the merge.

TBQ types are now pure codebook quantizers:
- SET_ROWS: normalize + codebook quantize + pack (no FWHT)
- FA dequant: codebook lookup + scale (no Q pre-rotation, no V inverse rotation)
- Standalone dequant: codebook lookup + scale (no inverse FWHT)

Removes ~200 lines of rotation code from CUDA and CPU paths.
Fixes garbage output caused by double WHT rotation after upstream merge.
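
The double-rotation failure mode described above can be reproduced in a few lines (illustrative sketch, not the actual CUDA path; the real code also quantizes between store and load and uses D*H rather than a plain Hadamard). The point is that the load path undoes exactly one rotation, so rotating twice on store leaves the output still rotated:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Normalized Walsh-Hadamard matrix: orthonormal and symmetric (self-inverse)
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = hadamard(64)
v = np.random.default_rng(2).standard_normal(64)

stored_ok = H @ v         # graph rotates once before the cache (correct)
stored_bug = H @ (H @ v)  # graph rotation + in-kernel FWHT: rotated twice

# The dequant path undoes exactly one rotation:
assert np.allclose(H @ stored_ok, v)       # single rotation round-trips
assert not np.allclose(H @ stored_bug, v)  # double rotation returns H @ v, not v
```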
didlawowo added a commit to didlawowo/llama.cpp that referenced this pull request Apr 4, 2026
- Register GGML_TYPE_TURBO3_0 and GGML_TYPE_TURBO4_0 in kv_cache_types
  so --cache-type-k turbo3 / --cache-type-v turbo3 are recognized
- Fix double V un-rotation: upstream PR ggml-org#21038 (attn_rot_v) already
  handles Hadamard rotation for quantized KV cache types including
  turbo. Make TurboQuant WHT fallback only when upstream rotation
  is not active (else if instead of sequential if blocks)