llama : rotate activations for better quantization #21038
Conversation
|
This PR does seem to improve PPL compared to the tip of master. I ran everything with the CPU-only backend so that I could also compare https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0 from @Aaryan-Kapoor and see how that specific implementation fares. For reference:
Data Results: mainline llama.cpp PPL on wiki.test.raw
👈 Details
Experiment: measure perplexity against wiki.test.raw while varying KV-cache quantization. Test Quant / Test Rig:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_VULKAN=OFF
cmake --build build --config Release -j $(nproc) |
|
I've run a pretty big permutation matrix for the Qwen3.5-9B model against the master branch and this PR branch. The following images are the ratios vs F16/F16. Here's the table of data comparing master vs this PR:
And attached here is the full raw run data for every invocation: |
Interesting results, which indicate that Q4_0, even before this PR, is superior to TQ3_0. However, as a word of caution, this is likely due to the very experimental and early implementation in that specific fork, and not indicative of the actual performance of TurboQuant, which is supposed to be effectively lossless compared to an fp16 KV cache. That is not the case here. |
|
I think we can actually rotate the V tensor using smaller matrices (64 x 64), which should result in better quality of the V cache. We cannot do that for Q and K because it would not preserve the dot product. Pushed a change to do that in 832e326, and just from a quick PPL sanity check it looks slightly better. Note: using 64 instead of 32 because of a requirement of the Metal matrix multiplication kernels. Edit:
On second thought, I think it does preserve it. Something to try too. Here is the patch (more rotations for Q and K):
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 8dfc92b71..84bcf26be 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -2096,8 +2096,9 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
const bool can_rot =
!hparams.is_n_embd_k_gqa_variable() &&
!hparams.is_n_embd_v_gqa_variable() &&
- ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+ //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
//ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+ hparams.n_embd_head_k() % 64 == 0 &&
hparams.n_embd_head_v() % 64 == 0 &&
hparams.n_embd_head_k() >= 64 &&
hparams.n_embd_head_v() >= 64 &&
@@ -2105,7 +2106,7 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
ggml_is_quantized(mctx_cur->type_v());
if (can_rot) {
- const auto nk = hparams.n_embd_head_k();
+ const auto nk = 64;
const auto nv = 64;
inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
@@ -2453,8 +2454,9 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
const bool can_rot =
!hparams.is_n_embd_k_gqa_variable() &&
!hparams.is_n_embd_v_gqa_variable() &&
- ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+ //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
//ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+ hparams.n_embd_head_k() % 64 == 0 &&
hparams.n_embd_head_v() % 64 == 0 &&
hparams.n_embd_head_k() >= 64 &&
hparams.n_embd_head_v() >= 64 &&
@@ -2462,7 +2464,7 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
ggml_is_quantized(mctx_cur->get_base()->type_v());
if (can_rot) {
- const auto nk = hparams.n_embd_head_k();
+ const auto nk = 64;
const auto nv = 64;
    inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
|
It can't approach turboquant's performance because it's missing the QJL part of turboquant. (I think?) |
|
I re-ran the suite against the updated commit; looks like the KLD has generally improved a little bit across the board! Heatmap results:
Details data table
Full run archives: |
What about higher context, for example 100K? I believe that's where KV cache quantization actually harms the model, due to accumulated multiplication errors. |
|
@AesSedai Thanks for the results. It seems important to track the KLD rather than PPL (maybe more significant for Qwen3.5). I did some additional tests with randomized Hadamard matrices (as suggested by @sashkboos in #6444 (comment)). Will follow-up in next PRs. |
|
|
auto * data = (float *) tensor->data;

data[0*n + 0] = 1.0 / sqrtf(n);
This always gets me; at least this one makes aesthetic sense. :)
|
There is a slowdown, which is expected; however, should we perhaps have a flag to opt out?
|
PolarQuant is just a rotation by a random matrix to spread the energy of any outliers across all dimensions prior to quantization. A normalized Hadamard matrix may or may not produce similar results. QJL is extremely important, however: PolarQuant (and your Hadamard) give biased attention scores (despite the good-seeming MSE), which results in rank flips. QJL uses the "unreasonable effectiveness of random projections" to give unbiased results. The theory (more accurately, the lemma) is that randomly projecting a vector to lower dimensions preserves pairwise distances (and therefore dot products) in expectation. TurboQuant quantizes the error residual all the way to 1-bit (the sign), which somehow works well enough. The best technical yet intuitive explanation I've seen on the subject so far is by an AI researcher from Amazon: https://darshanfofadiya.com/research-papers/turboquant/ |
I don't think we should; the only reason to quantize the KV cache is to save memory, and if that comes at the cost of speed, so be it. An option to reduce quality does not make sense (unless you think this is not a general quality improvement). |
|
My opinion is that any new breaking change should have an option to turn it off, in case of unforeseen issues. This change alters the outputs of the LLM (even if they are materially better), so, just as an example, someone with downstream tests would see those tests break.
I think when using |
You could say we are in the business of making breaking changes. If anything, I guess adding an env var for some grace period would be sufficient.
Well, until you pointed out the below, I would have said that no such tests were likely to exist. :)
|
I disagree. You could use a quantized KV cache to save memory in order to put more layers on the GPU. In that case, its purpose would be to increase speed, and this change would be counterproductive to that goal. |
You'd most likely gain much more than you lose, so calling it counterproductive is perhaps a bit far-fetched. |
Actually, I'm not sure it would be worth having on by default for q8_0, since the benefits are much smaller compared to q4 and the performance loss is significant. Details
|
Granted, it's probably pointless for q8_0, so perhaps add a check. |
|
It seems that evaluating AIME25 could be another sensitivity test for confirming the improvement from rotating the activations. I did a few runs today. Here is the table, with the hyperlinks preserved for each score:
Used the new […]. It's interesting that with […]. Regarding […]. For reference, the expected score of this eval per the […].
|
Yeah, based on the score there is a clear benefit from rotating activations for q8_0, and I think it makes sense to have it on by default: the purpose of q8_0 is to save KV cache memory at the same quality, and with attn-rot it is noticeably closer to that goal. Furthermore, I'm not sure whether AIME25 tests at long context as well; quality differences might become even more noticeable at longer context. Performance is a concern, however. In the end, I think the best course of action would be to make attn-rot an opt-in/opt-out option with a simple flag. |
Add MXFP (Microscaling Floating Point) KV cache types for flash attention:
- MXFP4 (E2M1, 4-bit), MXFP6 (E2M3, 6-bit), MXFP8 (E4M3, 8-bit)
- SoA layout for flash attention: [qs_blocks | e8m0_scales] per head
- Element converters with round-to-nearest for E8M0 and mantissa
- FP6 pack/unpack (4 elements in 3 bytes)

Integrate with upstream's graph-level rotation infrastructure (ggml-org#21038):
- MXFP K types use 32x32 rotation matrix (block-aligned with MXFP groups)
- Matrix contains D*H: Davis-Jedwab zigzag sign diagonal (0x3C5A6600) composed with Walsh-Hadamard, generated at init
- ggml_mul_mat_aux reshapes to blocks of 32 -> applies D*H per MXFP group
- V rotation disabled for MXFP (hurts well-conditioned models)
- No rotation code in set_rows or FA kernel — graph handles it

CPU flash attention path:
- SoA set_rows: quantizes pre-rotated K/V to SoA MXFP format
- FA kernel: dequantizes SoA K/V to F32 for dot product/accumulation
- Q is already D*H-rotated by graph — used directly in F32
|
I'm excited to try this out, but at the same time I'm slightly concerned that - unless I missed it, in which case I apologize - throughout this whole PR there were no tests done on long contexts, and mainly just speed-related testing on anything larger than 9B. I admit I'm relatively new to this field, but shouldn't larger models (>70B) and longer contexts (>128K) be tested more before merging? |
I did mention this but I got put in the naughty corner. |
Applies Walsh-Hadamard rotation to Q, K, V activations before KV cache quantization, dramatically improving quality for all quantized KV types. Ported from ggml-org/llama.cpp#21038 (merged 2026-04-01) with fixes for PrismML's API differences (n_embd_head_k as member vs function).

Key changes:
- Pre-computed Hadamard matrices stored in llama_kv_cache
- Q/K rotated with largest power-of-2 matrix dividing head_dim
- V rotated with fixed 64x64 matrix (empirically better)
- Attention output inverse-rotated after computation
- RoPE shift gets rotation sandwich (inverse before, forward after)
- LLAMA_ATTN_ROT_DISABLE=1 env var to opt out
- Auto-enabled when KV type is quantized and head_dim % 64 == 0
- TBQ4_0 also gets rotation (double rotation with internal FWHT)

Results with q4_0 KV: 48.6 t/s gen (vs 34.4 t/s TBQ4_0), coherent output.
|
YOLO |
|
@nawoa FWIW, I just ran a quick (by quick I mean it took an hour) 7 cases from InfiniteBench on Qwen3.5 9B with Q8 quants and got 7/7 (InfiniteBench tests needle-in-a-haystack with contexts over 128k). |
|
If someone wants to give me a command line with whatever arguments necessary, I can run a more comprehensive test overnight on my RTX 6000 Pro 96GB with the model of your choice. I'll be going to bed in 3-4 hours. |
|
@nawoa it's all there: https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/infinite_bench/ To use a local model, you have to: […] |
|
Not showing up in the releases because of a failing test in CI: https://github.com/ggml-org/llama.cpp/actions/runs/23837492532/job/69484676982 |
Get b8624 or later instead, there was an intermittent failure for a few releases. |
|
OK, so I tried @pwilkin's suggested benchmark. I ran the first 100 tests of infinite_bench_longbook_choice_eng using Qwen3.5 35B in instruct mode on a Blackwell 5000 (48GB VRAM). Unfortunately, it turns out this test is not sensitive enough to cache quantization: all quants yielded the same results (inside the error bars). |
|
@erazortt you can parametrize the test to make it harder, for example, add more needles. |
Merges ggml-org/llama.cpp upstream (d23355a..7992aa7) including:
- Gemma 4 model support (PR ggml-org#21309)
- KV cache rotation for better quantization (ggml-org#21038)
- Auto GPU memory fitting (llama_params_fit)
- Many new model architectures (Qwen3.5, Kimi K2, LFM2, etc.)

C++14/CUDA 7.5 compatibility fixes applied to merged code:
- Replaced if constexpr with runtime if across CUDA files
- Replaced constexpr __device__ functions with macros
- Replaced structured bindings with .first/.second access
- Replaced std::string_view/std::optional with std::string
- Template specializations for ggml_cuda_cast (convert.cuh)
- BF16 flash attention guarded behind CUDART_VERSION >= 11000
- Eager CUDA context init restored for accurate VRAM on non-VMM GPUs
- Jinja C++17 structured bindings fixed (caused Qwen 3.5 segfault)

Build system updates:
- Added hf-cache-stub.cpp, server-tools-stub.cpp for C++14 compat
- Added mtmd-image.cpp, httplib.cpp to build
- convert_hf_to_gguf.py patched for PyTorch 1.13 compatibility
- gguf vocab.py fallback for old tokenizers library

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rg#21038) Upstream master now applies Walsh-Hadamard rotation to K/V/Q before KV cache storage (commit 744c0c7). This is the same rotation TBQ was doing independently, causing double rotation after the merge.

TBQ types are now pure codebook quantizers:
- SET_ROWS: normalize + codebook quantize + pack (no FWHT)
- FA dequant: codebook lookup + scale (no Q pre-rotation, no V inverse rotation)
- Standalone dequant: codebook lookup + scale (no inverse FWHT)

Removes ~200 lines of rotation code from CUDA and CPU paths. Fixes garbage output caused by double WHT rotation after upstream merge.
- Register GGML_TYPE_TURBO3_0 and GGML_TYPE_TURBO4_0 in kv_cache_types so --cache-type-k turbo3 / --cache-type-v turbo3 are recognized
- Fix double V un-rotation: upstream PR ggml-org#21038 (attn_rot_v) already handles Hadamard rotation for quantized KV cache types including turbo. Make TurboQuant WHT fallback only when upstream rotation is not active (else if instead of sequential if blocks)












Overview
In anticipation of the incoming flood of vibe-generated PRs implementing TurboQuant, I'm raising the baseline a bit with a very simple interpretation of the idea of using a Hadamard transform to reduce outliers in the attention and improve the quantization quality:
This works because rotating by an orthonormal matrix spreads the energy of outliers across all dimensions before quantization, so the per-block scales are no longer dominated by a few large values, and the inverse rotation after the attention recovers the original result.
The implementation is very simple and backend-agnostic: use a Hadamard matrix of size n x n, normalized by 1/sqrt(n) so that it is orthonormal and can be used both for the forward and the backward rotation. Technically any rotation matrix (and its inverse) should work; I just think this is what is commonly used due to its simplicity. The implementation does not introduce new types and is compatible with all existing quantizations. It adds 4 matrix multiplication operators in the attention, though I think some of them can be fused into the attention weights (similar to QuaRot).

I don't know what the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.) is. They could be important and could potentially improve further on top of this. In any case, having a better baseline at almost zero cost won't hurt. Based only on the PPL data below, I think this should never be worse than before, though it needs a bit more evaluation.
Note: MLA is not supported
Additional information
Here are some PPL results before and after using base models. Ideally, there should be KLD data too, but leaving it for people to play with it and see if it looks good.
Model: Qwen3 0.6B BF16
https://huggingface.co/Qwen/Qwen3-0.6B-Base
Baseline F16 cache: PPL = 13.6711 +/- 0.21422
Model: Qwen3 8B BF16
https://huggingface.co/Qwen/Qwen3-8B-Base
Baseline F16 cache: PPL = 7.3203 +/- 0.09901
Model: Gemma3 4B Q8_0
https://huggingface.co/google/gemma-3-4b-pt
Baseline F16 cache: PPL = 7.6905 +/- 0.10483
Model: Qwen3.5 4B F16
https://huggingface.co/Qwen/Qwen3.5-4B-Base
Baseline F16 cache: PPL = 8.3266 +/- 0.11623
TODOs
(LLAMA_ATTN_ROT_DISABLE)
Next PRs
Requirements