Conversation
I remember you mentioned that the Hadamard transform can achieve more precise results. Why is it not enabled by default? What are the trade-offs?
It's slightly slower at inference. |
The main reason it is not on by default is that the code that makes all the necessary checks has not been written. The attention head size must be a power of two, the cache must be quantized, the model cannot use MLA attention, etc. My thinking is that a user who turns on this feature on purpose has understood the use cases and usage constraints (although, judging by some of the issues being posted, the sad reality is that the average user just copy-pastes commands they have found around the Internet). Also, turning on Hadamard transforms comes with a small performance penalty, so it may not be what users obsessed with performance want. The good news is that now, with all the TurboQuant hype, many people will learn that one can apply a rotation to the KV cache data, so perhaps usage of this feature will become more common and better informed.
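The rotation being discussed can be sketched in a few lines. This is a hypothetical illustration, not the ik_llama.cpp implementation: a normalized fast Walsh-Hadamard transform applied in place to one attention head's values. The transform is orthogonal, so it preserves the vector's norm (and the attention dot products) while spreading any outlier channel's energy across all dimensions, which is what makes the subsequent per-block quantization less lossy. The power-of-two head-size constraint mentioned above falls directly out of the butterfly structure.

```python
import math

def fwht(v):
    """Normalized fast Walsh-Hadamard transform; len(v) must be a power of 2.

    The normalized transform is its own inverse, so applying it twice
    recovers the original vector.
    """
    n = len(v)
    assert n & (n - 1) == 0, "head size must be a power of two"
    h = 1
    while h < n:
        # Standard butterfly stage: combine elements h apart.
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]

# Toy "head" with one outlier channel, as often seen in K/V activations.
head = [10.0, -0.1, 0.2, 0.05, -0.3, 0.1, 0.0, 0.2]
rotated = fwht(head[:])
# Same norm, but the outlier's energy is now spread over all components,
# so a simple scale-based quantizer wastes less of its range on it.
print(max(abs(x) for x in head), max(abs(x) for x in rotated))
```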
Well, yes, it needs 2 extra operations. That's why it is a separate command-line argument, so that the better-informed user has the option not to use it, given the small performance penalty and very minor quality benefit.
For those big dense models, the penalty has basically been nothing, so I just turn it on, even with a Q8 cache. Oh, and someone is trying to use TurboQuant to quantize entire models. Think there is any merit to it?
It will be just like any other hype. In a week or two nobody will remember it, and everybody will be chasing the next hype wave.
No real performance loss compared to normal Q8, but better PPL than f16. With a RAM speed bottleneck, at least. PPL at 65536 ctx:
With KHAD + VHAD:
No KHAD, no VHAD:
Seems not that much of an improvement?
Well, that's just PPL. When it's really high you can use it to tell that something is wrong, and when it's lower in a certain configuration you can guesstimate that that configuration is probably better for your use case. But how much better isn't something you can estimate from a PPL delta. At the end of the day you're predicting Wikipedia text, without even an instruct format, which has nothing to do with 99% of use cases. And it's probably far more bottlenecked by my Q5 MoE weights than by the KV cache precision. That 0.002 might still make or break some use cases.
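For context on why such small deltas are hard to interpret: perplexity is just the exponential of the mean per-token negative log-likelihood over the evaluation text. A tiny sketch (names and numbers are illustrative, not from any actual run) shows how small per-token probability shifts compress into a small PPL delta:

```python
import math

def perplexity(token_probs):
    """token_probs: the probability the model assigned to each correct
    next token. PPL = exp(mean negative log-likelihood)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Two hypothetical cache configurations, differing slightly per token.
base = [0.25, 0.40, 0.10, 0.55]
slightly_worse = [0.249, 0.398, 0.0995, 0.548]
print(perplexity(base), perplexity(slightly_worse))
```

A uniform coin-flip predictor over two tokens gives PPL exactly 2, which is a handy sanity check for any implementation.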
I wonder if the KLD from F16 can also be calculated. That way you don't need the KLD of the full model.
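The idea above can be sketched as follows. This is hypothetical code, not an existing tool: given per-position logits from an f16-cache reference run and from a quantized-cache run over the same text, compute the mean KL divergence of the quantized distributions from the reference ones, with no need for a higher-precision model.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kld(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy per-position logits; real runs would dump these per token.
f16_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]
quant_logits = [[1.9, 1.05, 0.1], [0.5, 2.4, 0.05]]
mean_kld = sum(kld(softmax(a), softmax(b))
               for a, b in zip(f16_logits, quant_logits)) / len(f16_logits)
print(mean_kld)
```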
This PR adds the ability to also use Hadamard transforms for the V cache. As with the K-cache, the V-head size must be 64, 128, or 256 (so it is not applicable to MLA models such as Kimi or GLM-5).
Unlike the K-cache, it brings very modest accuracy gains as measured by PPL. Hence, comparisons are left to the curious user.
But, as mainline is "raising the bar" 3.5 months after ik_llama.cpp (see #1033, #1034), I decided to add the ability to use Hadamard-transformed V-cache to avoid any potential claims that mainline's approach is somehow better.
To use a Hadamard-transformed V-cache, add -vhad (or --v-cache-hadamard) to the command line. The V cache must be quantized for this to make sense, which requires flash attention to be turned on.
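A plausible full invocation might look like the following. This is a sketch, not a verified command: only -vhad / --v-cache-hadamard is confirmed by this PR; the binary name and the other flags (-fa for flash attention, -ctk/-ctv for quantized cache types, in the usual llama.cpp style) are assumptions and should be checked against your build's --help output.

```
# Hypothetical: quantized K/V cache with flash attention, plus the
# Hadamard-transformed V-cache added by this PR.
./llama-server -m model.gguf -fa -ctk q8_0 -ctv q8_0 -vhad
```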