
V-cache Hadamard transform#1527

Merged
ikawrakow merged 1 commit into main from ik/v_cache_hadamard
Mar 28, 2026

Conversation

@ikawrakow
Owner

This PR adds the ability to also use Hadamard transforms for the V cache. As with the K-cache, V-head size must be 64, 128 or 256 (so not applicable to MLA models such as Kimi or GLM-5).

Unlike the K-cache, the V-cache shows only very modest accuracy gains as measured by PPL, so comparisons are left to the curious user.

But, as mainline is "raising the bar" 3.5 months after ik_llama.cpp (see #1033, #1034), I decided to add the ability to use Hadamard-transformed V-cache to avoid any potential claims that mainline's approach is somehow better.

To use Hadamard-transformed V-cache, add -vhad | --v-cache-hadamard to the command line. The V cache must be quantized for that to make sense, which requires flash attention to be turned on.
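For readers unfamiliar with the transform: the head-size constraint (64, 128 or 256) comes from the butterfly structure of the fast Walsh–Hadamard transform, which needs a power-of-two length. A minimal NumPy sketch of the orthonormal transform, for illustration only (this is not the kernel used in this repository):

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = len(x)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:  # classic butterfly: O(n log n) instead of an n x n matmul
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # 1/sqrt(n) scaling makes the transform its own inverse
```

Because the transform is orthogonal it preserves norms and dot products, so in exact arithmetic the attention result is unchanged; the transform only matters once the cache is quantized in the rotated basis.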

@hksdpc255
Contributor

I remember you mentioned that the Hadamard transform can achieve more precise results. Why is it not enabled by default? What are the trade-offs?

@Ph0rk0z

Ph0rk0z commented Mar 27, 2026

It's slightly slower at inference.

@ikawrakow
Owner Author

I remember you mentioned that the Hadamard transform can achieve more precise results. Why is it not enabled by default? What are the trade-offs?

The main reason it is not on by default is that the code making all necessary checks has not been written. The attention head size must be a power of two, the cache must be quantized, the model cannot use MLA attention, etc. My thinking is that a user who deliberately turns on this feature has understood the use cases and usage constraints (although, judging by some of the issues being posted, the sad reality is that the average user just copy-pastes commands they have found around the Internet).

Also, turning on Hadamard transforms comes with a small performance penalty, so that may not be what the users obsessed with performance would want.

The good news is that now, with all the TurboQuant hype, many people will learn that one can apply a rotation to the KV cache data, so perhaps usage of this feature will become more common and better informed.

@ikawrakow
Owner Author

It's slightly slower at inference.

Well, yes, it needs two extra operations. That's why it is a separate command-line argument, so that the better-informed user has the option not to use it, given the small performance penalty and the very minor quality benefit.

@Ph0rk0z

Ph0rk0z commented Mar 27, 2026

For those big dense models, it has basically been nothing so I just turn it on, even with Q8 cache. Oh and someone is trying to use TurboQuant to quantize entire models. Think there is any merit to it?

@ikawrakow
Owner Author

For those big dense models, it has basically been nothing so I just turn it on, even with Q8 cache. Oh and someone is trying to use TurboQuant to quantize entire models. Think there is any merit to it?

It will be just like any other hype. In a week or two nobody will remember it, and everybody will be chasing the next hype wave.

@MrHills-rs

No real performance loss compared to normal Q8, but better PPL than F16. With a RAM-speed bottleneck, at least.

PPL at 65536 ctx
Q8:
[1]2.2262, [2]3.0139, [3]3.0594, [4]3.3343,
Final estimate: PPL over 4 chunks for n_ctx=65536 = 3.3343 +/- 0.01853
F16:
[1]2.2259, [2]3.0124, [3]3.0582, [4]3.3334,
Final estimate: PPL over 4 chunks for n_ctx=65536 = 3.3334 +/- 0.01852
Q8 + khad + vhad:
[1]2.2251, [2]3.0108, [3]3.0570, [4]3.3322,
Final estimate: PPL over 4 chunks for n_ctx=65536 = 3.3322 +/- 0.01851

  • n_kv_max = 65536
  • n_batch = 4096
  • n_ubatch = 4096
  • flash_attn = 1
  • n_gpu_layers = 95
  • n_threads = 7
  • n_threads_batch = 8

With KHAD + VHAD

| PP | TG | N_KV | PP time (s) | PP speed (t/s) | TG time (s) | TG speed (t/s) |
|---:|---:|-----:|------------:|---------------:|------------:|---------------:|
| 4096 | 128 | 0 | 6.139 | 667.25 | 11.323 | 11.30 |
| 4096 | 128 | 4096 | 6.139 | 667.18 | 11.394 | 11.23 |
| 4096 | 128 | 8192 | 6.204 | 660.18 | 11.398 | 11.23 |
| 4096 | 128 | 12288 | 6.279 | 652.36 | 11.418 | 11.21 |
| 4096 | 128 | 16384 | 6.338 | 646.24 | 11.424 | 11.20 |
| 4096 | 128 | 20480 | 6.324 | 647.71 | 11.434 | 11.19 |
| 4096 | 128 | 24576 | 6.381 | 641.88 | 11.435 | 11.19 |
| 4096 | 128 | 28672 | 6.396 | 640.37 | 11.452 | 11.18 |
| 4096 | 128 | 32768 | 6.484 | 631.75 | 11.517 | 11.11 |
| 4096 | 128 | 36864 | 6.525 | 627.74 | 11.554 | 11.08 |
| 4096 | 128 | 40960 | 6.534 | 626.89 | 11.567 | 11.07 |
| 4096 | 128 | 45056 | 6.614 | 619.29 | 11.606 | 11.03 |
| 4096 | 128 | 49152 | 6.655 | 615.52 | 11.654 | 10.98 |
| 4096 | 128 | 53248 | 6.699 | 611.39 | 11.684 | 10.95 |
| 4096 | 128 | 57344 | 6.732 | 608.39 | 11.718 | 10.92 |
| 4096 | 128 | 61440 | 6.804 | 601.99 | 11.738 | 10.91 |

No khad, no vhad

| PP | TG | N_KV | PP time (s) | PP speed (t/s) | TG time (s) | TG speed (t/s) |
|---:|---:|-----:|------------:|---------------:|------------:|---------------:|
| 4096 | 128 | 0 | 6.126 | 668.60 | 11.356 | 11.27 |
| 4096 | 128 | 4096 | 6.133 | 667.89 | 11.365 | 11.26 |
| 4096 | 128 | 8192 | 6.195 | 661.17 | 11.378 | 11.25 |
| 4096 | 128 | 12288 | 6.269 | 653.33 | 11.401 | 11.23 |
| 4096 | 128 | 16384 | 6.328 | 647.32 | 11.436 | 11.19 |
| 4096 | 128 | 20480 | 6.314 | 648.74 | 11.462 | 11.17 |
| 4096 | 128 | 24576 | 6.370 | 643.00 | 11.429 | 11.20 |
| 4096 | 128 | 28672 | 6.390 | 641.02 | 11.507 | 11.12 |
| 4096 | 128 | 32768 | 6.469 | 633.22 | 11.522 | 11.11 |
| 4096 | 128 | 36864 | 6.510 | 629.17 | 11.546 | 11.09 |
| 4096 | 128 | 40960 | 6.534 | 626.88 | 11.548 | 11.08 |
| 4096 | 128 | 45056 | 6.595 | 621.12 | 11.617 | 11.02 |
| 4096 | 128 | 49152 | 6.641 | 616.75 | 11.625 | 11.01 |
| 4096 | 128 | 53248 | 6.685 | 612.71 | 11.659 | 10.98 |
| 4096 | 128 | 57344 | 6.728 | 608.77 | 11.730 | 10.91 |
| 4096 | 128 | 61440 | 6.797 | 602.61 | 11.794 | 10.85 |

@tomByrer

With KHAD + VHAD

Seems not that much of an improvement?
Either way, thanks for the charts!

@ikawrakow
Owner Author

With KHAD + VHAD

Seems not that much of an improvement? Either way, thanks for the charts!

Q8_0 quantized KV cache is basically lossless, so not much to improve there. One benefits from -khad and -vhad for KV cache quantized with less than 8 bits.
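The intuition for why the rotation helps at lower bit widths: an outlier in a head row forces a large absmax scale, wasting the few available quantization levels on everything else; an orthogonal rotation spreads the outlier's energy across all components, shrinking the scale. A toy sketch with a naive symmetric absmax quantizer (illustrative only, not the actual quant scheme used in ik_llama.cpp):

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def q4_roundtrip(x):
    # naive symmetric 4-bit absmax quantizer: levels -7..7 (illustrative only)
    scale = np.abs(x).max() / 7
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(scale=0.5, size=64)
x[0] = 10.0  # a single outlier dominates the absmax scale

err_plain = np.linalg.norm(q4_roundtrip(x) - x)
# rotate, quantize in the rotated basis, rotate back
err_rot = np.linalg.norm(fwht(q4_roundtrip(fwht(x))) - x)
```

In this toy setup `err_rot` comes out well below `err_plain`; with Q8_0 the step size is already tiny relative to the data, which is why the gains there are negligible.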

@MrHills-rs

With KHAD + VHAD

Seems not that much of an improvement? Either way, thanks for the charts!

Well, that's just PPL. When it's really high you can use it to tell that something is wrong, and when it's lower in a certain configuration you can guesstimate that that configuration is probably better for your use case.

But how much better isn't something you can estimate from a PPL delta. At the end of the day you're predicting Wikipedia text without even an instruct format, which has nothing to do with 99% of use cases. And it's probably far more bottlenecked by my Q5 MoE weights than by the KV cache precision.

That 0.002 might make or break some use cases.

@Ph0rk0z

Ph0rk0z commented Mar 29, 2026

I wonder if KLD from F16 can also be calculated. That way you don't need the KLD of the full model.
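For what it's worth, the per-token KLD only needs the two logit vectors at each position, one from the F16 baseline run and one from the configuration under test; no full-model reference beyond saved logits is required. A minimal sketch of the computation (the function names here are hypothetical, not part of this repo):

```python
import numpy as np

def log_softmax(logits):
    # subtract the max first for numerical stability; result is shift-invariant
    z = logits - logits.max()
    return z - np.log(np.sum(np.exp(z)))

def token_kld(logits_f16, logits_test):
    """KL(P_f16 || P_test) for a single token position, in nats."""
    lp = log_softmax(np.asarray(logits_f16, dtype=np.float64))
    lq = log_softmax(np.asarray(logits_test, dtype=np.float64))
    return float(np.sum(np.exp(lp) * (lp - lq)))
```

Averaging `token_kld` over all positions of an eval text gives the mean KLD; if I recall correctly, mainline's perplexity tool can do this from saved base logits via its `--kl-divergence` options, though I'd double-check the exact flags.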
