Even better Q4_0 KV cache#1547

Merged
ikawrakow merged 1 commit into main from ik/better_q40_kv_cache
Mar 30, 2026
Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Mar 29, 2026

Even a low-key project such as ik_llama.cpp is being bombarded with TurboQuant hype.

OK, then, let us raise the bar even higher. The bar has been high in ik_llama.cpp since Hadamard transforms were added for the K-cache in PRs #1033 and #1034 a while ago. Hadamard transform for V-cache was added more recently in PR #1527 (but that has a much smaller impact than using Hadamard transforms for the K-cache).
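For context, the Hadamard transform mentioned here rotates each cache row before quantization so that outliers get spread across the whole block, which makes the values friendlier to a single per-block scale. A minimal Python sketch of the fast Walsh-Hadamard transform (illustrative only; this is not ik_llama.cpp's actual kernel, and the function name is made up):

```python
import math

def fwht(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of 2.

    With the 1/sqrt(n) normalization below the transform is orthonormal
    and self-inverse: fwht(fwht(x)) == x.
    """
    n = len(x)
    h = 1
    while h < n:
        # butterfly stage: combine elements h apart
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    for i in range(n):
        x[i] *= scale
    return x
```

Because the transform is orthonormal, quantizing the rotated row and rotating back (or folding the inverse into the attention matmul) preserves dot products up to the quantization error, which is exactly why it helps the K-cache.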

This PR adds a scale adjustment for Q4_0 quantized KV cache, for a computationally cheap but noticeable improvement as measured by perplexity.

I'm noticing that the IQ4_NL KV cache, while much better than Q4_0 without the Hadamard transform, either becomes worse than Q4_0 or loses most of its advantage once the Hadamard transform is applied. I'm still looking into that, so for now this PR covers only Q4_0. The same method used in this PR is already used for the Q6_0 KV cache, and could also be added for Q5_0.

Here are some examples. The importance of KV cache quantization errors increases with increasing context length. Hence, instead of using a context of 512 customary in llama-land, the data in the table below is for a context of 8192 tokens. I did not go beyond 8192 tokens as the test corpus is wiki.test.raw, where very few articles exceed a context of 8k tokens. I did not add a comparison to a TurboQuant implementation to avoid discussions about how there was still a lot that could be improved in the TurboQuant implementations floating around the Internet.

| Model | PPL (f16) | PPL (Q4_0, main) | PPL (Q4_0, PR) | Delta (%, main) | Delta (%, PR) |
|---|---|---|---|---|---|
| Llama-3.1-8B, Q4_0 | 6.4258 | 6.5036 | 6.5025 | +1.21 | +1.19 |
| Llama-3.1-70B, Q4_0 | 3.5332 | 3.6129 | 3.5721 | +2.26 | +1.10 |
| Qwen3-8B-Base, bf16 | 6.0185 | 6.1448 | 6.1347 | +2.10 | +1.93 |
| Gemma3-12B, bf16 | 7.2896 | 7.3427 | 7.3406 | +0.73 | +0.70 |
| GLM-4.5-AIR, IQ4_KSS | 5.5990 | 5.6471 | 5.5826 | +0.86 | -0.29 |
| Qwen3.5-35B-A3B, IQ4_XS | 5.8992 | 5.9241 | 5.9211 | +0.42 | +0.37 |
| Qwen3.5-27B, Q4_K_S | 7.1535 | 7.1939 | 7.1493 | +0.56 | -0.06 |
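The Delta columns are just the relative PPL increase of the quantized KV cache over the f16 baseline; for reference, they can be reproduced from the PPL columns like so:

```python
def ppl_delta_pct(ppl_quant, ppl_f16):
    """Relative perplexity increase of a quantized KV cache over f16, in percent."""
    return 100.0 * (ppl_quant - ppl_f16) / ppl_f16

# Llama-3.1-8B with this PR: (6.5025 - 6.4258) / 6.4258 * 100
print(round(ppl_delta_pct(6.5025, 6.4258), 2))  # 1.19
```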

@vikcious

"low-key project such as ik_llama.cpp" --> I still cannot wipe away my stupid face smile! If only you'd know how much of a difference this project makes to a lot of us! 👍

@ikawrakow
Owner Author

"low-key project such as ik_llama.cpp" --> I still cannot wipe away my stupid face smile! If only you'd know how much of a difference this project makes to a lot of us! 👍

Happy to hear that it is useful for you. But that does not change the fact that ik_llama.cpp only has 1.9k GitHub stars, compared to 100k for llama.cpp, 23k for llamafile, 16k for KTransformers, etc. It is not that I subscribe to the idea that the number of GitHub stars is somehow a measure of quality, but it is indeed a measure of popularity, and 2k GitHub stars is still well within the low-key category.

@vikcious

> Happy to hear that it is useful for you. But that does not change the fact that ik_llama.cpp only has 1.9k GitHub stars, compared to 100k for llama.cpp, 23k for llamafile, 16k for KTransformers, etc. It is not that I subscribe to the idea that the number of GitHub stars is somehow a measure of quality, but it is indeed a measure of popularity, and 2k GitHub stars is still well within the low-key category.

If you are open to accepting criticism... then here we go: ik_llama.cpp is more like Gentoo Linux vs Ubuntu: far more powerful for highly technical people who have a sweet tooth for trial and error and another HUGE sweet tooth for... masochism. In short, the biggest gripe I have so far (compare it mentally to most of the other projects you measured GitHub stars against!) is the documentation: it's sparse, unfriendly, hardly maintained; all over the place, in "one" word.

I still don't understand (please don't be offended!) whether you don't care about it, don't have the time, or simply don't know how to ask for support to improve the project documentation. If I get a good answer, I might take this on and propose a documentation skeleton to give ik_llama.cpp the friendliness and appreciation it deserves.

@magikRUKKOLA

magikRUKKOLA commented Mar 29, 2026

@vikcious

> far more powerful for highly technical people who have a sweet tooth for trial and error and another HUGE sweet tooth for... masochism. In short, the biggest gripe I have so far (compare it mentally to most of the other projects you measured GitHub stars against!)

I kinda disagree. I came from ktransformers when I was trying to run DeepSeek-R1 locally. And I just could not: at 60-80k context the next inference run would just output garbage. I tried to debug here and there but failed. I created issues in ktransformers, in flashinfer, etc. No one -- I underscore, NO ONE -- ever responded to those issues.

@Ph0rk0z

Ph0rk0z commented Mar 29, 2026

We have documentation on the parameters now. That should be enough. It's just like any other program: at first it's a little overwhelming, and then you get used to it. You can always ask an AI for help, especially ones with RAG. Even the free search-engine AI is enough. Somehow people don't.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 29, 2026
Implement the same scale adjustment optimization for Q5_0 KV cache that was
already applied to Q4_0 (PR ikawrakow#1547) and Q6_0. This optimization computes an
optimal scale factor that minimizes quantization error by:

1. Computing weighted sums sumqx and sumq2 during quantization:
   - w0 = v0*v0, w1 = v1*v1 (weights based on actual values)
   - q0 = xi0 - 16, q1 = xi1 - 16 (quantized values offset)
   - sumqx += w0*q0*v0 + w1*q1*v1
   - sumq2 += w0*q0*q0 + w1*q1*q1

2. Setting the final scale as y->d = sumqx/sumq2 when sumq2 > 0

This produces a computationally cheap but noticeable improvement in perplexity
for KV cache quantization, similar to the results seen for Q4_0 in PR ikawrakow#1547.

Based on work by Iwan Kawrakow (ikawrakow) - lead LLM quantization developer.
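The steps in the commit message above amount to a weighted least-squares re-fit of the block scale: after the integer quants q_i are chosen, the scale d that minimizes sum_i w_i (x_i - d q_i)^2 with weights w_i = x_i^2 is d = sum(w q x) / sum(w q^2). A hedged Python sketch for a toy Q4_0-style block (illustrative only; function names and the initial-scale convention are assumptions, not the actual C code):

```python
def refine_scale(x, q):
    """Least-squares scale for quants q, weighting each term by x_i^2."""
    sumqx = sum((xi * xi) * qi * xi for xi, qi in zip(x, q))
    sumq2 = sum((xi * xi) * qi * qi for xi, qi in zip(x, q))
    return sumqx / sumq2 if sumq2 > 0 else 0.0

def quantize_q4_block(x):
    """Toy 4-bit block quantizer: quants in [-8, 7], then a refined scale."""
    amax = max(abs(v) for v in x)
    if amax == 0:
        return [0] * len(x), 0.0
    d = -amax / 8.0  # initial scale, following the usual Q4_0 convention
    q = [max(-8, min(7, round(v / d))) for v in x]
    return q, refine_scale(x, q)
```

When the block values are exactly representable, the refined scale reproduces the true one; otherwise it trades a little extra error on small values for less error on large ones, which is what the x_i^2 weighting buys.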
@vektorprime

"low-key project such as ik_llama.cpp" --> I still cannot wipe away my stupid face smile! If only you'd know how much of a difference this project makes to a lot of us! 👍

Happy to hear that it is useful for you. But that does not change the fact that ik_llama.cpp only has 1.9k GitHub stars, compared to 100k for llama.cpp, 23k for llamafile, 16k for KTranformers, etc. It is not that I subscribe to the concept that number of GitHub stars is somehow a measure of quality, but it is indeed a measure of popularity, and 2k GitHub stars is still well within the low-key category.

The first time I discovered ik_llama I looked at the release section of the page, saw no compiled releases, and just left. The barrier to entry is too high if ordinary people have to compile it. I know you're not chasing GitHub stars, but that is the "one simple trick popular repos don't want you to know!" There's so much great work here, but not everyone here codes or even understands 5% of this; they just want great performance and quality, both of which ik_llama provides. Anyway, I ended up coming back months later and compiling. Thanks for the great software! Truly leading the inference landscape.

@magikRUKKOLA

> The first time I discovered ik_llama I looked at the release section of the page, saw no compiled releases, and just left.

Lol I remember myself having the same thoughts. It might contribute to the stars situation a lot, indeed.

@chatchatgpt

In China, you can find vendors selling GitHub star counts. Personally, I don't place much value on star counts; even if they're all from real users, they only represent the recognition of others.

Nothing @ikawrakow does is for simple recognition. His GitHub profile says "gentleman at large", which translates to "leisurely gentleman" in Chinese. Look at his other projects, like sphere12d; even AI praises it highly. Though we're thousands of miles apart, I suspect ik is a top expert in his field. Such a leisurely gentleman, with no worries about living expenses, is purely here for entertainment. Playing around, he has reached the forefront of the LLM inference field.

Let others optimize the documentation, and then attract a bunch of merchants to package his code into high-performance, low-memory products and make a fortune. I'd rather see him spend more time on Vulkan and RPC.

@magikRUKKOLA

magikRUKKOLA commented Mar 30, 2026

@ikawrakow

Qwen3.5-27B IQ5_KS:

iq4_nl khad, vhad:

Final estimate: PPL over 580 chunks for n_ctx=512 = 6.8993 +/- 0.04498

iq4_nl (no khad/vhad):

Final estimate: PPL over 580 chunks for n_ctx=512 = 6.8996 +/- 0.04496

q8_0:

Final estimate: PPL over 580 chunks for n_ctx=512 = 6.8816 +/- 0.04483

@ikawrakow ikawrakow merged commit b9a2ce4 into main Mar 30, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 30, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 31, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 31, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 1, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 1, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 1, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 2, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 3, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 3, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 4, 2026