Even better Q4_0 KV cache#1547

Merged
ikawrakow merged 1 commit into main from ik/better_q40_kv_cache
Mar 30, 2026
Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Mar 29, 2026

Even a low-key project such as ik_llama.cpp is being bombarded with TurboQuant hype.

OK, then, let us raise the bar even higher. The bar has been high in ik_llama.cpp since Hadamard transforms were added for the K-cache in PRs #1033 and #1034 a while ago. Hadamard transform for V-cache was added more recently in PR #1527 (but that has a much smaller impact than using Hadamard transforms for the K-cache).
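For context, the Hadamard transform mentioned here rotates each cache row before quantization so that outliers get spread across the whole block, which makes the values friendlier to a single per-block scale. A minimal Python sketch of the fast Walsh-Hadamard transform (illustrative only; this is not ik_llama.cpp's actual kernel, and the function name is made up):

```python
import math

def fwht(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of 2.

    With the 1/sqrt(n) normalization below the transform is orthonormal
    and self-inverse: fwht(fwht(x)) == x.
    """
    n = len(x)
    h = 1
    while h < n:
        # butterfly stage: combine elements h apart
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    for i in range(n):
        x[i] *= scale
    return x
```

Because the transform is orthonormal, quantizing the rotated row and rotating back (or folding the inverse into the attention matmul) preserves dot products up to the quantization error, which is exactly why it helps the K-cache.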

This PR adds a scale adjustment for Q4_0 quantized KV cache, for a computationally cheap but noticeable improvement as measured by perplexity.

I'm noticing that the IQ4_NL KV cache, while much better than Q4_0 without the Hadamard transform, either becomes worse than Q4_0 or loses most of its advantage once the Hadamard transform is applied. I'm still looking into that, so for now this PR covers only Q4_0. The same method used in this PR is already used for the Q6_0 KV cache, and could also be added for Q5_0.

Here are some examples. The importance of KV cache quantization errors increases with increasing context length. Hence, instead of using a context of 512 customary in llama-land, the data in the table below is for a context of 8192 tokens. I did not go beyond 8192 tokens as the test corpus is wiki.test.raw, where very few articles exceed a context of 8k tokens. I did not add a comparison to a TurboQuant implementation to avoid discussions about how there was still a lot that could be improved in the TurboQuant implementations floating around the Internet.

| Model | PPL (f16) | PPL (Q4_0, main) | PPL (Q4_0, PR) | Delta (%, main) | Delta (%, PR) |
|---|---|---|---|---|---|
| Llama-3.1-8B, Q4_0 | 6.4258 | 6.5036 | 6.5025 | +1.21 | +1.19 |
| Llama-3.1-70B, Q4_0 | 3.5332 | 3.6129 | 3.5721 | +2.26 | +1.10 |
| Qwen3-8B-Base, bf16 | 6.0185 | 6.1448 | 6.1347 | +2.10 | +1.93 |
| Gemma3-12B, bf16 | 7.2896 | 7.3427 | 7.3406 | +0.73 | +0.70 |
| GLM-4.5-AIR, IQ4_KSS | 5.5990 | 5.6471 | 5.5826 | +0.86 | -0.29 |
| Qwen3.5-35B-A3B, IQ4_XS | 5.8992 | 5.9241 | 5.9211 | +0.42 | +0.37 |
| Qwen3.5-27B, Q4_K_S | 7.1535 | 7.1939 | 7.1493 | +0.56 | -0.06 |
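The Delta columns are just the relative PPL increase of the quantized KV cache over the f16 baseline; for reference, they can be reproduced from the PPL columns like so:

```python
def ppl_delta_pct(ppl_quant, ppl_f16):
    """Relative perplexity increase of a quantized KV cache over f16, in percent."""
    return 100.0 * (ppl_quant - ppl_f16) / ppl_f16

# Llama-3.1-8B with this PR: (6.5025 - 6.4258) / 6.4258 * 100
print(round(ppl_delta_pct(6.5025, 6.4258), 2))  # 1.19
```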

@vikcious

"low-key project such as ik_llama.cpp" --> I still cannot wipe away my stupid face smile! If only you'd know how much of a difference this project makes to a lot of us! 👍

@ikawrakow
Owner Author

"low-key project such as ik_llama.cpp" --> I still cannot wipe away my stupid face smile! If only you'd know how much of a difference this project makes to a lot of us! 👍

Happy to hear that it is useful for you. But that does not change the fact that ik_llama.cpp only has 1.9k GitHub stars, compared to 100k for llama.cpp, 23k for llamafile, 16k for KTransformers, etc. It is not that I subscribe to the idea that the number of GitHub stars is somehow a measure of quality, but it is indeed a measure of popularity, and 2k GitHub stars is still well within the low-key category.

@vikcious

> Happy to hear that it is useful for you. But that does not change the fact that ik_llama.cpp only has 1.9k GitHub stars, compared to 100k for llama.cpp, 23k for llamafile, 16k for KTransformers, etc. It is not that I subscribe to the idea that the number of GitHub stars is somehow a measure of quality, but it is indeed a measure of popularity, and 2k GitHub stars is still well within the low-key category.

If you are open to accepting criticism... then here we go: ik_llama.cpp is more like Gentoo Linux vs Ubuntu: far more powerful for highly technical people who have a sweet tooth for trial and error and another HUGE sweet tooth for... masochism. In short, the biggest gripe I have so far (compare it mentally to most of the other projects you measured GitHub stars against!) is the documentation: it's sparse, unfriendly, hardly maintained; all over the place, in "one" word.

I still don't understand (please don't be offended!) whether you don't care about it, don't have the time, or simply don't know how to ask for support to improve the project documentation. If I get a good answer, I might take this on and propose a documentation skeleton to give ik_llama.cpp the friendliness and appreciation it deserves.

@magikRUKKOLA

magikRUKKOLA commented Mar 29, 2026

@vikcious

> far more powerful for highly technical people who have a sweet tooth for trial and error and another HUGE sweet tooth for... masochism. In short, the biggest gripe I have so far (compare it mentally to most of the other projects you measured GitHub stars against!)

I kinda disagree. I came from ktransformers when I was trying to run DeepSeek-R1 locally. And I just could not: at 60-80k context the next inference run would just output garbage. I tried to debug here and there but failed. I created issues in ktransformers, in flashinfer, etc. No one -- I underscore, NO ONE -- ever responded to those issues.

@Ph0rk0z

Ph0rk0z commented Mar 29, 2026

We have documentation on the parameters now. That should be enough. It's just like any other program: at first it's a little overwhelming, and then you get used to it. You can always ask an AI for help, especially ones with RAG. Even the free search-engine AI is enough. Somehow people don't.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 29, 2026
Implement the same scale adjustment optimization for Q5_0 KV cache that was
already applied to Q4_0 (PR ikawrakow#1547) and Q6_0. This optimization computes an
optimal scale factor that minimizes quantization error by:

1. Computing weighted sums sumqx and sumq2 during quantization:
   - w0 = v0*v0, w1 = v1*v1 (weights based on actual values)
   - q0 = xi0 - 16, q1 = xi1 - 16 (quantized values offset)
   - sumqx += w0*q0*v0 + w1*q1*v1
   - sumq2 += w0*q0*q0 + w1*q1*q1

2. Setting the final scale as y->d = sumqx/sumq2 when sumq2 > 0

This produces a computationally cheap but noticeable improvement in perplexity
for KV cache quantization, similar to the results seen for Q4_0 in PR ikawrakow#1547.

Based on work by Iwan Kawrakow (ikawrakow) - lead LLM quantization developer.
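The steps in the commit message above amount to a weighted least-squares re-fit of the block scale: after the integer quants q_i are chosen, the scale d that minimizes sum_i w_i (x_i - d q_i)^2 with weights w_i = x_i^2 is d = sum(w q x) / sum(w q^2). A hedged Python sketch for a toy Q4_0-style block (illustrative only; function names and the initial-scale convention are assumptions, not the actual C code):

```python
def refine_scale(x, q):
    """Least-squares scale for quants q, weighting each term by x_i^2."""
    sumqx = sum((xi * xi) * qi * xi for xi, qi in zip(x, q))
    sumq2 = sum((xi * xi) * qi * qi for xi, qi in zip(x, q))
    return sumqx / sumq2 if sumq2 > 0 else 0.0

def quantize_q4_block(x):
    """Toy 4-bit block quantizer: quants in [-8, 7], then a refined scale."""
    amax = max(abs(v) for v in x)
    if amax == 0:
        return [0] * len(x), 0.0
    d = -amax / 8.0  # initial scale, following the usual Q4_0 convention
    q = [max(-8, min(7, round(v / d))) for v in x]
    return q, refine_scale(x, q)
```

When the block values are exactly representable, the refined scale reproduces the true one; otherwise it trades a little extra error on small values for less error on large ones, which is what the x_i^2 weighting buys.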
@vektorprime

"low-key project such as ik_llama.cpp" --> I still cannot wipe away my stupid face smile! If only you'd know how much of a difference this project makes to a lot of us! 👍

Happy to hear that it is useful for you. But that does not change the fact that ik_llama.cpp only has 1.9k GitHub stars, compared to 100k for llama.cpp, 23k for llamafile, 16k for KTranformers, etc. It is not that I subscribe to the concept that number of GitHub stars is somehow a measure of quality, but it is indeed a measure of popularity, and 2k GitHub stars is still well within the low-key category.

The first time I discovered ik_llama I looked at the release section of the page, saw no compiled releases, and just left. The barrier to entry is too high if ordinary people have to compile it. I know you're not chasing GitHub stars, but that is the "one simple trick popular repos don't want you to know!" There's so much great work here, but not everyone here codes or even understands 5% of this; they just want great performance and quality, both of which ik_llama provides. Anyway, I ended up coming back months later and compiling. Thanks for the great software! Truly leading the inference landscape.

@magikRUKKOLA

> The first time I discovered ik_llama I looked at the release section of the page, saw no compiled releases, and just left.

Lol I remember myself having the same thoughts. It might contribute to the stars situation a lot, indeed.

@chatchatgpt

In China, you can find vendors selling GitHub star counts. Personally, I don't place much value on star counts; even if they're all from real users, they only represent the recognition of others.

Nothing @ikawrakow does is for simple recognition. His GitHub profile says "gentleman at large", which translates to "leisurely gentleman" in Chinese. Look at his other projects, like sphere12d; even AI praises it highly. Though we're thousands of miles apart, I suspect ik is a top expert in his field. Such a leisurely gentleman, with no worries about living expenses, is purely here for entertainment. Playing around, he has reached the forefront of the LLM inference field.

Let others optimize the documentation, and then attract a bunch of merchants to package his code into high-performance, low-memory products and make a fortune. I'd rather see him spend more time on Vulkan and RPC.

@magikRUKKOLA

magikRUKKOLA commented Mar 30, 2026

@ikawrakow

Qwen3.5-27B IQ5_KS:

iq4_nl khad, vhad:

Final estimate: PPL over 580 chunks for n_ctx=512 = 6.8993 +/- 0.04498

iq4_nl (no khad/vhad):

Final estimate: PPL over 580 chunks for n_ctx=512 = 6.8996 +/- 0.04496

q8_0:

Final estimate: PPL over 580 chunks for n_ctx=512 = 6.8816 +/- 0.04483

@ikawrakow ikawrakow merged commit b9a2ce4 into main Mar 30, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 30, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 31, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 31, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 1, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 1, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 1, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 2, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 3, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 3, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Apr 4, 2026