Conversation
"low-key project such as ik_llama.cpp" --> I still cannot wipe the stupid smile off my face! If only you knew how much of a difference this project makes to a lot of us! 👍
Happy to hear that it is useful for you. But that does not change the fact that
If you are open to accepting criticism... then here we go: your 'ik_llama.cpp' is more like Gentoo Linux vs Ubuntu: far more powerful for highly technical people who have a sweet tooth for trial and error and another HUGE sweet tooth for... masochism. In short, the biggest gripe I have so far (and compare it mentally to most of the other projects you measured GitHub stars against!) is the documentation level associated with the project: it's sparse, unfriendly, hardly maintained, all-over-the-place in "one" word. I still don't get (please, don't feel offended!) whether you don't care about it, don't have time, or simply don't know how to ask for support to improve the project's documentation. If I get a good answer, I might think about taking this on and proposing a documentation skeleton to give 'ik_llama.cpp' the friendliness and appreciation it deserves.
I kinda disagree. I came from the
There is documentation on the parameters now. That should be enough. It's just like any other program: at first it's a little overwhelming, and then you get used to it. You can always ask an AI for help, especially one with RAG; even the free search-engine AI is enough. Somehow people don't.
Implement the same scale adjustment optimization for Q5_0 KV cache that was already applied to Q4_0 (PR ikawrakow#1547) and Q6_0. This optimization computes an optimal scale factor that minimizes quantization error by:

1. Computing weighted sums `sumqx` and `sumq2` during quantization:
   - `w0 = v0*v0`, `w1 = v1*v1` (weights based on actual values)
   - `q0 = xi0 - 16`, `q1 = xi1 - 16` (quantized values offset)
   - `sumqx += w0*q0*v0 + w1*q1*v1`
   - `sumq2 += w0*q0*q0 + w1*q1*q1`
2. Setting the final scale as `y->d = sumqx/sumq2` when `sumq2 > 0`

This produces a computationally cheap but noticeable improvement in perplexity for KV cache quantization, similar to the results seen for Q4_0 in PR ikawrakow#1547. Based on work by Iwan Kawrakow (ikawrakow) - lead LLM quantization developer.
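As a sketch, the scale adjustment described above can be written as a standalone C function. This is an illustrative simplification, not the actual ik_llama.cpp quantization code: the block size, rounding, and function name here are assumptions, but the `sumqx`/`sumq2` accumulation follows the steps listed above.

```c
#include <math.h>

// Hedged sketch of the weighted least-squares scale adjustment.
// Quantize a block of floats to 5-bit levels (0..31, offset 16), then
// refine the scale d by minimizing sum_i w_i*(v_i - d*q_i)^2 with w_i = v_i^2.
// Returns the adjusted scale; q[] receives the quantized levels.
float quantize_block_q5_like(const float *v, int n, int *q) {
    // Initial scale maps the max-magnitude value to -16 (q5_0-style convention)
    float amax = 0.0f, maxv = 0.0f;
    for (int i = 0; i < n; ++i) {
        float a = fabsf(v[i]);
        if (a > amax) { amax = a; maxv = v[i]; }
    }
    if (amax == 0.0f) {                  // all-zero block: any scale works
        for (int i = 0; i < n; ++i) q[i] = 16;
        return 0.0f;
    }
    float d  = maxv / -16.0f;
    float id = 1.0f / d;
    float sumqx = 0.0f, sumq2 = 0.0f;
    for (int i = 0; i < n; ++i) {
        int xi = (int)(v[i] * id + 16.5f); // round to nearest, offset by 16
        if (xi < 0)  xi = 0;
        if (xi > 31) xi = 31;
        q[i] = xi;
        float w  = v[i] * v[i];            // weight by the value's magnitude
        float qq = (float)(xi - 16);       // signed quantized value
        sumqx += w * qq * v[i];
        sumq2 += w * qq * qq;
    }
    if (sumq2 > 0.0f) d = sumqx / sumq2;   // optimal scale for the fixed q[]
    return d;
}
```

Decoding is `v_i ≈ d * (q_i - 16)`; for the fixed quantized values the refined `d` is optimal in the weighted least-squares sense, so the weighted reconstruction error can only decrease relative to the initial scale.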
The first time I discovered ik_llama I looked at the release section of the page, saw no compiled releases, and just left. The bar to entry is too high if common people have to compile it. I know you're not chasing GitHub stars, but that is the "one simple trick popular repos don't want you to know!" There's so much great work here, but not everyone here codes or even understands 5% of this; they just want great performance and quality, both of which ik_llama provides. Anyway, I ended up coming back months later and compiling. Thanks for the great software! Truly leading the inference landscape.
Lol, I remember having the same thoughts. It might contribute to the stars situation a lot, indeed.
In China, you can find vendors selling GitHub star counts |
Qwen3.5-27B IQ5_KS:
- iq4_nl khad, vhad:
- iq4_nl (no khad/vhad):
- q8_0:
Even a low-key project such as `ik_llama.cpp` is being bombarded with TurboQuant hype. OK then, let us raise the bar even higher. The bar has been high in `ik_llama.cpp` since Hadamard transforms were added for the K-cache in PRs #1033 and #1034 a while ago. A Hadamard transform for the V-cache was added more recently in PR #1527 (but that has a much smaller impact than using Hadamard transforms for the K-cache).

This PR adds a scale adjustment for `Q4_0` quantized KV cache, for a computationally cheap but noticeable improvement as measured by perplexity.

I'm noticing that `IQ4_NL` KV cache, while much better than `Q4_0` without the Hadamard transform, with Hadamard either becomes worse than `Q4_0`, or the gap to `Q4_0` is much reduced. I'm still looking into that, so for now just `Q4_0` stuff. The same method used in this PR is already used for `Q6_0` KV cache, and could also be added for `Q5_0`.

Here are some examples. The importance of KV cache quantization errors increases with increasing context length. Hence, instead of the context of 512 customary in llama-land, the data in the table below is for a context of 8192 tokens. I did not go beyond 8192 tokens as the test corpus is `wiki.test.raw`, where very few articles exceed a context of 8k tokens. I did not add a comparison to a TurboQuant implementation, to avoid discussions about how there was still a lot that could be improved in the TurboQuant implementations floating around the Internet.
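For reference, the scale adjustment amounts to a weighted least-squares fit: given the already-chosen quantized values $q_i$ and weights $w_i = v_i^2$, the scale $d$ minimizing the weighted reconstruction error is obtained by setting the derivative to zero (this matches the `sumqx`/`sumq2` accumulators described in the commit message above):

```math
E(d) = \sum_i w_i \left( v_i - d\, q_i \right)^2, \qquad
\frac{dE}{dd} = -2 \sum_i w_i\, q_i \left( v_i - d\, q_i \right) = 0
\;\Rightarrow\;
d^{*} = \frac{\sum_i w_i\, q_i\, v_i}{\sum_i w_i\, q_i^2} = \frac{\texttt{sumqx}}{\texttt{sumq2}}
```

Since $E(d)$ is a quadratic in $d$, $d^{*}$ is the unique minimizer whenever $\texttt{sumq2} > 0$, which is why the adjustment is guarded by that condition.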