Commit 73ec4ca
KV cache: Add Q5_0 scale adjustment optimization
Implement the same scale adjustment optimization for Q5_0 KV cache that was
already applied to Q4_0 (PR ikawrakow#1547) and Q6_0. This optimization computes an
optimal scale factor that minimizes quantization error by:
1. Computing weighted sums sumqx and sumq2 during quantization:
- w0 = v0*v0, w1 = v1*v1 (weights proportional to the squared original values)
- q0 = xi0 - 16, q1 = xi1 - 16 (quantized codes recentered by the 5-bit offset of 16)
- sumqx += w0*q0*v0 + w1*q1*v1
- sumq2 += w0*q0*q0 + w1*q1*q1
2. Setting the final scale as y->d = sumqx/sumq2 when sumq2 > 0
This change is computationally cheap yet yields a noticeable improvement in perplexity
for KV cache quantization, similar to the results seen for Q4_0 in PR ikawrakow#1547.
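For reference, the d = sumqx/sumq2 formula is the closed-form weighted least-squares solution. A short derivation (my notation, not from the commit), with v_i the original values, q_i the recentered quantized codes, and w_i = v_i^2:

```latex
E(d) = \sum_i w_i \, (v_i - d\,q_i)^2
\qquad
\frac{dE}{dd} = -2 \sum_i w_i \, q_i \, (v_i - d\,q_i) = 0
\;\Longrightarrow\;
d^{*} = \frac{\sum_i w_i \, q_i \, v_i}{\sum_i w_i \, q_i^2}
      = \frac{\mathrm{sumqx}}{\mathrm{sumq2}}
```

The sumq2 > 0 guard covers the degenerate case where every code rounds to the offset (all q_i = 0), in which the initial scale is kept.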
Based on work by Iwan Kawrakow (ikawrakow), lead LLM quantization developer.

Parent commit: 0ddd2e9
1 file changed: 14 additions, 2 deletions