K-cache Hadamard transforms (CUDA) #1034

Merged
ikawrakow merged 2 commits into main from ik/k_cache_hadamard_cuda on Dec 4, 2025

Conversation

@ikawrakow
Owner

This PR is a follow-up to #1033 and adds support for K-cache Hadamard transforms on CUDA.
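Quick recap of why the transform helps (see #1033 for the full discussion): an $n \times n$ Hadamard matrix $H$ satisfies $H H^\top = n I$, so if the K rows are rotated with $H/\sqrt{n}$ before quantization and the Q rows are rotated the same way at attention time, the attention dot products are unchanged,

$$\left(\tfrac{1}{\sqrt{n}} H q\right)^{\top} \left(\tfrac{1}{\sqrt{n}} H k\right) = \tfrac{1}{n}\, q^\top H^\top H\, k = q^\top k,$$

while the rotation spreads activation outliers across the head dimension, which makes the rotated K much friendlier to low-bit quantization.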

There are various CUDA Hadamard transform implementations on the Internet, but me being me and not liking the addition of external dependencies (or copy-pasting a pile of code that I don't understand), I rolled my own. It is possibly not the fastest implementation out there, but based on the performance benchmarks below it cannot be totally bad either.
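To illustrate the general idea, here is a minimal sketch of the standard warp-level fast Walsh-Hadamard butterfly. This is not the actual kernel in this PR; the head size of 128, the register layout, the normalization, and the kernel name are all made up for the example:

```cuda
#include <cuda_runtime.h>

// Illustrative sketch only -- NOT the kernel from this PR.
// In-place fast Walsh-Hadamard transform of 128-element rows
// (e.g. one K head of size 128), one warp per row.
// Element i of a row lives in reg[i >> 5] of lane (i & 31).
__global__ void fwht_rows_128(float * x, int nrows) {
    const int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    const int lane = threadIdx.x & 31;
    if (row >= nrows) return;
    float * xr = x + 128 * row;

    float reg[4];
#pragma unroll
    for (int j = 0; j < 4; ++j) reg[j] = xr[lane + 32*j];

    // Butterflies over the 5 lane bits (strides 1..16) via warp shuffles.
#pragma unroll
    for (int stride = 1; stride < 32; stride <<= 1) {
#pragma unroll
        for (int j = 0; j < 4; ++j) {
            const float other = __shfl_xor_sync(0xffffffff, reg[j], stride);
            reg[j] = (lane & stride) ? other - reg[j] : reg[j] + other;
        }
    }
    // The remaining strides (32 and 64) pair registers within the thread.
#pragma unroll
    for (int j = 0; j < 2; ++j) { // stride 32: pairs (0,1) and (2,3)
        const float a = reg[2*j], b = reg[2*j+1];
        reg[2*j] = a + b; reg[2*j+1] = a - b;
    }
#pragma unroll
    for (int j = 0; j < 2; ++j) { // stride 64: pairs (0,2) and (1,3)
        const float a = reg[j], b = reg[j+2];
        reg[j] = a + b; reg[j+2] = a - b;
    }
    // Scale by 1/sqrt(128) so the transform is orthonormal (assumption:
    // the normalization convention may differ in the real implementation).
#pragma unroll
    for (int j = 0; j < 4; ++j) xr[lane + 32*j] = reg[j] * 0.08838834764831845f;
}
```

The nice property of this layout is that the whole transform runs out of registers: five of the seven butterfly stages are warp shuffles and the remaining two are register swaps, so there are no shared-memory round trips.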

Since PPL calculations can now be run on the GPU, and hence much more quickly, here are results for a bunch of models. In all cases the V-cache is left as f16.

Qwen3-30B-A3B, IQ2_XXS quantization

The Q6_0 result is somewhat peculiar, but Qwen3-30B-A3B does show some strange behavior when it comes to PPL (see #359).

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 10.5681           | N/A            | N/A                | N/A             |
| Q8_0    | 10.5727           | 10.5758        | 0.04%              | 0.07%           |
| Q6_0    | 10.4908           | 10.5779        | -0.73%             | 0.09%           |
| Q5_0    | 10.6518           | 10.5959        | 0.79%              | 0.26%           |
| Q4_0    | 11.7625           | 10.6654        | 11.30%             | 0.92%           |

Ling-Mini-2.0, Q4_K_M quantization

This model clearly does not like K-cache quantization with less than 8 bpw.

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 13.3744           | N/A            | N/A                | N/A             |
| Q8_0    | 13.3699           | 13.3659        | -0.03%             | -0.06%          |
| Q6_0    | 13.6463           | 13.4754        | 2.03%              | 0.76%           |
| Q5_0    | 14.4198           | 13.5800        | 7.82%              | 1.54%           |
| Q4_0    | 14.3242           | 14.0298        | 7.10%              | 4.90%           |

Ministral3-8B-Instruct, Q8_0 quantization

This model does not mind quantized K-cache.

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 7.5939            | N/A            | N/A                | N/A             |
| Q8_0    | 7.5935            | 7.5943         | 0.00%              | 0.00%           |
| Q6_0    | 7.5999            | 7.5940         | 0.08%              | 0.00%           |
| Q5_0    | 7.6097            | 7.5995         | 0.21%              | 0.07%           |
| Q4_0    | 7.6398            | 7.6281         | 0.60%              | 0.45%           |

TheDrummer_Tiger-Gemma-12B-v3, IQ4_NL quantization

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 8.7003            | N/A            | N/A                | N/A             |
| Q8_0    | 8.6972            | 8.6988         | -0.04%             | -0.02%          |
| Q6_0    | 8.7013            | 8.7007         | 0.01%              | 0.00%           |
| Q5_0    | 8.7236            | 8.6980         | 0.27%              | -0.03%          |
| Q4_0    | 8.8242            | 8.7641         | 1.42%              | 0.73%           |

GLM-4.6, 5.5 bpw Thireus mix

It takes about 17 minutes to run one PPL calculation on my 2x3090 box, so there are fewer results here.

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 3.4513            | N/A            | N/A                | N/A             |
| Q6_0    | 3.4557            | 3.4573         | 0.13%              | 0.17%           |
| Q4_0    | 3.5652            | 3.5173         | 3.30%              | 1.91%           |

Performance

Here are sweep-bench results for the Q8_0 quantized Ministral3-8B-Instruct with Q4_0 K-cache.

No Hadamard transform

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 1024 | 256 |     0 |  0.127 |  8042.47 |  3.513 |    72.86 |
| 1024 | 256 |  1024 |  0.131 |  7790.92 |  3.601 |    71.08 |
| 1024 | 256 |  2048 |  0.136 |  7518.19 |  3.684 |    69.48 |
| 1024 | 256 |  3072 |  0.142 |  7217.88 |  3.787 |    67.60 |
| 1024 | 256 |  4096 |  0.148 |  6925.42 |  3.851 |    66.47 |
| 1024 | 256 |  5120 |  0.153 |  6677.97 |  3.970 |    64.48 |
| 1024 | 256 |  6144 |  0.158 |  6490.21 |  4.027 |    63.57 |
| 1024 | 256 |  7168 |  0.164 |  6234.21 |  4.119 |    62.15 |
| 1024 | 256 |  8192 |  0.172 |  5970.53 |  4.193 |    61.06 |
| 1024 | 256 |  9216 |  0.177 |  5777.35 |  4.263 |    60.06 |
| 1024 | 256 | 10240 |  0.183 |  5604.48 |  4.371 |    58.56 |
| 1024 | 256 | 11264 |  0.188 |  5434.98 |  4.434 |    57.74 |
| 1024 | 256 | 12288 |  0.193 |  5314.87 |  4.537 |    56.43 |
| 1024 | 256 | 13312 |  0.199 |  5144.31 |  4.602 |    55.63 |
| 1024 | 256 | 14336 |  0.205 |  4992.98 |  4.681 |    54.69 |
| 1024 | 256 | 15360 |  0.211 |  4861.28 |  4.761 |    53.77 |

With Hadamard transform

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 1024 | 256 |     0 |  0.128 |  7973.22 |  3.530 |    72.53 |
| 1024 | 256 |  1024 |  0.132 |  7741.74 |  3.618 |    70.76 |
| 1024 | 256 |  2048 |  0.138 |  7443.97 |  3.700 |    69.19 |
| 1024 | 256 |  3072 |  0.142 |  7217.16 |  3.804 |    67.29 |
| 1024 | 256 |  4096 |  0.149 |  6879.41 |  3.868 |    66.18 |
| 1024 | 256 |  5120 |  0.155 |  6617.85 |  3.987 |    64.21 |
| 1024 | 256 |  6144 |  0.161 |  6371.01 |  4.044 |    63.30 |
| 1024 | 256 |  7168 |  0.166 |  6162.92 |  4.136 |    61.90 |
| 1024 | 256 |  8192 |  0.172 |  5956.78 |  4.208 |    60.83 |
| 1024 | 256 |  9216 |  0.176 |  5807.56 |  4.279 |    59.83 |
| 1024 | 256 | 10240 |  0.183 |  5598.66 |  4.391 |    58.30 |
| 1024 | 256 | 11264 |  0.189 |  5409.32 |  4.451 |    57.52 |
| 1024 | 256 | 12288 |  0.195 |  5264.40 |  4.553 |    56.22 |
| 1024 | 256 | 13312 |  0.199 |  5151.66 |  4.620 |    55.42 |
| 1024 | 256 | 14336 |  0.204 |  5008.68 |  4.696 |    54.52 |
| 1024 | 256 | 15360 |  0.211 |  4848.51 |  4.778 |    53.58 |

I.e., the performance impact due to the added Hadamard transform op is basically negligible.

Here are sweep-bench results for Ling-Mini-2.0, also with Q4_0 K-cache.

No Hadamard transform

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 2048 | 512 |     0 |  0.112 | 18233.94 |  1.067 |   480.03 |
| 2048 | 512 |  2048 |  0.111 | 18405.51 |  1.135 |   451.09 |
| 2048 | 512 |  4096 |  0.117 | 17456.68 |  1.202 |   425.87 |
| 2048 | 512 |  6144 |  0.124 | 16552.30 |  1.302 |   393.36 |
| 2048 | 512 |  8192 |  0.131 | 15628.81 |  1.347 |   380.21 |
| 2048 | 512 | 10240 |  0.138 | 14873.67 |  1.440 |   355.60 |
| 2048 | 512 | 12288 |  0.144 | 14224.10 |  1.489 |   343.94 |
| 2048 | 512 | 14336 |  0.151 | 13573.97 |  1.563 |   327.49 |
| 2048 | 512 | 16384 |  0.158 | 12940.57 |  1.646 |   310.97 |
| 2048 | 512 | 18432 |  0.165 | 12437.15 |  1.713 |   298.92 |
| 2048 | 512 | 20480 |  0.172 | 11933.55 |  1.819 |   281.46 |
| 2048 | 512 | 22528 |  0.177 | 11586.20 |  1.885 |   271.68 |
| 2048 | 512 | 24576 |  0.185 | 11075.48 |  1.997 |   256.34 |
| 2048 | 512 | 26624 |  0.192 | 10659.84 |  2.069 |   247.50 |
| 2048 | 512 | 28672 |  0.199 | 10287.17 |  2.173 |   235.65 |
| 2048 | 512 | 30720 |  0.205 |  9994.39 |  2.296 |   223.01 |

With Hadamard transform

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 2048 | 512 |     0 |  0.112 | 18249.70 |  1.094 |   467.96 |
| 2048 | 512 |  2048 |  0.112 | 18325.15 |  1.163 |   440.32 |
| 2048 | 512 |  4096 |  0.118 | 17385.84 |  1.227 |   417.26 |
| 2048 | 512 |  6144 |  0.124 | 16472.03 |  1.320 |   387.77 |
| 2048 | 512 |  8192 |  0.131 | 15648.04 |  1.376 |   372.21 |
| 2048 | 512 | 10240 |  0.138 | 14833.49 |  1.469 |   348.54 |
| 2048 | 512 | 12288 |  0.145 | 14121.22 |  1.517 |   337.48 |
| 2048 | 512 | 14336 |  0.151 | 13526.63 |  1.591 |   321.90 |
| 2048 | 512 | 16384 |  0.158 | 12921.54 |  1.673 |   305.97 |
| 2048 | 512 | 18432 |  0.165 | 12407.53 |  1.740 |   294.29 |
| 2048 | 512 | 20480 |  0.171 | 11952.35 |  1.847 |   277.28 |
| 2048 | 512 | 22528 |  0.178 | 11505.23 |  1.912 |   267.75 |
| 2048 | 512 | 24576 |  0.186 | 11020.77 |  2.024 |   252.94 |
| 2048 | 512 | 26624 |  0.194 | 10569.12 |  2.097 |   244.18 |
| 2048 | 512 | 28672 |  0.199 | 10277.05 |  2.202 |   232.54 |
| 2048 | 512 | 30720 |  0.205 |  9980.31 |  2.329 |   219.85 |

Here we see basically negligible impact for PP, but a nearly 3% drop in TG performance at zero context. My guess is that this is not because of the extra computation, but because of the additional 2 kernel launches per layer, which are not negligible at nearly 500 t/s. This particular model benefited massively from fused operations, with fusion increasing performance from about 400 t/s to 480 t/s. It is of course possible to think about fused kernels that include the Hadamard transformation, but I leave that for another day.
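A rough sanity check on the launch-overhead guess: at zero context the drop is from 480.03 t/s to 467.96 t/s, i.e.

$$\frac{1}{467.96\ \text{t/s}} - \frac{1}{480.03\ \text{t/s}} \approx 2.137\ \text{ms} - 2.083\ \text{ms} \approx 54\ \mu\text{s per token},$$

which is about what one would expect from 2 extra launches per layer at a typical launch latency of a microsecond or two over a few tens of layers (the per-launch latency and layer count here are ballpark assumptions, not measurements).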
