K-cache Hadamard transforms (CUDA) #1034

Merged
ikawrakow merged 2 commits into main from ik/k_cache_hadamard_cuda on Dec 4, 2025

Conversation

@ikawrakow
Owner

This PR is a follow-up to #1033 and adds support for K-cache Hadamard transforms on CUDA.
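Quick recap of why the transform helps (see #1033 for the full discussion): an $n \times n$ Hadamard matrix $H$ satisfies $H H^\top = n I$, so if the K rows are rotated with $H/\sqrt{n}$ before quantization and the Q rows are rotated the same way at attention time, the attention dot products are unchanged,

$$\left(\tfrac{1}{\sqrt{n}} H q\right)^{\top} \left(\tfrac{1}{\sqrt{n}} H k\right) = \tfrac{1}{n}\, q^\top H^\top H\, k = q^\top k,$$

while the rotation spreads activation outliers across the head dimension, which makes the rotated K much friendlier to low-bit quantization.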

There are various CUDA Hadamard transform implementations on the Internet, but me being me and not liking the addition of external dependencies (or copy-pasting a pile of code that I don't understand), I rolled my own. It is possibly not the fastest implementation out there, but based on the performance benchmarks below it cannot be totally bad either.
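To illustrate the general idea, here is a minimal sketch of the standard warp-level fast Walsh-Hadamard butterfly. This is not the actual kernel in this PR; the head size of 128, the register layout, the normalization, and the kernel name are all made up for the example:

```cuda
#include <cuda_runtime.h>

// Illustrative sketch only -- NOT the kernel from this PR.
// In-place fast Walsh-Hadamard transform of 128-element rows
// (e.g. one K head of size 128), one warp per row.
// Element i of a row lives in reg[i >> 5] of lane (i & 31).
__global__ void fwht_rows_128(float * x, int nrows) {
    const int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    const int lane = threadIdx.x & 31;
    if (row >= nrows) return;
    float * xr = x + 128 * row;

    float reg[4];
#pragma unroll
    for (int j = 0; j < 4; ++j) reg[j] = xr[lane + 32*j];

    // Butterflies over the 5 lane bits (strides 1..16) via warp shuffles.
#pragma unroll
    for (int stride = 1; stride < 32; stride <<= 1) {
#pragma unroll
        for (int j = 0; j < 4; ++j) {
            const float other = __shfl_xor_sync(0xffffffff, reg[j], stride);
            reg[j] = (lane & stride) ? other - reg[j] : reg[j] + other;
        }
    }
    // The remaining strides (32 and 64) pair registers within the thread.
#pragma unroll
    for (int j = 0; j < 2; ++j) { // stride 32: pairs (0,1) and (2,3)
        const float a = reg[2*j], b = reg[2*j+1];
        reg[2*j] = a + b; reg[2*j+1] = a - b;
    }
#pragma unroll
    for (int j = 0; j < 2; ++j) { // stride 64: pairs (0,2) and (1,3)
        const float a = reg[j], b = reg[j+2];
        reg[j] = a + b; reg[j+2] = a - b;
    }
    // Scale by 1/sqrt(128) so the transform is orthonormal (assumption:
    // the normalization convention may differ in the real implementation).
#pragma unroll
    for (int j = 0; j < 4; ++j) xr[lane + 32*j] = reg[j] * 0.08838834764831845f;
}
```

The nice property of this layout is that the whole transform runs out of registers: five of the seven butterfly stages are warp shuffles and the remaining two are register swaps, so there are no shared-memory round trips.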

Since PPL calculations can now be run on the GPU, and hence much more quickly, here are results for a bunch of models. In all cases the V-cache is left as f16.

Qwen3-30B-A3B, IQ2_XXS quantization

The Q6_0 result is somewhat peculiar, but Qwen3-30B-A3B does show some strange behavior when it comes to PPL (see #359).

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 10.5681           | N/A            | N/A                | N/A             |
| Q8_0    | 10.5727           | 10.5758        | 0.04%              | 0.07%           |
| Q6_0    | 10.4908           | 10.5779        | -0.73%             | 0.09%           |
| Q5_0    | 10.6518           | 10.5959        | 0.79%              | 0.26%           |
| Q4_0    | 11.7625           | 10.6654        | 11.30%             | 0.92%           |

Ling-Mini-2.0, Q4_K_M quantization

This model clearly does not like K-cache quantization with less than 8 bpw.

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 13.3744           | N/A            | N/A                | N/A             |
| Q8_0    | 13.3699           | 13.3659        | -0.03%             | -0.06%          |
| Q6_0    | 13.6463           | 13.4754        | 2.03%              | 0.76%           |
| Q5_0    | 14.4198           | 13.5800        | 7.82%              | 1.54%           |
| Q4_0    | 14.3242           | 14.0298        | 7.10%              | 4.90%           |

Ministral3-8B-Instruct, Q8_0 quantization

This model does not mind quantized K-cache.

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 7.5939            | N/A            | N/A                | N/A             |
| Q8_0    | 7.5935            | 7.5943         | 0.00%              | 0.00%           |
| Q6_0    | 7.5999            | 7.5940         | 0.08%              | 0.00%           |
| Q5_0    | 7.6097            | 7.5995         | 0.21%              | 0.07%           |
| Q4_0    | 7.6398            | 7.6281         | 0.60%              | 0.45%           |

TheDrummer_Tiger-Gemma-12B-v3, IQ4_NL quantization

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 8.7003            | N/A            | N/A                | N/A             |
| Q8_0    | 8.6972            | 8.6988         | -0.04%             | -0.02%          |
| Q6_0    | 8.7013            | 8.7007         | 0.01%              | 0.00%           |
| Q5_0    | 8.7236            | 8.6980         | 0.27%              | -0.03%          |
| Q4_0    | 8.8242            | 8.7641         | 1.42%              | 0.73%           |

GLM-4.6, 5.5 bpw Thireus mix

It takes about 17 minutes to run one PPL calculation on my 2x3090 box, so there are fewer results here.

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no H) | Diff to f16 (H) |
|---------|-------------------|----------------|--------------------|-----------------|
| f16     | 3.4513            | N/A            | N/A                | N/A             |
| Q6_0    | 3.4557            | 3.4573         | 0.13%              | 0.17%           |
| Q4_0    | 3.5652            | 3.5173         | 3.30%              | 1.91%           |

Performance

Here are sweep-bench results for the Q8_0 quantized Ministral3-8B-Instruct with Q4_0 K-cache.

No Hadamard transform

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 1024 | 256 |     0 |  0.127 |  8042.47 |  3.513 |    72.86 |
| 1024 | 256 |  1024 |  0.131 |  7790.92 |  3.601 |    71.08 |
| 1024 | 256 |  2048 |  0.136 |  7518.19 |  3.684 |    69.48 |
| 1024 | 256 |  3072 |  0.142 |  7217.88 |  3.787 |    67.60 |
| 1024 | 256 |  4096 |  0.148 |  6925.42 |  3.851 |    66.47 |
| 1024 | 256 |  5120 |  0.153 |  6677.97 |  3.970 |    64.48 |
| 1024 | 256 |  6144 |  0.158 |  6490.21 |  4.027 |    63.57 |
| 1024 | 256 |  7168 |  0.164 |  6234.21 |  4.119 |    62.15 |
| 1024 | 256 |  8192 |  0.172 |  5970.53 |  4.193 |    61.06 |
| 1024 | 256 |  9216 |  0.177 |  5777.35 |  4.263 |    60.06 |
| 1024 | 256 | 10240 |  0.183 |  5604.48 |  4.371 |    58.56 |
| 1024 | 256 | 11264 |  0.188 |  5434.98 |  4.434 |    57.74 |
| 1024 | 256 | 12288 |  0.193 |  5314.87 |  4.537 |    56.43 |
| 1024 | 256 | 13312 |  0.199 |  5144.31 |  4.602 |    55.63 |
| 1024 | 256 | 14336 |  0.205 |  4992.98 |  4.681 |    54.69 |
| 1024 | 256 | 15360 |  0.211 |  4861.28 |  4.761 |    53.77 |

With Hadamard transform

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 1024 | 256 |     0 |  0.128 |  7973.22 |  3.530 |    72.53 |
| 1024 | 256 |  1024 |  0.132 |  7741.74 |  3.618 |    70.76 |
| 1024 | 256 |  2048 |  0.138 |  7443.97 |  3.700 |    69.19 |
| 1024 | 256 |  3072 |  0.142 |  7217.16 |  3.804 |    67.29 |
| 1024 | 256 |  4096 |  0.149 |  6879.41 |  3.868 |    66.18 |
| 1024 | 256 |  5120 |  0.155 |  6617.85 |  3.987 |    64.21 |
| 1024 | 256 |  6144 |  0.161 |  6371.01 |  4.044 |    63.30 |
| 1024 | 256 |  7168 |  0.166 |  6162.92 |  4.136 |    61.90 |
| 1024 | 256 |  8192 |  0.172 |  5956.78 |  4.208 |    60.83 |
| 1024 | 256 |  9216 |  0.176 |  5807.56 |  4.279 |    59.83 |
| 1024 | 256 | 10240 |  0.183 |  5598.66 |  4.391 |    58.30 |
| 1024 | 256 | 11264 |  0.189 |  5409.32 |  4.451 |    57.52 |
| 1024 | 256 | 12288 |  0.195 |  5264.40 |  4.553 |    56.22 |
| 1024 | 256 | 13312 |  0.199 |  5151.66 |  4.620 |    55.42 |
| 1024 | 256 | 14336 |  0.204 |  5008.68 |  4.696 |    54.52 |
| 1024 | 256 | 15360 |  0.211 |  4848.51 |  4.778 |    53.58 |

I.e., the performance impact due to the added Hadamard transform op is basically negligible.

Here are sweep-bench results for Ling-Mini-2.0, also with Q4_0 K-cache.

No Hadamard transform

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 2048 | 512 |     0 |  0.112 | 18233.94 |  1.067 |   480.03 |
| 2048 | 512 |  2048 |  0.111 | 18405.51 |  1.135 |   451.09 |
| 2048 | 512 |  4096 |  0.117 | 17456.68 |  1.202 |   425.87 |
| 2048 | 512 |  6144 |  0.124 | 16552.30 |  1.302 |   393.36 |
| 2048 | 512 |  8192 |  0.131 | 15628.81 |  1.347 |   380.21 |
| 2048 | 512 | 10240 |  0.138 | 14873.67 |  1.440 |   355.60 |
| 2048 | 512 | 12288 |  0.144 | 14224.10 |  1.489 |   343.94 |
| 2048 | 512 | 14336 |  0.151 | 13573.97 |  1.563 |   327.49 |
| 2048 | 512 | 16384 |  0.158 | 12940.57 |  1.646 |   310.97 |
| 2048 | 512 | 18432 |  0.165 | 12437.15 |  1.713 |   298.92 |
| 2048 | 512 | 20480 |  0.172 | 11933.55 |  1.819 |   281.46 |
| 2048 | 512 | 22528 |  0.177 | 11586.20 |  1.885 |   271.68 |
| 2048 | 512 | 24576 |  0.185 | 11075.48 |  1.997 |   256.34 |
| 2048 | 512 | 26624 |  0.192 | 10659.84 |  2.069 |   247.50 |
| 2048 | 512 | 28672 |  0.199 | 10287.17 |  2.173 |   235.65 |
| 2048 | 512 | 30720 |  0.205 |  9994.39 |  2.296 |   223.01 |

With Hadamard transform

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 2048 | 512 |     0 |  0.112 | 18249.70 |  1.094 |   467.96 |
| 2048 | 512 |  2048 |  0.112 | 18325.15 |  1.163 |   440.32 |
| 2048 | 512 |  4096 |  0.118 | 17385.84 |  1.227 |   417.26 |
| 2048 | 512 |  6144 |  0.124 | 16472.03 |  1.320 |   387.77 |
| 2048 | 512 |  8192 |  0.131 | 15648.04 |  1.376 |   372.21 |
| 2048 | 512 | 10240 |  0.138 | 14833.49 |  1.469 |   348.54 |
| 2048 | 512 | 12288 |  0.145 | 14121.22 |  1.517 |   337.48 |
| 2048 | 512 | 14336 |  0.151 | 13526.63 |  1.591 |   321.90 |
| 2048 | 512 | 16384 |  0.158 | 12921.54 |  1.673 |   305.97 |
| 2048 | 512 | 18432 |  0.165 | 12407.53 |  1.740 |   294.29 |
| 2048 | 512 | 20480 |  0.171 | 11952.35 |  1.847 |   277.28 |
| 2048 | 512 | 22528 |  0.178 | 11505.23 |  1.912 |   267.75 |
| 2048 | 512 | 24576 |  0.186 | 11020.77 |  2.024 |   252.94 |
| 2048 | 512 | 26624 |  0.194 | 10569.12 |  2.097 |   244.18 |
| 2048 | 512 | 28672 |  0.199 | 10277.05 |  2.202 |   232.54 |
| 2048 | 512 | 30720 |  0.205 |  9980.31 |  2.329 |   219.85 |

Here we see basically negligible impact for PP, but a nearly 3% drop in TG performance at zero context. My guess is that this is not because of the extra computation, but because of the additional 2 kernel launches per layer, which are not negligible at nearly 500 t/s. This particular model benefited massively from fused operations, with fusion increasing performance from about 400 t/s to 480 t/s. It is of course possible to think about fused kernels that include the Hadamard transformation, but I leave that for another day.
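A rough sanity check on the launch-overhead guess: at zero context the drop is from 480.03 t/s to 467.96 t/s, i.e.

$$\frac{1}{467.96\ \text{t/s}} - \frac{1}{480.03\ \text{t/s}} \approx 2.137\ \text{ms} - 2.083\ \text{ms} \approx 54\ \mu\text{s per token},$$

which is about what one would expect from 2 extra launches per layer at a typical launch latency of a microsecond or two over a few tens of layers (the per-launch latency and layer count here are ballpark assumptions, not measurements).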
