
Enable AVX-VNNI 256-bit path for Q3_K R4 matmul#1472

Merged
ikawrakow merged 1 commit into ikawrakow:main from accaldwell:ac/vnni-q3k-r4
Mar 23, 2026
Conversation

@accaldwell (Contributor) commented Mar 20, 2026

Enable VNNI (vpdpbusd and vpdpwssd) in the 256-bit path of mul_mat_q3_k_r4_q8_k. This relaxes the FANCY_SIMD gate to HAVE_VNNI256 and refactors the kernel to use an approach between the old FANCY and fallback modes.

Why not just ungate the existing FANCY_SIMD code? In performance profiling of token generation, doing so merely moved the compute bottleneck to vpmulld. Token generation isn't compute limited, but the approach in this PR is more efficient, and profiling shows the kernel is now even more dominated by vmovdqu (memory loads/stores). Note that this also changes how AVX-512 platforms run this kernel: it's possible the original code is better on AVX-512 CPUs and this PR will cause a regression there. I am willing to do AVX-512 testing if needed.

Prompt processing naturally benefits from this PR thanks to the now-familiar change of using fused VNNI instructions in place of the AVX2 alternatives.

Performance

Qwen3.5-2B, Q3_K_M, -rtr 1

q3km_sweep_comparison

QA

Qwen3.5-2B, Q3_K_M, --run-time-repack

Text generation on general knowledge questions was bit-identical. Perplexity was identical on a full run (13.8283 +/- 0.10343).

Details

More details below the fold (details section co-written with an agent)

Approach

A naive port of the HAVE_FANCY_SIMD path introduced a vpmulld bottleneck. The original AVX-512 path deferred float conversion and accumulated integer results using _mm256_mullo_epi32(iscales, sumi). On Raptor Lake P-cores, vpmulld consumed ~15% of the function's cycles, replacing the memory-move stall as the dominant bottleneck.

The hybrid approach taken here uses VNNI for the dot products only, while keeping the baseline's per-sub-block float accumulation strategy:

  • _mm256_maddubs_epi16 + _mm256_add_epi16 + _mm256_madd_epi16 (3 instructions) replaced by _mm256_dpbusd_epi32 (1 instruction)
  • _mm256_madd_epi16 + _mm256_add_epi32 replaced by _mm256_dpwssd_epi32 for bsums correction
  • No isum integer accumulator, no vpmulld, no deferred float conversion
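
The substitution in the first bullet can be illustrated with a scalar model of the per-dword-lane semantics (a sketch with hypothetical helper names, not code from the PR). The AVX2 chain saturates its 16-bit intermediates, which is safe here because the unsigned Q3_K operand is small after unpacking, so the two forms agree:

```cpp
#include <algorithm>
#include <cstdint>

// One 32-bit lane of the AVX2 chain: vpmaddubsw forms two saturated 16-bit
// sums of adjacent u8*s8 products, then vpmaddwd (against ones) widens and
// adds them into a 32-bit result.
inline int32_t dot4_avx2_chain(const uint8_t u[4], const int8_t s[4]) {
    auto sat16 = [](int32_t v) -> int16_t {
        return (int16_t)std::min(32767, std::max(-32768, v));
    };
    int16_t lo = sat16((int32_t)u[0]*s[0] + (int32_t)u[1]*s[1]);
    int16_t hi = sat16((int32_t)u[2]*s[2] + (int32_t)u[3]*s[3]);
    return (int32_t)lo + (int32_t)hi;
}

// One 32-bit lane of vpdpbusd: accumulate all four u8*s8 products directly
// into the 32-bit accumulator, with no 16-bit saturation step.
inline int32_t dot4_vnni(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    for (int i = 0; i < 4; ++i) acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}
```

With the unsigned operand bounded by the 3-bit quant range (at most 7 times an int8), the 16-bit partial sums cannot saturate, so one vpdpbusd per lane replaces the three-instruction chain exactly.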

PP perf profile (pp512, 6 threads)

| Metric | Baseline | Hybrid (VNNI256) | Delta |
|---|---|---|---|
| Throughput | 295 t/s | 319 t/s | +8.1% |
| Instructions | 658.9B | 610.2B | −7.4% |
| Cycles | 220.0B | 202.9B | −7.8% |
| IPC | 3.00 | 3.01 | ~same |

The speedup comes entirely from executing fewer instructions at the same IPC. The VNNI path replaces a 3-instruction dot product chain with a 2-instruction chain:

| Instruction | Baseline | Hybrid |
|---|---|---|
| vpmaddubsw | 32 | -- |
| vpaddw | 24 | -- |
| vpmaddwd | 40 | -- |
| vpdpbusd | -- | 32 |
| vpdpwssd | -- | 32 |
| Dot product total | 96 | 64 |

96 → 64 dot product instructions (−33%). Total function instructions: 820 → 779 (−5%). The float accumulation path (vcvtdq2ps, vfmadd*) is unchanged at 33 instructions each.

Instruction profile, prompt processing (mul_mat_q3_k_r4_q8_k, top instructions)

| Instruction | Baseline | Hybrid |
|---|---|---|
| vpdpbusd | -- | ~20% |
| vpshufd | ~14% | ~16% |
| vpmaddubsw | ~14% | -- |
| vpaddw | ~11% | -- |
| vfmadd213ps | ~9% | ~6% |
| vmovdqu (move) | ~5% | ~8% |
| vcvtdq2ps | ~6% | ~7% |
| vpmaddwd | ~6% | -- |
| vpsrlw | ~5% | ~6% |
| vmovaps | ~5% | ~4% |
| vpdpwssd | -- | ~3% |
| vmulps | ~3% | ~3% |
| vfmadd231ps | ~3% | ~2% |

In PP, vmovdqu is only ~5-8% of cycles (vs ~31-36% in TG), confirming the workload is compute-bound. The dot product chain (vpmaddubsw + vpaddw + vpmaddwd = ~31%) is replaced by vpdpbusd + vpdpwssd (~23%), directly reducing compute pressure.

Instruction profile, token generation (mul_mat_q3_k_r4_q8_k, top instructions)

| Instruction | Baseline | Hybrid |
|---|---|---|
| vmovdqu (move) | ~31% | ~36% |
| vpmaddubsw | ~10% | -- |
| vpmaddwd | ~10% | -- |
| vpdpbusd | -- | ~11% |
| vpmulld | -- | -- |

The hybrid path eliminates vpmaddubsw + vpmaddwd in favor of vpdpbusd, with no vpmulld bottleneck. The memory move (vmovdqu) is now more dominant, confirming the workload is cleanly memory-bound.

Why not a naive port of HAVE_FANCY_SIMD?

A direct HAVE_FANCY_SIMD -> HAVE_VNNI256 substitution was tested first. That path defers float conversion and uses _mm256_mullo_epi32(iscales, sumi) to accumulate integer results. On Raptor Lake P-cores, vpmulld consumed ~15% of the function's cycles, becoming the second hottest instruction after the memory move and replacing the memory stall as the compute bottleneck. On E-cores (Gracemont here) the problem was worse: vpmulld hit ~13% and overall throughput regressed by ~5% (20.96 vs 22.04 t/s).
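
A scalar analogue of the two accumulation strategies (hypothetical helper names; the real kernel works on 256-bit vectors, so each line below stands in for a vector op):

```cpp
#include <cstdint>

// Deferred-integer accumulation (the HAVE_FANCY_SIMD style): multiply each
// sub-block dot product by its integer scale (vpmulld in vector form) and
// convert to float once at the end.
inline float accumulate_deferred_int(const int32_t* sumi, const int32_t* iscales,
                                     int n_sub, float d) {
    int32_t isum = 0;
    for (int i = 0; i < n_sub; ++i)
        isum += iscales[i] * sumi[i];   // the vpmulld hotspot
    return d * (float)isum;             // single convert + multiply
}

// Per-sub-block float accumulation (the baseline/hybrid style): convert each
// sub-block result to float and fold the scale into an FMA, avoiding vpmulld.
inline float accumulate_float(const int32_t* sumi, const int32_t* iscales,
                              int n_sub, float d) {
    float facc = 0.0f;
    for (int i = 0; i < n_sub; ++i)
        facc += (float)iscales[i] * (float)sumi[i];  // vcvtdq2ps + vfmadd
    return d * facc;
}
```

Both styles give identical results while the per-term products stay within float's exact-integer range (below 2^24), which the small Q3_K scales and sums comfortably do.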

@accaldwell accaldwell marked this pull request as ready for review March 20, 2026 04:15
@accaldwell (Contributor, Author)

Unsurprisingly, this benefits anywhere Q3_K is found:

compare_all_quants

@ikawrakow (Owner)

> its possible the original code is better on AVX-512 CPUs and this PR will cause a regression.

It does indeed cause a regression. On my Ryzen-7950X CPU with Llama-3.1-8B quantized to Q3_K_S using -rtr:

Main branch

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 64 | 0 | 4.039 | 253.51 | 3.806 | 16.81 |
| 1024 | 64 | 1024 | 4.341 | 235.88 | 4.058 | 15.77 |
| 1024 | 64 | 2048 | 4.632 | 221.05 | 4.127 | 15.51 |
| 1024 | 64 | 3072 | 4.950 | 206.89 | 4.359 | 14.68 |
| 1024 | 64 | 4096 | 5.253 | 194.93 | 4.522 | 14.15 |
| 1024 | 64 | 5120 | 5.567 | 183.94 | 4.687 | 13.66 |
| 1024 | 64 | 6144 | 5.899 | 173.58 | 4.850 | 13.20 |
| 1024 | 64 | 7168 | 6.235 | 164.23 | 5.010 | 12.77 |
| 1024 | 64 | 8192 | 6.588 | 155.44 | 5.170 | 12.38 |

This PR

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 64 | 0 | 4.353 | 235.26 | 3.795 | 16.86 |
| 1024 | 64 | 1024 | 4.638 | 220.79 | 4.068 | 15.73 |
| 1024 | 64 | 2048 | 4.947 | 206.99 | 4.176 | 15.32 |
| 1024 | 64 | 3072 | 5.272 | 194.23 | 4.360 | 14.68 |
| 1024 | 64 | 4096 | 5.577 | 183.61 | 4.519 | 14.16 |
| 1024 | 64 | 5120 | 5.893 | 173.78 | 4.679 | 13.68 |
| 1024 | 64 | 6144 | 6.213 | 164.82 | 4.840 | 13.22 |
| 1024 | 64 | 7168 | 6.551 | 156.31 | 5.006 | 12.79 |
| 1024 | 64 | 8192 | 6.900 | 148.40 | 5.163 | 12.40 |

TG is within the noise (because it is memory-bandwidth bound). PP with this PR is ~8% lower. The HAVE_FANCY_SIMD implementation is not written this way by random chance.

So, I think you need to remove the changes to the HAVE_FANCY_SIMD path and have a separate path for HAVE_VNNI.

Just for fun, here is what one gets without rtr on the Ryzen-7950X:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 64 | 0 | 2.570 | 398.44 | 3.788 | 16.89 |
| 1024 | 64 | 1024 | 2.863 | 357.62 | 3.964 | 16.15 |
| 1024 | 64 | 2048 | 3.152 | 324.89 | 4.100 | 15.61 |
| 1024 | 64 | 3072 | 3.455 | 296.42 | 4.242 | 15.09 |
| 1024 | 64 | 4096 | 3.744 | 273.49 | 4.387 | 14.59 |
| 1024 | 64 | 5120 | 4.048 | 252.94 | 4.535 | 14.11 |
| 1024 | 64 | 6144 | 4.377 | 233.96 | 4.681 | 13.67 |
| 1024 | 64 | 7168 | 4.714 | 217.22 | 4.825 | 13.26 |
| 1024 | 64 | 8192 | 5.056 | 202.55 | 4.974 | 12.87 |

@accaldwell (Contributor, Author)

Refactoring this to keep a FANCY path with the original code, instead of sending FANCY platforms to the new VNNI256 path, should be straightforward. I'll work on a v2.

@accaldwell accaldwell marked this pull request as draft March 20, 2026 09:20
Add a separate HAVE_VNNI256 code path using _mm256_dpwssd_epi32 and
_mm256_dpbusd_epi32 for the Q3_K R4 kernel. The existing HAVE_FANCY_SIMD
(AVX-512 VNNI) path is preserved unchanged.
@accaldwell (Contributor, Author) commented Mar 20, 2026

The initial version of this patch had a two-tier hierarchy: VNNI256 (including FANCY), falling back to AVX2. I changed the existing FANCY code in a way that worked well for me on VNNI256, but caused a regression on FANCY.

The new approach is three tier: FANCY keeps its original code, falling back to VNNI256, then to AVX2.

Even aside from the confirmed regression with my previous attempt, this feels like a better, more conservative change.

I benchmarked pp against 0c9bc3e with and without this patch: 248.51 ± 0.79 -> 273.70 ± 1.41 t/s on Qwen3.5-2B Q3_K_M. Text generation is a perfect match. I didn't test perplexity, but I'm happy to do so if needed.

@accaldwell accaldwell marked this pull request as ready for review March 20, 2026 15:49
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 21, 2026
@ikawrakow ikawrakow merged commit 47e4015 into ikawrakow:main Mar 23, 2026
