
Enable AVX-VNNI 256-bit path for Q4_K and Q5_K R4 matmul#1446

Merged
ikawrakow merged 1 commit into ikawrakow:main from accaldwell:avx-vnni-q4k-q5k-r4
Mar 17, 2026
Conversation

@accaldwell (Contributor)

@accaldwell accaldwell commented Mar 17, 2026

Hey @ikawrakow, back with a much more targeted optimization for AVX-VNNI CPUs compared to #1435 (and a human-written PR).

This adds the HAVE_VNNI256 macro, which applies to any CPU with AVX-VNNI or AVX512-VNNI+VL. If the extra x86 flavor is undesirable maintenance-wise, let me know.

The only other change gates four places behind HAVE_VNNI256 instead of HAVE_FANCY_SIMD. I chose the Q4_K and Q5_K R4 (repacked) paths since they cover the popular Q4_K_M quants, and the repacked mode was faster on this hardware regardless of the changes here.

To the best of my understanding, this code path doesn't involve any AVX-512 instructions at all, so a build for AVX512-VNNI and a build for AVX-VNNI should differ only in speed.

Headline results for Q4_K_S and Q4_K_M are very good:

  • up to +13% pp on Q4_K_S with -rtr 1
  • up to +9% pp on Q4_K_M with -rtr 1 (it uses more Q6_K, so a smaller gain is expected)

My benchmark methodology was much improved as well, so I am confident this is a solid improvement.

Below are benchmark results, perplexity testing (identical), and actual prompt testing (identical). The below content was co-written by myself and the coding agent that helped me prepare this PR.

Summary

Enable AVX-VNNI 256-bit vpdpbusd for Q4_K and Q5_K repacked (R4) matmul on CPUs that have VNNI but lack the full AVX-512 set (F+BW+DQ+VL+VNNI) required by HAVE_FANCY_SIMD. This covers Alder Lake, Raptor Lake, Meteor Lake, Sierra Forest and possibly other CPUs.

The change is two files, five hunks — four #ifdef HAVE_FANCY_SIMD guards relaxed to #ifdef HAVE_VNNI256 in the R4 matmul functions, plus a new macro definition.

What changed

  • ggml/src/iqk/iqk_config.h: New HAVE_VNNI256 macro, defined when __AVXVNNI__ or (__AVX512VNNI__ + __AVX512VL__). Separate from HAVE_FANCY_SIMD which requires the full AVX-512 set.
  • ggml/src/iqk/iqk_gemm_kquants.cpp: Four #ifdef HAVE_FANCY_SIMD guards changed to #ifdef HAVE_VNNI256 in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k.
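A sketch of the new gate, assuming the feature-test macros implied by the description above (the exact placement and surrounding code in iqk_config.h may differ):

```cpp
// iqk_config.h (sketch): the 256-bit vpdpbusd form is usable either
// natively via AVX-VNNI, or via AVX512-VNNI when AVX512-VL provides the
// 256-bit encodings. HAVE_FANCY_SIMD keeps its stricter requirement on
// the full AVX-512 set (F+BW+DQ+VL+VNNI).
#if defined(__AVXVNNI__) || (defined(__AVX512VNNI__) && defined(__AVX512VL__))
#define HAVE_VNNI256
#endif
```

With this in place, the four relaxed guards simply read `#ifdef HAVE_VNNI256` where they previously read `#ifdef HAVE_FANCY_SIMD`.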

Why this is safe

  1. Only VNNI + AVX2 intrinsics used: _mm256_dpbusd_epi32 (VNNI) and _mm256_cvtepi8_epi32 (AVX2). No AVX-512 BW/DQ instructions.
  2. Scale range: Q4_K and Q5_K scales are 6-bit (0..63). cvtepi8 and cvtepu8 agree for 0..127, so 0..63 is safe.
  3. No int16 saturation risk: Q4 max pair sum = 3810, Q5 max pair sum = 7874 — both within int16 range.
  4. Self-contained: Integer results convert to float and store at function end. No integer intermediates leak to downstream HAVE_FANCY_SIMD-gated code.
  5. Bit-identical perplexity — confirmed on wikitext-2-raw (see below).

Perplexity validation

| Model | Upstream PPL | VNNI PPL | Delta |
|---|---|---|---|
| Qwen3.5-0.8B Q4_K_S | 19.5512 ± 0.159 | 19.5512 ± 0.159 | 0.0000 |
| Llama-3.2-1B Q4_K_S | 14.5834 ± 0.109 | 14.5834 ± 0.109 | 0.0000 |
| Qwen3.5-0.8B Q4_K_M | 19.3688 ± 0.157 | 19.3688 ± 0.157 | 0.0000 |
| Llama-3.2-1B Q4_K_M | 14.4943 ± 0.109 | 14.4943 ± 0.109 | 0.0000 |

Benchmarks

Test setup: i5-13500, 6 P-cores pinned @ 2.5 GHz, turbo off, HT off, E-cores offline. Release build, -DGGML_NATIVE=ON, 8 repetitions.

Q4_K_S — -rtr 1 (repacked, VNNI-optimized path)

| Model | Test | Upstream (t/s) | VNNI (t/s) | Change |
|---|---|---|---|---|
| Qwen3.5-0.8B | pp512 | 378.82 ± 15.47 | 427.84 ± 4.73 | +12.9% |
| Qwen3.5-0.8B | tg128 | 52.66 ± 0.25 | 52.66 ± 0.28 | 0.0% |
| Llama-3.2-1B | pp512 | 263.52 ± 5.73 | 285.62 ± 4.16 | +8.4% |
| Llama-3.2-1B | tg128 | 45.43 ± 0.21 | 45.57 ± 0.25 | +0.3% |
| gemma-3-1b | pp512 | 292.37 ± 1.79 | 296.17 ± 4.22 | +1.3% |
| gemma-3-1b | tg128 | 42.75 ± 0.19 | 43.08 ± 0.15 | +0.8% |

Q4_K_S — -rtr 0 (non-repacked, control — no code change on this path)

| Model | Test | Upstream (t/s) | VNNI (t/s) | Change |
|---|---|---|---|---|
| Qwen3.5-0.8B | pp512 | 360.93 ± 8.63 | 377.83 ± 7.32 | +4.7% |
| Qwen3.5-0.8B | tg128 | 52.81 ± 0.32 | 52.60 ± 0.45 | -0.4% |
| Llama-3.2-1B | pp512 | 234.36 ± 9.16 | 241.50 ± 1.54 | +3.0% |
| Llama-3.2-1B | tg128 | 45.29 ± 0.24 | 45.59 ± 0.51 | +0.7% |
| gemma-3-1b | pp512 | 277.34 ± 1.82 | 272.86 ± 5.67 | -1.6% |
| gemma-3-1b | tg128 | 41.77 ± 0.16 | 40.86 ± 1.31 | -2.2% |

Q4_K_M — -rtr 1 (repacked, VNNI-optimized path)

| Model | Test | Upstream (t/s) | VNNI (t/s) | Change |
|---|---|---|---|---|
| Qwen3.5-0.8B | pp512 | 379.27 ± 9.97 | 414.08 ± 8.51 | +9.2% |
| Qwen3.5-0.8B | tg128 | 50.50 ± 0.33 | 50.25 ± 0.23 | -0.5% |
| Llama-3.2-1B | pp512 | 253.44 ± 4.18 | 269.40 ± 3.27 | +6.3% |
| Llama-3.2-1B | tg128 | 43.18 ± 0.11 | 43.08 ± 0.34 | -0.2% |
| gemma-3-1b | pp512 | 285.70 ± 3.08 | 286.00 ± 4.34 | +0.1% |
| gemma-3-1b | tg128 | 41.26 ± 0.27 | 41.37 ± 0.16 | +0.3% |

Q4_K_M — -rtr 0 (non-repacked, control — no code change on this path)

| Model | Test | Upstream (t/s) | VNNI (t/s) | Change |
|---|---|---|---|---|
| Qwen3.5-0.8B | pp512 | 364.34 ± 7.72 | 353.98 ± 12.58 | -2.8% |
| Qwen3.5-0.8B | tg128 | 50.19 ± 0.47 | 49.23 ± 0.84 | -1.9% |
| Llama-3.2-1B | pp512 | 226.44 ± 3.62 | 225.86 ± 3.66 | -0.3% |
| Llama-3.2-1B | tg128 | 41.57 ± 0.54 | 41.90 ± 0.53 | +0.8% |
| gemma-3-1b | pp512 | 271.59 ± 4.18 | 265.36 ± 4.81 | -2.3% |
| gemma-3-1b | tg128 | 39.39 ± 0.69 | 40.04 ± 0.81 | +1.7% |

Text generation QA

To verify that the VNNI path produces correct output beyond perplexity, we ran llama-cli in non-interactive mode with a fixed seed (42) across 5 general knowledge prompts, 3 models (Qwen3.5-0.8B, Llama-3.2-1B, gemma-3-1b, all Q4_K_M), and both -rtr 0 and -rtr 1 — 30 comparisons total.

All 30 upstream/VNNI output pairs are byte-identical (verified via SHA-256). Human review spot-checked the outputs to confirm the generated text is coherent and intelligible.

Prompts used
  1. "What is the difference between Celsius and Kelvin?"
  2. "Explain why the sky appears blue during the day."
  3. "What causes tides in the ocean?"
  4. "How does a battery store and release energy?"
  5. "Why do we have leap years?"

Add new CPU macro HAVE_VNNI256 for CPUs with 256-bit VNNI
(AVX-VNNI) support or better (AVX512-VNNI+VL), separate from
HAVE_FANCY_SIMD which requires the full AVX-512 set. Relax four
#ifdef guards in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k
to use HAVE_VNNI256 instead of HAVE_FANCY_SIMD, enabling vpdpbusd
and cvtepi8_epi32 on Alder Lake, Raptor Lake, and similar CPUs.
@accaldwell accaldwell marked this pull request as ready for review March 17, 2026 05:33
Owner

@ikawrakow ikawrakow left a comment

Thank you. This is much easier!

