Enable AVX-VNNI 256-bit path for Q4_K and Q5_K R4 matmul#1446
Merged
ikawrakow merged 1 commit into ikawrakow:main on Mar 17, 2026
Add new CPU macro HAVE_VNNI256 for CPUs with 256-bit VNNI (AVX-VNNI) support or better (AVX512-VNNI+VL), separate from HAVE_FANCY_SIMD which requires the full AVX-512 set. Relax four #ifdef guards in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k to use HAVE_VNNI256 instead of HAVE_FANCY_SIMD, enabling vpdpbusd and cvtepi8_epi32 on Alder Lake, Raptor Lake, and similar CPUs.
ikawrakow (Owner) approved these changes on Mar 17, 2026 and left a comment:
Thank you. This is much easier!
Hey @ikawrakow, back with a much more targeted optimization for AVX-VNNI CPUs compared to #1435 (and a human-written PR).
This adds the HAVE_VNNI256 macro, which applies to any CPU with AVX-VNNI or AVX512-VNNI+VL. If the extra x86 flavor is undesirable maintenance-wise, let me know.
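As a rough illustration of the gating described above, the two macros could be derived from the compiler's predefined feature macros along these lines. This is a sketch based only on the conditions stated in this PR; the actual wording in iqk_config.h may differ, and the helper functions have_fancy_simd, have_vnni256 and report_simd_gates are hypothetical names added for the example.

```cpp
#include <cstdio>

// Sketch only: the real definitions live in ggml/src/iqk/iqk_config.h and
// may be written differently. Conditions are taken from the PR description.

// Full AVX-512 set (F+BW+DQ+VL+VNNI).
#if defined(__AVX512F__) && defined(__AVX512BW__) && defined(__AVX512DQ__) && \
    defined(__AVX512VL__) && defined(__AVX512VNNI__)
#define HAVE_FANCY_SIMD 1
#endif

// 256-bit VNNI: either AVX-VNNI, or AVX512-VNNI together with AVX512-VL.
#if defined(__AVXVNNI__) || (defined(__AVX512VNNI__) && defined(__AVX512VL__))
#define HAVE_VNNI256 1
#endif

// Hypothetical helpers that expose the compile-time result as booleans.
constexpr bool have_fancy_simd() {
#ifdef HAVE_FANCY_SIMD
    return true;
#else
    return false;
#endif
}

constexpr bool have_vnni256() {
#ifdef HAVE_VNNI256
    return true;
#else
    return false;
#endif
}

// Any build that qualifies for HAVE_FANCY_SIMD also qualifies for
// HAVE_VNNI256, because the former requires both VNNI and VL.
static_assert(!have_fancy_simd() || have_vnni256(),
              "HAVE_FANCY_SIMD should imply HAVE_VNNI256");

// Hypothetical reporting helper: prints which gates this build enables.
void report_simd_gates() {
    std::printf("HAVE_FANCY_SIMD: %s\n", have_fancy_simd() ? "yes" : "no");
    std::printf("HAVE_VNNI256:    %s\n", have_vnni256() ? "yes" : "no");
}
```

Building with -march=native (or -DGGML_NATIVE=ON in this project) on an Alder Lake or Raptor Lake machine would be expected to enable only the second gate, which is the case this PR targets.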
The only other change switches four guards from HAVE_FANCY_SIMD to HAVE_VNNI256. I chose Q4_K and Q5_K R4 (repacked) since that covers the popular Q4_K_M quants, and the repacked mode was faster on this hardware regardless of the changes here.
To the best of my understanding, this code path doesn't involve any AVX-512 code at all, so there should be no difference between running this built for AVX512-VNNI and built for AVX-VNNI, other than speed.
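For readers unfamiliar with the two operations this path relies on, here is a scalar model of what they compute. It is an illustrative sketch, not code from the PR: dpbusd_lane models a single 32-bit lane of vpdpbusd (_mm256_dpbusd_epi32 does this for eight lanes at once), and the two extend helpers model one element of cvtepi8 versus cvtepu8. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// One 32-bit lane of vpdpbusd: four unsigned bytes from `a` are multiplied
// by four signed bytes from `b`, and the products are added to `acc`.
// The instruction does not saturate, so wrap-around is modeled with
// unsigned arithmetic.
std::int32_t dpbusd_lane(std::int32_t acc, const std::uint8_t a[4],
                         const std::int8_t b[4]) {
    std::uint32_t sum = static_cast<std::uint32_t>(acc);
    for (int i = 0; i < 4; ++i) {
        const std::int32_t prod =
            static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
        sum += static_cast<std::uint32_t>(prod);
    }
    return static_cast<std::int32_t>(sum);
}

// One element of cvtepi8_epi32: the byte is treated as signed.
std::int32_t sign_extend_byte(std::uint8_t raw) {
    return static_cast<std::int32_t>(static_cast<std::int8_t>(raw));
}

// One element of cvtepu8_epi32: the byte is treated as unsigned.
std::int32_t zero_extend_byte(std::uint8_t raw) {
    return static_cast<std::int32_t>(raw);
}

// Largest n such that both extensions agree for every byte value in 0..n.
int largest_agreeing_byte() {
    int n = -1;
    for (int v = 0; v < 256; ++v) {
        const std::uint8_t raw = static_cast<std::uint8_t>(v);
        if (sign_extend_byte(raw) != zero_extend_byte(raw)) break;
        n = v;
    }
    return n;
}

// Checks the claim from the PR: values in 0..63 convert identically
// under sign extension and zero extension.
void check_extension_claim() {
    for (int v = 0; v <= 63; ++v) {
        const std::uint8_t raw = static_cast<std::uint8_t>(v);
        assert(sign_extend_byte(raw) == zero_extend_byte(raw));
    }
}
```

The two extensions only diverge once the top bit of the byte is set, which is why a value range of 0..63 is unaffected by the choice of conversion.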
Headline results for Q4_K_S and Q4_K_M are very good:
up to +13% pp on Q4_K_S with -rtr 1
up to +9% pp on Q4_K_M with -rtr 1 (uses more Q6_K, so this is expected)
My benchmark methodology is much improved as well, so I am confident this is a solid improvement.
Below are benchmark results, perplexity testing (identical), and actual prompt testing (identical). The content below was co-written by me and the coding agent that helped me prepare this PR.
Summary
Enable AVX-VNNI 256-bit vpdpbusd for Q4_K and Q5_K repacked (R4) matmul on CPUs that have VNNI but lack the full AVX-512 set (F+BW+DQ+VL+VNNI) required by HAVE_FANCY_SIMD. This covers Alder Lake, Raptor Lake, Meteor Lake, Sierra Forest and possibly other CPUs.
The change is two files, five hunks: four #ifdef HAVE_FANCY_SIMD guards relaxed to #ifdef HAVE_VNNI256 in the R4 matmul functions, plus a new macro definition.
What changed
ggml/src/iqk/iqk_config.h: New HAVE_VNNI256 macro, defined when __AVXVNNI__ or (__AVX512VNNI__ + __AVX512VL__). Separate from HAVE_FANCY_SIMD, which requires the full AVX-512 set.
ggml/src/iqk/iqk_gemm_kquants.cpp: Four #ifdef HAVE_FANCY_SIMD → #ifdef HAVE_VNNI256 in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k.
Why this is safe
_mm256_dpbusd_epi32 (VNNI) and _mm256_cvtepi8_epi32 (AVX2). No AVX-512 BW/DQ instructions.
cvtepi8 and cvtepu8 agree for 0..127, so 0..63 is safe.
HAVE_FANCY_SIMD-gated code.
Perplexity validation
Benchmarks
Test setup: i5-13500, 6 P-cores pinned @ 2.5 GHz, turbo off, HT off, E-cores offline. Release build, -DGGML_NATIVE=ON, 8 repetitions.
Q4_K_S, -rtr 1 (repacked, VNNI-optimized path)
Q4_K_S, -rtr 0 (non-repacked, control; no code change on this path)
Q4_K_M, -rtr 1 (repacked, VNNI-optimized path)
Q4_K_M, -rtr 0 (non-repacked, control; no code change on this path)
Text generation QA
To verify that the VNNI path produces correct output beyond perplexity, we ran llama-cli in non-interactive mode with a fixed seed (42) across 5 general knowledge prompts, 3 models (Qwen3.5-0.8B, Llama-3.2-1B, gemma-3-1b, all Q4_K_M), and both -rtr 0 and -rtr 1, for 30 comparisons in total.
All 30 upstream/VNNI output pairs are byte-identical (verified via SHA-256). Human review spot-checked the outputs to confirm the generated text is coherent and intelligible.
Prompts used