
Enable AVX-VNNI 256-bit path for Q4_K and Q5_K R4 matmul#1446

Merged
ikawrakow merged 1 commit into ikawrakow:main from accaldwell:avx-vnni-q4k-q5k-r4
Mar 17, 2026
Conversation

@accaldwell (Contributor)

@accaldwell accaldwell commented Mar 17, 2026

Hey @ikawrakow, back with a much more targeted optimization for AVX-VNNI CPUs compared to #1435 (and a human-written PR).

This adds the HAVE_VNNI256 macro, which applies to any CPU with AVX-VNNI or AVX512-VNNI+VL. If the extra x86 flavor is undesirable maintenance-wise, let me know.

The only other change gates four places behind HAVE_VNNI256 instead of HAVE_FANCY_SIMD. I chose the Q4_K and Q5_K R4 (repacked) paths since they cover the popular Q4_K_M quants, and the repacked mode was faster on this hardware regardless of the changes here.

To the best of my understanding, this code path doesn't involve any AVX-512 instructions at all, so a build for AVX512-VNNI and a build for AVX-VNNI should differ only in speed.

Headline results for Q4_K_S and Q4_K_M are very good:

  • up to +13% pp on Q4_K_S with -rtr 1
  • up to +9% pp on Q4_K_M with -rtr 1 (it uses more Q6_K, so a smaller gain is expected)

My benchmark methodology was much improved as well, so I am confident this is a solid improvement.

Below are benchmark results, perplexity testing (identical), and actual prompt testing (identical). The below content was co-written by myself and the coding agent that helped me prepare this PR.

Summary

Enable AVX-VNNI 256-bit vpdpbusd for Q4_K and Q5_K repacked (R4) matmul on CPUs that have VNNI but lack the full AVX-512 set (F+BW+DQ+VL+VNNI) required by HAVE_FANCY_SIMD. This covers Alder Lake, Raptor Lake, Meteor Lake, Sierra Forest and possibly other CPUs.

The change is two files, five hunks — four #ifdef HAVE_FANCY_SIMD guards relaxed to #ifdef HAVE_VNNI256 in the R4 matmul functions, plus a new macro definition.

What changed

  • ggml/src/iqk/iqk_config.h: New HAVE_VNNI256 macro, defined when __AVXVNNI__ or (__AVX512VNNI__ + __AVX512VL__). Separate from HAVE_FANCY_SIMD which requires the full AVX-512 set.
  • ggml/src/iqk/iqk_gemm_kquants.cpp: Four #ifdef HAVE_FANCY_SIMD guards changed to #ifdef HAVE_VNNI256 in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k.
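A sketch of the new gate, assuming the feature-test macros implied by the description above (the exact placement and surrounding code in iqk_config.h may differ):

```cpp
// iqk_config.h (sketch): the 256-bit vpdpbusd form is usable either
// natively via AVX-VNNI, or via AVX512-VNNI when AVX512-VL provides the
// 256-bit encodings. HAVE_FANCY_SIMD keeps its stricter requirement on
// the full AVX-512 set (F+BW+DQ+VL+VNNI).
#if defined(__AVXVNNI__) || (defined(__AVX512VNNI__) && defined(__AVX512VL__))
#define HAVE_VNNI256
#endif
```

With this in place, the four relaxed guards simply read `#ifdef HAVE_VNNI256` where they previously read `#ifdef HAVE_FANCY_SIMD`.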

Why this is safe

  1. Only VNNI + AVX2 intrinsics used: _mm256_dpbusd_epi32 (VNNI) and _mm256_cvtepi8_epi32 (AVX2). No AVX-512 BW/DQ instructions.
  2. Scale range: Q4_K and Q5_K scales are 6-bit (0..63). cvtepi8 and cvtepu8 agree for 0..127, so 0..63 is safe.
  3. No int16 saturation risk: Q4 max pair sum = 3810, Q5 max pair sum = 7874 — both within int16 range.
  4. Self-contained: Integer results convert to float and store at function end. No integer intermediates leak to downstream HAVE_FANCY_SIMD-gated code.
  5. Bit-identical perplexity — confirmed on wikitext-2-raw (see below).

Perplexity validation

| Model | Upstream PPL | VNNI PPL | Delta |
|---|---|---|---|
| Qwen3.5-0.8B Q4_K_S | 19.5512 ± 0.159 | 19.5512 ± 0.159 | 0.0000 |
| Llama-3.2-1B Q4_K_S | 14.5834 ± 0.109 | 14.5834 ± 0.109 | 0.0000 |
| Qwen3.5-0.8B Q4_K_M | 19.3688 ± 0.157 | 19.3688 ± 0.157 | 0.0000 |
| Llama-3.2-1B Q4_K_M | 14.4943 ± 0.109 | 14.4943 ± 0.109 | 0.0000 |

Benchmarks

Test setup: i5-13500, 6 P-cores pinned @ 2.5 GHz, turbo off, HT off, E-cores offline. Release build, -DGGML_NATIVE=ON, 8 repetitions.

Q4_K_S — -rtr 1 (repacked, VNNI-optimized path)

| Model | Test | Upstream (t/s) | VNNI (t/s) | Change |
|---|---|---|---|---|
| Qwen3.5-0.8B | pp512 | 378.82 ± 15.47 | 427.84 ± 4.73 | +12.9% |
| Qwen3.5-0.8B | tg128 | 52.66 ± 0.25 | 52.66 ± 0.28 | 0.0% |
| Llama-3.2-1B | pp512 | 263.52 ± 5.73 | 285.62 ± 4.16 | +8.4% |
| Llama-3.2-1B | tg128 | 45.43 ± 0.21 | 45.57 ± 0.25 | +0.3% |
| gemma-3-1b | pp512 | 292.37 ± 1.79 | 296.17 ± 4.22 | +1.3% |
| gemma-3-1b | tg128 | 42.75 ± 0.19 | 43.08 ± 0.15 | +0.8% |

Q4_K_S — -rtr 0 (non-repacked, control — no code change on this path)

| Model | Test | Upstream (t/s) | VNNI (t/s) | Change |
|---|---|---|---|---|
| Qwen3.5-0.8B | pp512 | 360.93 ± 8.63 | 377.83 ± 7.32 | +4.7% |
| Qwen3.5-0.8B | tg128 | 52.81 ± 0.32 | 52.60 ± 0.45 | -0.4% |
| Llama-3.2-1B | pp512 | 234.36 ± 9.16 | 241.50 ± 1.54 | +3.0% |
| Llama-3.2-1B | tg128 | 45.29 ± 0.24 | 45.59 ± 0.51 | +0.7% |
| gemma-3-1b | pp512 | 277.34 ± 1.82 | 272.86 ± 5.67 | -1.6% |
| gemma-3-1b | tg128 | 41.77 ± 0.16 | 40.86 ± 1.31 | -2.2% |

Q4_K_M — -rtr 1 (repacked, VNNI-optimized path)

| Model | Test | Upstream (t/s) | VNNI (t/s) | Change |
|---|---|---|---|---|
| Qwen3.5-0.8B | pp512 | 379.27 ± 9.97 | 414.08 ± 8.51 | +9.2% |
| Qwen3.5-0.8B | tg128 | 50.50 ± 0.33 | 50.25 ± 0.23 | -0.5% |
| Llama-3.2-1B | pp512 | 253.44 ± 4.18 | 269.40 ± 3.27 | +6.3% |
| Llama-3.2-1B | tg128 | 43.18 ± 0.11 | 43.08 ± 0.34 | -0.2% |
| gemma-3-1b | pp512 | 285.70 ± 3.08 | 286.00 ± 4.34 | +0.1% |
| gemma-3-1b | tg128 | 41.26 ± 0.27 | 41.37 ± 0.16 | +0.3% |

Q4_K_M — -rtr 0 (non-repacked, control — no code change on this path)

| Model | Test | Upstream (t/s) | VNNI (t/s) | Change |
|---|---|---|---|---|
| Qwen3.5-0.8B | pp512 | 364.34 ± 7.72 | 353.98 ± 12.58 | -2.8% |
| Qwen3.5-0.8B | tg128 | 50.19 ± 0.47 | 49.23 ± 0.84 | -1.9% |
| Llama-3.2-1B | pp512 | 226.44 ± 3.62 | 225.86 ± 3.66 | -0.3% |
| Llama-3.2-1B | tg128 | 41.57 ± 0.54 | 41.90 ± 0.53 | +0.8% |
| gemma-3-1b | pp512 | 271.59 ± 4.18 | 265.36 ± 4.81 | -2.3% |
| gemma-3-1b | tg128 | 39.39 ± 0.69 | 40.04 ± 0.81 | +1.7% |

Text generation QA

To verify that the VNNI path produces correct output beyond perplexity, we ran llama-cli in non-interactive mode with a fixed seed (42) across 5 general knowledge prompts, 3 models (Qwen3.5-0.8B, Llama-3.2-1B, gemma-3-1b, all Q4_K_M), and both -rtr 0 and -rtr 1 — 30 comparisons total.

All 30 upstream/VNNI output pairs are byte-identical (verified via SHA-256). Human review spot-checked the outputs to confirm the generated text is coherent and intelligible.

Prompts used
  1. "What is the difference between Celsius and Kelvin?"
  2. "Explain why the sky appears blue during the day."
  3. "What causes tides in the ocean?"
  4. "How does a battery store and release energy?"
  5. "Why do we have leap years?"

Add new CPU macro HAVE_VNNI256 for CPUs with 256-bit VNNI
(AVX-VNNI) support or better (AVX512-VNNI+VL), separate from
HAVE_FANCY_SIMD which requires the full AVX-512 set. Relax four
#ifdef guards in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k
to use HAVE_VNNI256 instead of HAVE_FANCY_SIMD, enabling vpdpbusd
and cvtepi8_epi32 on Alder Lake, Raptor Lake, and similar CPUs.
@accaldwell accaldwell marked this pull request as ready for review March 17, 2026 05:33
Owner

@ikawrakow ikawrakow left a comment

Thank you. This is much easier!

