
Enable AVX-VNNI 256-bit path for Q8_1 R8 dot product#1455

Merged
ikawrakow merged 1 commit into ikawrakow:main from accaldwell:pr/vnni-q8_1-r8-dot
Mar 18, 2026

Conversation

@accaldwell
Contributor

One-liner ungating the VNNI path in mul_mat_q8_1_r8_q8_2 for HAVE_VNNI256. It covers the (default, non-repacked) code path for Q4_K, Q5_K, Q4_1, and Q5_1, and gives a +3–21% prompt-processing (pp) improvement on my hardware.

There's only 256-bit code in this block, so the change should be safe. I verified perplexity and ran prompt tests anyway.
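The change can be sketched as the following guard pattern. This is only a sketch: the macro definitions shown here are my assumptions about how the guards are derived; only the names HAVE_FANCY_SIMD and HAVE_VNNI256 are taken from the PR.

```c
/* Sketch of the assumed guard macros; not verbatim from the codebase. */
#if defined(__AVX512VNNI__) && defined(__AVX512VL__)
    /* AVX512-VNNI + VL: both 512-bit and 256-bit vpdpbusd forms exist. */
    #define HAVE_FANCY_SIMD 1
    #define HAVE_VNNI256    1
#elif defined(__AVXVNNI__)
    /* AVX-VNNI (e.g. Raptor Lake): 256-bit VNNI without any AVX512. */
    #define HAVE_VNNI256    1
#endif

/* The one-liner: the mul_mat_q8_1_r8_q8_2 block uses only 256-bit
   intrinsics, so its guard changes from HAVE_FANCY_SIMD to HAVE_VNNI256. */
```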

Thanks for your patience while I learn this codebase. I realize now I should have started with perf profiling, which immediately identified this matmul as the hottest spot for the Qwen3.5 models I'm using for testing.

Benchmark results

Note: benchmark speeds are median-based, which is an experimental change independent of the VNNI code in this PR. No llama-bench changes are being proposed here.

-rtr 0 (default, no runtime repack) — pp512, -t 6

| Model | Baseline (median t/s) | PR (median t/s) | Change |
|---|---|---|---|
| Llama-3.2-1B Q4_K_M | 239.86 ± 6.00 | 283.48 ± 2.89 | +18.2% |
| gemma-3-1b Q4_K_S | 274.60 ± 2.38 | 283.63 ± 2.85 | +3.3% |
| Qwen3.5-0.8B Q4_K_S | 379.36 ± 2.56 | 456.19 ± 3.56 | +20.3% |
| Qwen3.5-9B Q4_K_S | 34.96 ± 0.45 | 42.43 ± 0.35 | +21.4% |
| Qwen3.5-35B-A3B Q4_K_S (MoE) | 68.91 ± 0.31 | 72.71 ± 0.20 | +5.5% |

-rtr 1 (runtime repack) — pp512, -t 6

| Model | Baseline (median t/s) | PR (median t/s) | Change |
|---|---|---|---|
| Llama-3.2-1B Q4_K_M | 271.86 ± 1.03 | 274.76 ± 2.17 | +1.1% |
| gemma-3-1b Q4_K_S | 299.39 ± 0.42 | 300.68 ± 1.41 | +0.4% |
| Qwen3.5-0.8B Q4_K_S | 426.88 ± 0.52 | 427.18 ± 4.04 | +0.1% |
| Qwen3.5-9B Q4_K_S | 40.14 ± 0.42 | 41.92 ± 0.62 | +4.4% |
| Qwen3.5-35B-A3B Q4_K_S (MoE) | 76.33 ± 0.61 | 76.22 ± 0.24 | -0.1% |

With -rtr 1, the Q4_K weights are repacked to Q4_K_R4 at load time and use a different matmul kernel (mul_mat_q4_k_r4_q8_k), which was already HAVE_VNNI256-enabled in PR #1446.

Quant type comparison (Qwen3.5-2B, -rtr 0) — pp512, -t 6

| Quant | Baseline (median t/s) | PR (median t/s) | Change |
|---|---|---|---|
| Q2_K | 145.90 ± 0.92 | 151.55 ± 0.68 | +3.9% |
| Q4_K_M | 155.94 ± 1.52 | 175.47 ± 1.44 | +12.5% |
| Q5_K_M | 152.30 ± 0.71 | 174.31 ± 0.26 | +14.4% |
| Q8_0 | 135.17 ± 0.64 | 133.67 ± 0.81 | -1.1% |

The speedup scales with the fraction of Q4_K/Q5_K weights in a model as expected.

Benchmark test setup

Intel i5-13500, 6 P-cores pinned at 2.5 GHz, turbo off, HT off, E-cores offline. Release build with -DGGML_NATIVE=ON, 8 repetitions per test.
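A setup along these lines can be sketched as follows. This is only a sketch of the kind of commands involved; the sysfs paths, core numbering, model path, and llama-bench flags are assumptions, not the exact commands used for these benchmarks.

```sh
# Sketch only: core IDs assume the 8 E-cores of an i5-13500 with HT off
# are CPUs 6-13; verify with lscpu before running anything like this.
for c in $(seq 6 13); do
    echo 0 | sudo tee /sys/devices/system/cpu/cpu$c/online
done
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Pin the benchmark to the 6 P-cores (paths/flags are assumptions):
taskset -c 0-5 ./llama-bench -m model.gguf -p 512 -n 0 -t 6 -r 8
```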

Perplexity (Qwen3.5-0.8B Q4_K_S, wikitext-2, 580 chunks)

| Build | PPL |
|---|---|
| Baseline | 19.5512 ± 0.159 |
| PR | 19.5512 ± 0.159 |

Ungate the VNNI path in mul_mat_q8_1_r8_q8_2 by changing the
guard from HAVE_FANCY_SIMD to HAVE_VNNI256. This block only uses
256-bit intrinsics so it is safe for AVX-VNNI (non-512) CPUs.
@accaldwell accaldwell marked this pull request as ready for review March 18, 2026 07:39
@ikawrakow
Owner

Btw, when you are looking for potential uses of AVX512 intrinsics in the parts that prepare the block scales, it is not just a matter of whether 512-bit or 256-bit instructions are used. Implementations that do not use 512-bit instructions but still sit within HAVE_FANCY_SIMD guards may use 256-bit variants of one of the AVX512 extensions.

The main reason there are so many 256-bit variants is that most of this was developed on a Ryzen-7950X CPU (Zen4 core). The Zen4 core executes 512-bit instructions as two 256-bit instructions, so there is no performance gain from using the 512-bit variants. At the same time, it often takes more work to pack the quantized data into 512-bit SIMD registers than into 256-bit ones, so on the Zen4 core one often gets lower performance from 512-bit instructions. The HAVE_FANCY_SIMD 256-bit paths use the VNNI/VL extensions for the dot products, but they may also use 256-bit instructions that are not available on AVX2 (masked instructions, shifts with 16-bit granularity, shuffles, etc.).

@ikawrakow ikawrakow merged commit 8ccb4f8 into ikawrakow:main Mar 18, 2026
