Enable AVX-VNNI 256-bit path for Q8_1 R8 dot product#1455
ikawrakow merged 1 commit into ikawrakow:main
Conversation
Ungate the VNNI path in mul_mat_q8_1_r8_q8_2 by changing the guard from HAVE_FANCY_SIMD to HAVE_VNNI256. This block only uses 256-bit intrinsics, so it is safe for AVX-VNNI (non-512) CPUs.
The main reason there are so many 256-bit variants is that most of this code was developed on a Ryzen 7950X CPU (Zen4 core). The Zen4 core executes 512-bit instructions as two 256-bit instructions, so there is no performance gain from using the 512-bit variants. At the same time, it often takes more work to pack the quantized data into 512-bit SIMD registers than into 256-bit ones. As a result, on the Zen4 core one often gets lower performance from 512-bit instructions.
One-liner ungating the VNNI path in mul_mat_q8_1_r8_q8_2 for HAVE_VNNI256. Covers the (default) non-repacked code path for Q4_K, Q5_K, Q4_1, and Q5_1. This gives a +3–21% pp improvement on my hardware.
There's only 256-bit code in this block, so we would expect the change to be safe. I tested perplexity and prompt processing anyway.
Thanks for your patience while I learn this code base. I realize now I should have just started with perf profiling, which immediately identified this matmul as the hottest spot for the Qwen3.5 models I'm using for testing.
Benchmark results
Note: Benchmark speeds are median-based, which is an experimental change independent of the VNNI code in this PR. No llama-bench changes are being proposed here.
-rtr 0 (default, no runtime repack) — pp512, -t 6

-rtr 1 (runtime repack) — pp512, -t 6

With -rtr 1, the Q4_K weights are repacked to Q4_K_R4 at load time and use a different matmul kernel (mul_mat_q4_k_r4_q8_k, already HAVE_VNNI256-enabled in PR #1446).

Quant type comparison (Qwen3.5-2B, -rtr 0) — pp512, -t 6

The speedup scales with the fraction of Q4_K/Q5_K weights in a model, as expected.
Benchmark test setup
Intel i5-13500, 6 P-cores pinned @ 2.5 GHz, turbo off, HT off, E-cores offline. Release build, -DGGML_NATIVE=ON, 8 reps per test.

Perplexity (Qwen3.5-0.8B Q4_K_S, wikitext-2, 580 chunks)