Enable AVX-VNNI 256-bit path for Q8_1 R8 dot product#1455
ikawrakow merged 1 commit into ikawrakow:main
Conversation
Ungate the VNNI path in mul_mat_q8_1_r8_q8_2 by changing the guard from HAVE_FANCY_SIMD to HAVE_VNNI256. This block only uses 256-bit intrinsics, so it is safe for AVX-VNNI (non-512) CPUs.
The main reason there are so many 256-bit variants is that most of this code was developed on a Ryzen 7950X CPU (Zen4 core). The Zen4 core executes 512-bit instructions as two 256-bit instructions, so there is no performance gain from using the 512-bit variants. At the same time, it often takes more work to pack the quantized data into 512-bit SIMD registers than into 256-bit ones. As a result, on the Zen4 core one often gets lower performance from 512-bit instructions.
One-liner ungating the VNNI path in mul_mat_q8_1_r8_q8_2 for HAVE_VNNI256. Covers the (default) non-repacked code path for Q4_K, Q5_K, Q4_1, and Q5_1. This gives a +3–21% pp improvement on my hardware.
There's only 256-bit code in this block, so we would expect the change to be safe. I tested perplexity and prompt processing anyway.
Thanks for your patience while I learn this code base. I realize now I should have just started with perf profiling, which immediately identified this matmul as the hottest spot for the Qwen3.5 models I'm using for testing.
Benchmark results
Note: Benchmark speeds are median-based, which is an experimental change independent of the VNNI code in this PR. No llama-bench changes are being proposed here.
-rtr 0 (default, no runtime repack) — pp512, -t 6

-rtr 1 (runtime repack) — pp512, -t 6

With -rtr 1, the Q4_K weights are repacked to Q4_K_R4 at load time and use a different matmul kernel (mul_mat_q4_k_r4_q8_k, already HAVE_VNNI256-enabled in PR #1446).

Quant type comparison (Qwen3.5-2B, -rtr 0) — pp512, -t 6

The speedup scales with the fraction of Q4_K/Q5_K weights in a model, as expected.
Benchmark test setup
Intel i5-13500, 6 P-cores pinned @ 2.5 GHz, turbo off, HT off, E-cores offline. Release build, -DGGML_NATIVE=ON, 8 reps per test.

Perplexity (Qwen3.5-0.8B Q4_K_S, wikitext-2, 580 chunks)