
Enable AVX-VNNI 256-bit path for Q3_K R4 matmul#1472

Merged
ikawrakow merged 1 commit into ikawrakow:main from accaldwell:ac/vnni-q3k-r4
Mar 23, 2026
Conversation

@accaldwell (Contributor) commented Mar 20, 2026

Enable VNNI (vpdpbusd and vpdpwssd) in the 256-bit path of mul_mat_q3_k_r4_q8_k. This relaxes the FANCY_SIMD gate to HAVE_VNNI256 and refactors the kernel to use an approach between the old FANCY and fallback modes.

Why not just ungate the existing FANCY_SIMD code? In performance profiling of token generation, doing so merely moved the compute bottleneck to vpmulld. Token generation isn't compute limited, but the approach in this PR is more efficient, and profiling shows the kernel is now even more dominated by vmovdqu (memory loads/stores). Note that this also changes how AVX-512 platforms run this kernel: it's possible the original code is better on AVX-512 CPUs and this PR will cause a regression there. I am willing to do AVX-512 testing if needed.

Prompt processing naturally benefits from this PR thanks to the now-familiar change of using fused VNNI instructions in place of the AVX2 alternatives.

Performance

Qwen3.5-2B, Q3_K_M, -rtr 1

q3km_sweep_comparison

QA

Qwen3.5-2B, Q3_K_M, --run-time-repack

Text generation on general knowledge questions was bit-identical. Perplexity was identical on a full run (13.8283 +/- 0.10343).

Details

More details below the fold (details section co-written with an agent)

Approach

A naive port of the HAVE_FANCY_SIMD path introduced a vpmulld bottleneck. The original AVX-512 path deferred float conversion and accumulated integer results using _mm256_mullo_epi32(iscales, sumi). On Raptor Lake P-cores, vpmulld consumed ~15% of the function's cycles, replacing the memory-move stall as the dominant bottleneck.

The hybrid approach taken here uses VNNI for the dot products only, while keeping the baseline's per-sub-block float accumulation strategy:

  • _mm256_maddubs_epi16 + _mm256_add_epi16 + _mm256_madd_epi16 (3 instructions) replaced by _mm256_dpbusd_epi32 (1 instruction)
  • _mm256_madd_epi16 + _mm256_add_epi32 replaced by _mm256_dpwssd_epi32 for bsums correction
  • No isum integer accumulator, no vpmulld, no deferred float conversion
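
The substitution in the first bullet can be illustrated with a scalar model of the per-dword-lane semantics (a sketch with hypothetical helper names, not code from the PR). The AVX2 chain saturates its 16-bit intermediates, which is safe here because the unsigned Q3_K operand is small after unpacking, so the two forms agree:

```cpp
#include <algorithm>
#include <cstdint>

// One 32-bit lane of the AVX2 chain: vpmaddubsw forms two saturated 16-bit
// sums of adjacent u8*s8 products, then vpmaddwd (against ones) widens and
// adds them into a 32-bit result.
inline int32_t dot4_avx2_chain(const uint8_t u[4], const int8_t s[4]) {
    auto sat16 = [](int32_t v) -> int16_t {
        return (int16_t)std::min(32767, std::max(-32768, v));
    };
    int16_t lo = sat16((int32_t)u[0]*s[0] + (int32_t)u[1]*s[1]);
    int16_t hi = sat16((int32_t)u[2]*s[2] + (int32_t)u[3]*s[3]);
    return (int32_t)lo + (int32_t)hi;
}

// One 32-bit lane of vpdpbusd: accumulate all four u8*s8 products directly
// into the 32-bit accumulator, with no 16-bit saturation step.
inline int32_t dot4_vnni(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    for (int i = 0; i < 4; ++i) acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}
```

With the unsigned operand bounded by the 3-bit quant range (at most 7 times an int8), the 16-bit partial sums cannot saturate, so one vpdpbusd per lane replaces the three-instruction chain exactly.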

PP perf profile (pp512, 6 threads)

| Metric | Baseline | Hybrid (VNNI256) | Delta |
|---|---|---|---|
| Throughput | 295 t/s | 319 t/s | +8.1% |
| Instructions | 658.9B | 610.2B | −7.4% |
| Cycles | 220.0B | 202.9B | −7.8% |
| IPC | 3.00 | 3.01 | ~same |

The speedup comes entirely from executing fewer instructions at the same IPC. The VNNI path replaces a 3-instruction dot product chain with a 2-instruction chain:

| Instruction | Baseline | Hybrid |
|---|---|---|
| vpmaddubsw | 32 | -- |
| vpaddw | 24 | -- |
| vpmaddwd | 40 | -- |
| vpdpbusd | -- | 32 |
| vpdpwssd | -- | 32 |
| Dot product total | 96 | 64 |

96 → 64 dot product instructions (−33%). Total function instructions: 820 → 779 (−5%). The float accumulation path (vcvtdq2ps, vfmadd*) is unchanged at 33 instructions each.

Instruction profile, prompt processing (mul_mat_q3_k_r4_q8_k, top instructions)

| Instruction | Baseline | Hybrid |
|---|---|---|
| vpdpbusd | -- | ~20% |
| vpshufd | ~14% | ~16% |
| vpmaddubsw | ~14% | -- |
| vpaddw | ~11% | -- |
| vfmadd213ps | ~9% | ~6% |
| vmovdqu (move) | ~5% | ~8% |
| vcvtdq2ps | ~6% | ~7% |
| vpmaddwd | ~6% | -- |
| vpsrlw | ~5% | ~6% |
| vmovaps | ~5% | ~4% |
| vpdpwssd | -- | ~3% |
| vmulps | ~3% | ~3% |
| vfmadd231ps | ~3% | ~2% |

In PP, vmovdqu is only ~5-8% of cycles (vs ~31-36% in TG), confirming the workload is compute-bound. The dot product chain (vpmaddubsw + vpaddw + vpmaddwd = ~31%) is replaced by vpdpbusd + vpdpwssd (~23%), directly reducing compute pressure.

Instruction profile, token generation (mul_mat_q3_k_r4_q8_k, top instructions)

| Instruction | Baseline | Hybrid |
|---|---|---|
| vmovdqu (move) | ~31% | ~36% |
| vpmaddubsw | ~10% | -- |
| vpmaddwd | ~10% | -- |
| vpdpbusd | -- | ~11% |
| vpmulld | -- | -- |

The hybrid path eliminates vpmaddubsw + vpmaddwd in favor of vpdpbusd, with no vpmulld bottleneck. The memory move (vmovdqu) is now more dominant, confirming the workload is cleanly memory-bound.

Why not a naive port of HAVE_FANCY_SIMD?

A direct HAVE_FANCY_SIMD -> HAVE_VNNI256 substitution was tested first. That path defers float conversion and uses _mm256_mullo_epi32(iscales, sumi) to accumulate integer results. On Raptor Lake P-cores, vpmulld consumed ~15% of the function's cycles, becoming the second hottest instruction after the memory move and replacing the memory stall as the compute bottleneck. On E-cores (Gracemont here) the problem was worse: vpmulld hit ~13% and overall throughput regressed by ~5% (20.96 vs 22.04 t/s).
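
A scalar analogue of the two accumulation strategies (hypothetical helper names; the real kernel works on 256-bit vectors, so each line below stands in for a vector op):

```cpp
#include <cstdint>

// Deferred-integer accumulation (the HAVE_FANCY_SIMD style): multiply each
// sub-block dot product by its integer scale (vpmulld in vector form) and
// convert to float once at the end.
inline float accumulate_deferred_int(const int32_t* sumi, const int32_t* iscales,
                                     int n_sub, float d) {
    int32_t isum = 0;
    for (int i = 0; i < n_sub; ++i)
        isum += iscales[i] * sumi[i];   // the vpmulld hotspot
    return d * (float)isum;             // single convert + multiply
}

// Per-sub-block float accumulation (the baseline/hybrid style): convert each
// sub-block result to float and fold the scale into an FMA, avoiding vpmulld.
inline float accumulate_float(const int32_t* sumi, const int32_t* iscales,
                              int n_sub, float d) {
    float facc = 0.0f;
    for (int i = 0; i < n_sub; ++i)
        facc += (float)iscales[i] * (float)sumi[i];  // vcvtdq2ps + vfmadd
    return d * facc;
}
```

Both styles give identical results while the per-term products stay within float's exact-integer range (below 2^24), which the small Q3_K scales and sums comfortably do.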

@accaldwell accaldwell marked this pull request as ready for review March 20, 2026 04:15
@accaldwell (Contributor, Author)

Unsurprisingly, this benefits anywhere Q3_K is found:

compare_all_quants

@ikawrakow (Owner)

> its possible the original code is better on AVX-512 CPUs and this PR will cause a regression.

It does indeed cause a regression. On my Ryzen-7950X CPU with Llama-3.1-8B quantized to Q3_K_S using -rtr:

Main branch

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 64 | 0 | 4.039 | 253.51 | 3.806 | 16.81 |
| 1024 | 64 | 1024 | 4.341 | 235.88 | 4.058 | 15.77 |
| 1024 | 64 | 2048 | 4.632 | 221.05 | 4.127 | 15.51 |
| 1024 | 64 | 3072 | 4.950 | 206.89 | 4.359 | 14.68 |
| 1024 | 64 | 4096 | 5.253 | 194.93 | 4.522 | 14.15 |
| 1024 | 64 | 5120 | 5.567 | 183.94 | 4.687 | 13.66 |
| 1024 | 64 | 6144 | 5.899 | 173.58 | 4.850 | 13.20 |
| 1024 | 64 | 7168 | 6.235 | 164.23 | 5.010 | 12.77 |
| 1024 | 64 | 8192 | 6.588 | 155.44 | 5.170 | 12.38 |

This PR

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 64 | 0 | 4.353 | 235.26 | 3.795 | 16.86 |
| 1024 | 64 | 1024 | 4.638 | 220.79 | 4.068 | 15.73 |
| 1024 | 64 | 2048 | 4.947 | 206.99 | 4.176 | 15.32 |
| 1024 | 64 | 3072 | 5.272 | 194.23 | 4.360 | 14.68 |
| 1024 | 64 | 4096 | 5.577 | 183.61 | 4.519 | 14.16 |
| 1024 | 64 | 5120 | 5.893 | 173.78 | 4.679 | 13.68 |
| 1024 | 64 | 6144 | 6.213 | 164.82 | 4.840 | 13.22 |
| 1024 | 64 | 7168 | 6.551 | 156.31 | 5.006 | 12.79 |
| 1024 | 64 | 8192 | 6.900 | 148.40 | 5.163 | 12.40 |

TG is within the noise (because it is memory-bandwidth bound). PP with this PR is ~8% lower. The HAVE_FANCY_SIMD implementation is not written this way by random chance.

So, I think you need to remove the changes to the HAVE_FANCY_SIMD path and have a separate path for HAVE_VNNI.

Just for fun, here is what one gets without rtr on the Ryzen-7950X:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 1024 | 64 | 0 | 2.570 | 398.44 | 3.788 | 16.89 |
| 1024 | 64 | 1024 | 2.863 | 357.62 | 3.964 | 16.15 |
| 1024 | 64 | 2048 | 3.152 | 324.89 | 4.100 | 15.61 |
| 1024 | 64 | 3072 | 3.455 | 296.42 | 4.242 | 15.09 |
| 1024 | 64 | 4096 | 3.744 | 273.49 | 4.387 | 14.59 |
| 1024 | 64 | 5120 | 4.048 | 252.94 | 4.535 | 14.11 |
| 1024 | 64 | 6144 | 4.377 | 233.96 | 4.681 | 13.67 |
| 1024 | 64 | 7168 | 4.714 | 217.22 | 4.825 | 13.26 |
| 1024 | 64 | 8192 | 5.056 | 202.55 | 4.974 | 12.87 |

@accaldwell (Contributor, Author)

Refactoring this to keep a FANCY path with the original code, instead of sending FANCY platforms to the new VNNI256 path, should be straightforward. I'll work on a v2.

@accaldwell accaldwell marked this pull request as draft March 20, 2026 09:20
Add a separate HAVE_VNNI256 code path using _mm256_dpwssd_epi32 and
_mm256_dpbusd_epi32 for the Q3_K R4 kernel. The existing HAVE_FANCY_SIMD
(AVX-512 VNNI) path is preserved unchanged.
@accaldwell (Contributor, Author) commented Mar 20, 2026

The initial version of this patch had a two-tier hierarchy: VNNI256 (including FANCY), falling back to AVX2. I changed the existing FANCY code in a way that worked well for me on VNNI256, but caused a regression on FANCY.

The new approach is three tier: FANCY keeps its original code, falling back to VNNI256, then to AVX2.

Even aside from the confirmed regression with my previous attempt, this feels like a better, more conservative change.

I benchmarked pp against 0c9bc3e with and without this patch: 248.51 ± 0.79 -> 273.70 ± 1.41 t/s on Qwen3.5-2B Q3_K_M. Text generation is a perfect match. I didn't test perplexity, but I'm happy to do so if needed.

@accaldwell accaldwell marked this pull request as ready for review March 20, 2026 15:49
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 21, 2026
@ikawrakow ikawrakow merged commit 47e4015 into ikawrakow:main Mar 23, 2026
