Enable AVX-VNNI 256-bit path for Q3_K R4 matmul #1472
ikawrakow merged 1 commit into ikawrakow:main
Conversation
It does indeed cause a regression. On my Ryzen-7950X CPU with LLaMA-3.1-8B:

[main branch vs. this PR benchmark tables lost in extraction]

TG is within the noise (because memory bandwidth bound). PP with this PR is ~8% lower. So, I think you need to remove the changes to the … Just for fun, here is what one gets without …
Refactoring this to have a FANCY path with the original code, instead of sending those builds to the new VNNI256 path, should be straightforward; I'll work on a v2.
Add a separate `HAVE_VNNI256` code path using `_mm256_dpwssd_epi32` and `_mm256_dpbusd_epi32` for the Q3_K R4 kernel. The existing `HAVE_FANCY_SIMD` (AVX-512 VNNI) path is preserved unchanged.
Force-pushed from 598c048 to 6e6a98f
The initial version of this patch had a two-tier hierarchy: VNNI256 (including FANCY), falling back to AVX2. I changed the existing FANCY code in a way that worked well for me on VNNI256, but caused a regression on FANCY. The new approach is three-tier: FANCY keeps its original code, falling back to VNNI256, then to AVX2. Even aside from the confirmed regression with my previous attempt, this feels like a better, more conservative change. I benchmarked pp against 0c9bc3e with and without this patch: 248.51 ± 0.79 -> 273.70 ± 1.41 on Qwen3.5-2B Q3_K_M. Text generation is a perfect match. I didn't test perplexity, but I'm happy to do so if needed.

Enable the VNNI instructions (`vpdpbusd` and `vpdpwssd`) on 256-bit vectors in `mul_mat_q3_k_r4_q8_k`. This relaxes FANCY_SIMD to HAVE_VNNI256 and refactors it to use a method between the old FANCY and fallback modes.

Why not just ungate the existing FANCY_SIMD code? In performance profiling of token generation, that just moved the compute bottleneck to `vpmulld`. Token generation isn't compute-limited, but the PR approach is more efficient, and performance profiling shows it is now even more dominated by `vmovdqu` (memory loads/stores). It's important to note this also changes how AVX-512 platforms run this kernel: it's possible the original code is better on AVX-512 CPUs and this PR will cause a regression there. I am willing to do AVX-512 testing if needed.

Prompt processing naturally benefits from this PR due to the now familiar change of using fused VNNI instructions in place of the AVX2 alternatives.
Performance
Qwen3.5-2B, Q3_K_M, -rtr 1
QA
Qwen3.5-2B, Q3_K_M, --run-time-repack
Text generations on general-knowledge questions were bit-identical. Perplexity was identical on a full run (13.8283 +/- 0.10343).
Details
More details below the fold (details section co-written with an agent)
Approach
A naive port of the `HAVE_FANCY_SIMD` path introduced a `vpmulld` bottleneck. The original AVX-512 path deferred float conversion and accumulated integer results using `_mm256_mullo_epi32(iscales, sumi)`. On Raptor Lake P-cores, `vpmulld` consumed ~15% of the function's cycles, replacing the memory-move stall as the dominant bottleneck.

The hybrid approach taken here uses VNNI for the dot products only, while keeping the baseline's per-sub-block float accumulation strategy:
- `_mm256_maddubs_epi16` + `_mm256_add_epi16` + `_mm256_madd_epi16` (3 instructions) replaced by `_mm256_dpbusd_epi32` (1 instruction)
- `_mm256_madd_epi16` + `_mm256_add_epi32` replaced by `_mm256_dpwssd_epi32` for the bsums correction
- No `isum` integer accumulator, no `vpmulld`, no deferred float conversion

PP perf profile (pp512, 6 threads)
The speedup comes entirely from executing fewer instructions at the same IPC. The VNNI path replaces a 3-instruction dot product chain with a 2-instruction chain:
96 → 64 dot product instructions (−33%). Total function instructions: 820 → 779 (−5%). The float accumulation path (`vcvtdq2ps`, `vfmadd*`) is unchanged at 33 instructions each.
Instruction profile, prompt processing (mul_mat_q3_k_r4_q8_k, top instructions)
In PP, `vmovdqu` is only ~5-8% of cycles (vs ~31-36% in TG), confirming the workload is compute-bound. The dot product chain (`vpmaddubsw` + `vpaddw` + `vpmaddwd` = ~31%) is replaced by `vpdpbusd` + `vpdpwssd` (~23%), directly reducing compute pressure.

Instruction profile, token generation (mul_mat_q3_k_r4_q8_k, top instructions)
The hybrid path eliminates `vpmaddubsw` + `vpmaddwd` in favor of `vpdpbusd`, with no `vpmulld` bottleneck. The memory move (`vmovdqu`) is now more dominant, confirming the workload is cleanly memory-bound.

Why not a naive port of HAVE_FANCY_SIMD?
A direct `HAVE_FANCY_SIMD` -> `HAVE_VNNI256` substitution was tested first. That path defers float conversion and uses `_mm256_mullo_epi32(iscales, sumi)` to accumulate integer results. On Raptor Lake P-cores, `vpmulld` consumed ~15% of the function's cycles, becoming the second hottest instruction after the memory move and replacing the memory stall as the compute bottleneck. On E-cores (Gracemont here) the problem was worse: `vpmulld` hit ~13% and overall throughput regressed by ~5% (20.96 vs 22.04 t/s).