Enable AVX-VNNI 256-bit path for IQ4_NL R4 matmul #1467

Merged
ikawrakow merged 1 commit into ikawrakow:main from accaldwell:ac/vnni-iq4nl-r4-matmul
Mar 20, 2026

@accaldwell accaldwell commented Mar 19, 2026

IQ4_NL has a special kernel in repacked (R4) mode (mul_mat_iq4_nl_r4_q8_2).

It currently has a FANCY_SIMD path that requires AVX-512. This PR updates the fallback AVX2 path so that it conditionally uses VNNI instructions on CPUs that support AVX-VNNI.
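To illustrate what a VNNI path replaces, here is a minimal scalar sketch of one 32-bit lane, not the code from this PR. It assumes the AVX2 fallback uses the common `VPMADDUBSW` + `VPMADDWD` + `VPADDD` idiom, which the actual kernel may or may not do. The function names `lane_dpbusd`, `lane_avx2`, and `saturate_i16` are hypothetical.

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of VPDPBUSD (the AVX-VNNI instruction):
// four unsigned-byte x signed-byte products are summed in 32-bit precision
// and added to the accumulator, with no intermediate saturation.
inline int32_t lane_dpbusd(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i) {
        sum += static_cast<int32_t>(u[i]) * static_cast<int32_t>(s[i]);
    }
    return acc + sum;
}

// Saturate a 32-bit value to the int16 range, as VPMADDUBSW does
// for each pairwise sum.
inline int16_t saturate_i16(int32_t x) {
    if (x > 32767) return 32767;
    if (x < -32768) return -32768;
    return static_cast<int16_t>(x);
}

// Scalar model of the same lane using an assumed three-instruction AVX2 idiom:
//   VPMADDUBSW: multiply u8 x i8 pairs, add adjacent pairs, saturate to int16
//   VPMADDWD  : multiply by a vector of 1s, add adjacent int16 pairs to int32
//   VPADDD    : add into the 32-bit accumulator
inline int32_t lane_avx2(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    const int16_t lo = saturate_i16(u[0] * s[0] + u[1] * s[1]);
    const int16_t hi = saturate_i16(u[2] * s[2] + u[3] * s[3]);
    const int32_t widened = static_cast<int32_t>(lo) * 1 + static_cast<int32_t>(hi) * 1;
    return acc + widened;
}
```

The two models agree whenever the pairwise sums fit in int16, and differ only if `VPMADDUBSW` would saturate. In intrinsics terms, the three-step sequence corresponds to `_mm256_maddubs_epi16`, `_mm256_madd_epi16`, and `_mm256_add_epi32`, while the single-instruction form corresponds to `_mm256_dpbusd_epi32` (or `_mm256_dpbusd_avx_epi32` on AVX-VNNI targets).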

Benchmarks

Model: Qwen3.5-2B IQ4_NL, pp512

rtr 0 (control - different kernel that is already VNNI optimized)

| Build | t/s |
| --- | --- |
| Baseline | 271.10 ± 0.59 |
| PR | 270.35 ± 1.31 |

rtr 1 (runtime repack - uses the newly optimized kernel)

| Build | t/s |
| --- | --- |
| Baseline | 189.13 ± 0.37 |
| PR | 246.07 ± 1.98 |

This is roughly a 30% improvement (189.13 to 246.07 t/s), though rtr 0 is still faster on my hardware for this quant.

Text generation QA

Text generation with llama-cli across multiple prompts produces bit-identical results. Full perplexity against wikitext-2 is unchanged as well (13.1025 +/- 0.09740).

@accaldwell accaldwell marked this pull request as ready for review March 19, 2026 22:11
@ikawrakow ikawrakow merged commit a56a786 into ikawrakow:main Mar 20, 2026