Enable AVX-VNNI 256-bit path for IQ3_XXS and IQ3_S R4 matmul #1474

Merged
ikawrakow merged 1 commit into ikawrakow:main from accaldwell:ac/vnni_iq3_r4 on Mar 20, 2026

@accaldwell (Contributor) commented Mar 20, 2026

This PR adds HAVE_VNNI256-optimized paths for the mul_mat_iq3_xxs_r4_q8_k and mul_mat_iq3_s_r4_q8_k kernels.

The approach is very similar to some of my previous PRs: add an optimized dpbusd path and use it conditionally in place of the multi-instruction AVX2 alternative.
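For readers unfamiliar with the trade-off, here is a minimal scalar sketch of what the two paths compute for a single 32-bit lane. It is not the kernel code from this PR: the function names `lane_dpbusd`, `lane_avx2`, and `sat16` are hypothetical, and the AVX2 sequence shown (`_mm256_maddubs_epi16`, then `_mm256_madd_epi16` with a vector of ones, then `_mm256_add_epi32`) is the common idiom, assumed here to be the "multi-instruction alternative".

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of dpbusd (hypothetical helper name):
// acc += sum of four (unsigned 8-bit * signed 8-bit) products,
// with no intermediate 16-bit saturation.
static int32_t lane_dpbusd(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(u[i]) * int32_t(s[i]);
    }
    return acc;
}

// Clamp a 32-bit value to the int16 range, as maddubs does for each pair sum.
static int16_t sat16(int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return int16_t(v);
}

// Scalar model of the same lane on the assumed three-instruction AVX2 path:
//   1. maddubs: adjacent u8*s8 products summed into a saturating int16
//   2. madd with ones: adjacent int16 values summed into an int32
//   3. add: accumulate into the running sum
static int32_t lane_avx2(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    const int16_t lo = sat16(int32_t(u[0]) * s[0] + int32_t(u[1]) * s[1]);
    const int16_t hi = sat16(int32_t(u[2]) * s[2] + int32_t(u[3]) * s[3]);
    const int32_t widened = int32_t(lo) + int32_t(hi);
    return acc + widened;
}
```

The two models agree whenever each pair sum fits in int16; they diverge only when the intermediate saturates (for example, all inputs at 255 and 127). With matching results in the non-saturating range, the single fused instruction replaces three.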

The FANCY path is handled separately, above each of the new VNNI code blocks, and shouldn't reach the new code.

Performance

Sweep bench (Qwen3.5-2B IQ3_XS, 6 P-core threads, --run-time-repack, N_KV 0-16384):

[Figure: sweep bench plot]

A small but solid boost to prompt processing (pp), and the data suggests a tiny increase in token generation (tg) as well.

QA

  • Token generation: identical output across 4 prompts (deterministic, seed=42)
  • Perplexity: identical, 14.0944 +/- 0.10305

@accaldwell accaldwell marked this pull request as ready for review March 20, 2026 09:38
@ikawrakow ikawrakow merged commit ac4d6b9 into ikawrakow:main Mar 20, 2026