Add optimization report for Granite Hybrid Q4_K_M on AVX2 CPUs#5

Open
dillon-blake wants to merge 6 commits into master from
claude/optimize-ml-kernels-PF3Ng
Conversation

@dillon-blake

Detailed analysis of llama.cpp kernel implementations for the Granite 4 Hybrid (Mamba2+Attention+MoE) model with Q4_K_M quantization targeting AMD64 laptop CPUs with AVX2.

Three proposals with testing plans:

  1. Software prefetching for Q4_K dot product kernels (est. +5-10% prefill)
  2. SIMD vectorization of scalar SSM convolution kernel (est. +3-6% prefill)
  3. Cache-aligned tensor allocation + repacked GEMV prefetch (est. +5-10% prefill)

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy

Complete rewrite covering both prefill and decode paths with five concrete
changes across four files. Key finding: decode uses standard vec_dot (not
repacked GEMV) and is dominated by SSM state bandwidth, not weight access.

Changes proposed:
1. Q4_K vec_dot prefetch (decode: +5-8%)
2. Repacked GEMV/GEMM prefetch (prefill: +5-10%)
3. SSM scan state prefetch (decode: +8-15%)
4. SSM conv AVX2 vectorization (both: +3-5%)
5. TENSOR_ALIGNMENT 32->64 (both: +1-3%)

Includes 6-phase testing plan and Granite Hybrid architecture analysis.
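The prefetch idea behind changes 1-3 can be illustrated with a minimal sketch. The struct fields and block sizes below are simplified stand-ins for llama.cpp's `block_q4_K`/`block_q8_K` superblocks (the real ones pack scales, mins, and nibbles), and the inner arithmetic is a scalar placeholder; only the lookahead scheduling mirrors the proposed kernel change.

```c
#include <stddef.h>

/* Hypothetical simplified superblocks; real Q4_K/Q8_K blocks differ. */
typedef struct { float d; unsigned char qs[128]; } blk_w; /* weights  */
typedef struct { float d; signed char   qs[128]; } blk_a; /* activations */

/* Dot-product loop issuing software prefetches two superblocks ahead,
 * the pattern proposed for ggml_vec_dot_q4_K_q8_K. */
static float dot_with_prefetch(const blk_w *w, const blk_a *a, int nb) {
    float sum = 0.0f;
    for (int i = 0; i < nb; ++i) {
        if (i + 2 < nb) {
            /* read-only hint, low temporal locality: data is streamed once */
            __builtin_prefetch(&w[i + 2], 0, 1);
            __builtin_prefetch(&a[i + 2], 0, 1);
        }
        float s = 0.0f;
        for (int j = 0; j < 128; ++j)
            s += (float)(w[i].qs[j] & 0x0F) * (float)a[i].qs[j];
        sum += w[i].d * a[i].d * s;
    }
    return sum;
}
```

A lookahead of two blocks gives the hardware roughly one superblock's worth of compute to hide the memory latency; the right distance is workload-dependent and would need benchmarking.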

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
The GGUF file format only guarantees 32-byte alignment (GGUF_DEFAULT_ALIGNMENT
in gguf.h:46). Increasing TENSOR_ALIGNMENT to 64 would therefore trigger
assertion failures at ggml-backend.cpp:2268 for any mmap'd model, and mmap
is the default loading path.

KV cache and Mamba state save/load are unaffected by any of the remaining
4 changes (prefetch + SSM conv SIMD) since state serialization writes raw
tensor values, not memory layouts.
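The alignment conflict reduces to simple modular arithmetic. The helpers below are illustrative (not llama.cpp functions): GGUF pads each tensor's file offset up to the file's alignment, so a 32-byte-padded offset is only 64-byte aligned half the time.

```c
#include <stdint.h>

/* GGUF-style padding: round an offset up to the file's alignment
 * (default 32). Illustrative helper, not a llama.cpp API. */
static uint64_t gguf_pad(uint64_t offset, uint64_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

/* The kind of check a backend alignment assertion performs. */
static int is_aligned(uint64_t addr, uint64_t alignment) {
    return addr % alignment == 0;
}
```

For example, an offset padded to 96 satisfies the 32-byte check but fails a 64-byte one, which is exactly the failure mode a raised TENSOR_ALIGNMENT would hit on mmap'd files.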

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
Four targeted changes for Granite 4 Hybrid (Mamba2+Attention+MoE)
Q4_K_M inference on AMD64 CPUs with AVX2:

1. Q4_K vec_dot prefetch (quants.c): Prefetch weight+activation blocks
   2 iterations ahead in ggml_vec_dot_q4_K_q8_K. This is the decode
   matmul path. Mirrors existing Q4_0 prefetch pattern.

2. Repacked GEMV/GEMM prefetch (repack.cpp): Prefetch next Q4_Kx8
   block (header + first 4 cache lines of qs) in both
   ggml_gemv_q4_K_8x8_q8_K and ggml_gemm_q4_K_8x8_q8_K. This is the
   prefill matmul path.

3. SSM scan state prefetch (ops.cpp): Prefetch state arrays 4 rows
   ahead and B/C vectors at head boundaries in ssm_scan_f32. Targets
   the ~384KB/layer state streaming that dominates Mamba2 decode.

4. SSM conv AVX2 vectorization (ops.cpp): Replace scalar d_inner loop
   with AVX2 FMA processing 8 rows at a time in ssm_conv_f32. The
   kernel was entirely unvectorized. Scalar remainder handles non-8
   aligned dimensions.
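The row-wise vectorization in change 4 can be sketched as follows. This assumes a layout where, for each tap t, the per-row samples and per-row kernel weights are contiguous (`x[t*nr + r]`, `w[t*nr + r]`); the real ggml tensor layout differs, so this only illustrates the 8-rows-per-iteration FMA scheme with a scalar remainder.

```c
#include <stddef.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Convolve nr rows with their nc-tap kernels, 8 rows per iteration on
 * AVX2+FMA, falling back to (and finishing with) a scalar loop. */
static void ssm_conv_rows(const float *x, const float *w, float *y,
                          int nr, int nc) {
    int r = 0;
#if defined(__AVX2__) && defined(__FMA__)
    for (; r + 8 <= nr; r += 8) {
        __m256 acc = _mm256_setzero_ps();
        for (int t = 0; t < nc; ++t) {
            __m256 xs = _mm256_loadu_ps(&x[(size_t)t * nr + r]);
            __m256 ws = _mm256_loadu_ps(&w[(size_t)t * nr + r]);
            acc = _mm256_fmadd_ps(ws, xs, acc); /* acc += w * x */
        }
        _mm256_storeu_ps(&y[r], acc);
    }
#endif
    /* scalar remainder: rows not divisible by 8 (or no AVX2 at all) */
    for (; r < nr; ++r) {
        float s = 0.0f;
        for (int t = 0; t < nc; ++t)
            s += w[(size_t)t * nr + r] * x[(size_t)t * nr + r];
        y[r] = s;
    }
}
```

Because the remainder loop picks up wherever the vector loop stopped, the same function handles non-multiple-of-8 row counts and non-AVX2 builds.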

Test results: SSM_CONV 27/27, SSM_SCAN 3/3, MUL_MAT 1009/1009 passed.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
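The state prefetch in change 3 amounts to hinting the next few state rows while the current one is being updated, so the large per-layer state streams into cache ahead of use. The layout, names, and the simplified `s = s*dA + dB*x` update below are illustrative, not ggml's actual ssm_scan_f32 code.

```c
#include <stddef.h>

/* Update an nrows x ncols recurrent state in place, prefetching
 * LOOKAHEAD rows ahead (read-write hint, low temporal locality). */
static void ssm_scan_row_update(float *state, const float *dA,
                                const float *dBx, int nrows, int ncols) {
    enum { LOOKAHEAD = 4 }; /* rows ahead to prefetch */
    for (int r = 0; r < nrows; ++r) {
        if (r + LOOKAHEAD < nrows)
            __builtin_prefetch(&state[(size_t)(r + LOOKAHEAD) * ncols], 1, 1);
        float *row = &state[(size_t)r * ncols];
        for (int c = 0; c < ncols; ++c)
            row[c] = row[c] * dA[r] + dBx[r]; /* simplified SSM update */
    }
}
```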
Implement nrc=2 path in ggml_vec_dot_q4_K_q8_K that processes two
weight rows against the same activation vector simultaneously. This
shares Q8_K activation loads across both rows, reducing load port
pressure by ~33% (4 loads per 2 rows vs 6 loads with separate calls).

The inner j=0..3 loop is fully unrolled to eliminate branch overhead
and allow better register scheduling across sub-blocks.

Enable nrows=2 in type_traits_cpu for Q4_K on AVX2 (previously only
ARM MATMUL_INT8 had multi-row support).
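The load-sharing idea is easiest to see in a scalar stand-in: one pass over the shared activation vector feeds both weight rows, so each activation element is loaded once instead of once per row. The function name and plain-float types are illustrative; the real kernel operates on quantized blocks with AVX2.

```c
/* Compute two dot products against one shared activation vector. */
static void vec_dot_two_rows(const float *w0, const float *w1,
                             const float *act, int n,
                             float *out0, float *out1) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; ++i) {
        float a = act[i]; /* single shared load serves both rows */
        s0 += w0[i] * a;
        s1 += w1[i] * a;
    }
    *out0 = s0;
    *out1 = s1;
}
```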

All tests pass: MUL_MAT 1009/1009, SSM_CONV 27/27, SSM_SCAN 3/3.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
…e on AVX2

The nrc==2 calling convention produces a 2x2 output tile (4 dot products:
2 weight rows × 2 activation columns), as required by the ARM MMLA
instruction. On x86 AVX2 with only 16 ymm registers, computing 4
simultaneous dot products causes massive register spills, making it
slower than the baseline nrc==1 path.

Revert nrows to 1 for Q4_K on x86 and remove the incorrect nrc==2
kernel. Keep nrows=2 for ARM MMLA where the hardware natively supports
2x2 tile computation.

All other enhancements (prefetch in vec_dot/GEMV/GEMM, SSM conv AVX2,
SSM scan prefetch) remain unchanged.

Tests: 1009/1009 MUL_MAT, 30/30 SSM_CONV/SSM_SCAN pass.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy