Add optimization report for Granite Hybrid Q4_K_M on AVX2 CPUs#5

Open
dillon-blake wants to merge 6 commits into master from
claude/optimize-ml-kernels-PF3Ng
Conversation

@dillon-blake

Detailed analysis of llama.cpp kernel implementations for the Granite 4 Hybrid (Mamba2+Attention+MoE) model with Q4_K_M quantization targeting AMD64 laptop CPUs with AVX2.

Three proposals with testing plans:

  1. Software prefetching for Q4_K dot product kernels (est. +5-10% prefill)
  2. SIMD vectorization of scalar SSM convolution kernel (est. +3-6% prefill)
  3. Cache-aligned tensor allocation + repacked GEMV prefetch (est. +5-10% prefill)

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy

Complete rewrite covering both prefill and decode paths with five concrete
changes across four files. Key finding: decode uses standard vec_dot (not
repacked GEMV) and is dominated by SSM state bandwidth, not weight access.

Changes proposed:
1. Q4_K vec_dot prefetch (decode: +5-8%)
2. Repacked GEMV/GEMM prefetch (prefill: +5-10%)
3. SSM scan state prefetch (decode: +8-15%)
4. SSM conv AVX2 vectorization (both: +3-5%)
5. TENSOR_ALIGNMENT 32->64 (both: +1-3%)

Includes 6-phase testing plan and Granite Hybrid architecture analysis.
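The prefetch idea behind changes 1-3 can be illustrated with a minimal sketch. The struct fields and block sizes below are simplified stand-ins for llama.cpp's `block_q4_K`/`block_q8_K` superblocks (the real ones pack scales, mins, and nibbles), and the inner arithmetic is a scalar placeholder; only the lookahead scheduling mirrors the proposed kernel change.

```c
#include <stddef.h>

/* Hypothetical simplified superblocks; real Q4_K/Q8_K blocks differ. */
typedef struct { float d; unsigned char qs[128]; } blk_w; /* weights  */
typedef struct { float d; signed char   qs[128]; } blk_a; /* activations */

/* Dot-product loop issuing software prefetches two superblocks ahead,
 * the pattern proposed for ggml_vec_dot_q4_K_q8_K. */
static float dot_with_prefetch(const blk_w *w, const blk_a *a, int nb) {
    float sum = 0.0f;
    for (int i = 0; i < nb; ++i) {
        if (i + 2 < nb) {
            /* read-only hint, low temporal locality: data is streamed once */
            __builtin_prefetch(&w[i + 2], 0, 1);
            __builtin_prefetch(&a[i + 2], 0, 1);
        }
        float s = 0.0f;
        for (int j = 0; j < 128; ++j)
            s += (float)(w[i].qs[j] & 0x0F) * (float)a[i].qs[j];
        sum += w[i].d * a[i].d * s;
    }
    return sum;
}
```

A lookahead of two blocks gives the hardware roughly one superblock's worth of compute to hide the memory latency; the right distance is workload-dependent and would need benchmarking.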

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
The GGUF file format only guarantees 32-byte alignment (GGUF_DEFAULT_ALIGNMENT
in gguf.h:46). Increasing TENSOR_ALIGNMENT to 64 would therefore trigger
assertion failures at ggml-backend.cpp:2268 for any mmap'd model, and mmap
is the default loading path.

KV cache and Mamba state save/load are unaffected by any of the remaining
4 changes (prefetch + SSM conv SIMD) since state serialization writes raw
tensor values, not memory layouts.
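The alignment conflict reduces to simple modular arithmetic. The helpers below are illustrative (not llama.cpp functions): GGUF pads each tensor's file offset up to the file's alignment, so a 32-byte-padded offset is only 64-byte aligned half the time.

```c
#include <stdint.h>

/* GGUF-style padding: round an offset up to the file's alignment
 * (default 32). Illustrative helper, not a llama.cpp API. */
static uint64_t gguf_pad(uint64_t offset, uint64_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

/* The kind of check a backend alignment assertion performs. */
static int is_aligned(uint64_t addr, uint64_t alignment) {
    return addr % alignment == 0;
}
```

For example, an offset padded to 96 satisfies the 32-byte check but fails a 64-byte one, which is exactly the failure mode a raised TENSOR_ALIGNMENT would hit on mmap'd files.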

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
Four targeted changes for Granite 4 Hybrid (Mamba2+Attention+MoE)
Q4_K_M inference on AMD64 CPUs with AVX2:

1. Q4_K vec_dot prefetch (quants.c): Prefetch weight+activation blocks
   2 iterations ahead in ggml_vec_dot_q4_K_q8_K. This is the decode
   matmul path. Mirrors existing Q4_0 prefetch pattern.

2. Repacked GEMV/GEMM prefetch (repack.cpp): Prefetch next Q4_Kx8
   block (header + first 4 cache lines of qs) in both
   ggml_gemv_q4_K_8x8_q8_K and ggml_gemm_q4_K_8x8_q8_K. This is the
   prefill matmul path.

3. SSM scan state prefetch (ops.cpp): Prefetch state arrays 4 rows
   ahead and B/C vectors at head boundaries in ssm_scan_f32. Targets
   the ~384KB/layer state streaming that dominates Mamba2 decode.

4. SSM conv AVX2 vectorization (ops.cpp): Replace scalar d_inner loop
   with AVX2 FMA processing 8 rows at a time in ssm_conv_f32. The
   kernel was entirely unvectorized. Scalar remainder handles non-8
   aligned dimensions.
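The row-wise vectorization in change 4 can be sketched as follows. This assumes a layout where, for each tap t, the per-row samples and per-row kernel weights are contiguous (`x[t*nr + r]`, `w[t*nr + r]`); the real ggml tensor layout differs, so this only illustrates the 8-rows-per-iteration FMA scheme with a scalar remainder.

```c
#include <stddef.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Convolve nr rows with their nc-tap kernels, 8 rows per iteration on
 * AVX2+FMA, falling back to (and finishing with) a scalar loop. */
static void ssm_conv_rows(const float *x, const float *w, float *y,
                          int nr, int nc) {
    int r = 0;
#if defined(__AVX2__) && defined(__FMA__)
    for (; r + 8 <= nr; r += 8) {
        __m256 acc = _mm256_setzero_ps();
        for (int t = 0; t < nc; ++t) {
            __m256 xs = _mm256_loadu_ps(&x[(size_t)t * nr + r]);
            __m256 ws = _mm256_loadu_ps(&w[(size_t)t * nr + r]);
            acc = _mm256_fmadd_ps(ws, xs, acc); /* acc += w * x */
        }
        _mm256_storeu_ps(&y[r], acc);
    }
#endif
    /* scalar remainder: rows not divisible by 8 (or no AVX2 at all) */
    for (; r < nr; ++r) {
        float s = 0.0f;
        for (int t = 0; t < nc; ++t)
            s += w[(size_t)t * nr + r] * x[(size_t)t * nr + r];
        y[r] = s;
    }
}
```

Because the remainder loop picks up wherever the vector loop stopped, the same function handles non-multiple-of-8 row counts and non-AVX2 builds.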

Test results: SSM_CONV 27/27, SSM_SCAN 3/3, MUL_MAT 1009/1009 passed.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
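The state prefetch in change 3 amounts to hinting the next few state rows while the current one is being updated, so the large per-layer state streams into cache ahead of use. The layout, names, and the simplified `s = s*dA + dB*x` update below are illustrative, not ggml's actual ssm_scan_f32 code.

```c
#include <stddef.h>

/* Update an nrows x ncols recurrent state in place, prefetching
 * LOOKAHEAD rows ahead (read-write hint, low temporal locality). */
static void ssm_scan_row_update(float *state, const float *dA,
                                const float *dBx, int nrows, int ncols) {
    enum { LOOKAHEAD = 4 }; /* rows ahead to prefetch */
    for (int r = 0; r < nrows; ++r) {
        if (r + LOOKAHEAD < nrows)
            __builtin_prefetch(&state[(size_t)(r + LOOKAHEAD) * ncols], 1, 1);
        float *row = &state[(size_t)r * ncols];
        for (int c = 0; c < ncols; ++c)
            row[c] = row[c] * dA[r] + dBx[r]; /* simplified SSM update */
    }
}
```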
Implement nrc=2 path in ggml_vec_dot_q4_K_q8_K that processes two
weight rows against the same activation vector simultaneously. This
shares Q8_K activation loads across both rows, reducing load port
pressure by ~33% (4 loads per 2 rows vs 6 loads with separate calls).

The inner j=0..3 loop is fully unrolled to eliminate branch overhead
and allow better register scheduling across sub-blocks.

Enable nrows=2 in type_traits_cpu for Q4_K on AVX2 (previously only
ARM MATMUL_INT8 had multi-row support).
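The load-sharing idea is easiest to see in a scalar stand-in: one pass over the shared activation vector feeds both weight rows, so each activation element is loaded once instead of once per row. The function name and plain-float types are illustrative; the real kernel operates on quantized blocks with AVX2.

```c
/* Compute two dot products against one shared activation vector. */
static void vec_dot_two_rows(const float *w0, const float *w1,
                             const float *act, int n,
                             float *out0, float *out1) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; ++i) {
        float a = act[i]; /* single shared load serves both rows */
        s0 += w0[i] * a;
        s1 += w1[i] * a;
    }
    *out0 = s0;
    *out1 = s1;
}
```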

All tests pass: MUL_MAT 1009/1009, SSM_CONV 27/27, SSM_SCAN 3/3.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
…e on AVX2

The nrc==2 calling convention produces a 2x2 output tile (4 dot products:
2 weight rows × 2 activation columns), as required by the ARM MMLA
instruction. On x86 AVX2 with only 16 ymm registers, computing 4
simultaneous dot products causes massive register spills, making it
slower than the baseline nrc==1 path.

Revert nrows to 1 for Q4_K on x86 and remove the incorrect nrc==2
kernel. Keep nrows=2 for ARM MMLA where the hardware natively supports
2x2 tile computation.

All other enhancements (prefetch in vec_dot/GEMV/GEMM, SSM conv AVX2,
SSM scan prefetch) remain unchanged.

Tests: 1009/1009 MUL_MAT, 30/30 SSM_CONV/SSM_SCAN pass.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy