Add optimization report for Granite Hybrid Q4_K_M on AVX2 CPUs #5
Open
dillon-blake wants to merge 6 commits into master from
Conversation
Detailed analysis of llama.cpp kernel implementations for the Granite 4 Hybrid (Mamba2+Attention+MoE) model with Q4_K_M quantization, targeting AMD64 laptop CPUs with AVX2.

Three proposals with testing plans:

1. Software prefetching for the Q4_K dot-product kernels (est. +5-10% prefill)
2. SIMD vectorization of the scalar SSM convolution kernel (est. +3-6% prefill)
3. Cache-aligned tensor allocation plus repacked-GEMV prefetch (est. +5-10% prefill)

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
Complete rewrite covering both prefill and decode paths with five concrete changes across four files. Key finding: decode uses standard vec_dot (not repacked GEMV) and is dominated by SSM state bandwidth, not weight access. Changes proposed: 1. Q4_K vec_dot prefetch (decode: +5-8%) 2. Repacked GEMV/GEMM prefetch (prefill: +5-10%) 3. SSM scan state prefetch (decode: +8-15%) 4. SSM conv AVX2 vectorization (both: +3-5%) 5. TENSOR_ALIGNMENT 32->64 (both: +1-3%) Includes 6-phase testing plan and Granite Hybrid architecture analysis. https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
The GGUF file format only guarantees 32-byte alignment (GGUF_DEFAULT_ALIGNMENT in gguf.h:46). Increasing TENSOR_ALIGNMENT to 64 would therefore cause assertion failures at ggml-backend.cpp:2268 for any mmap'd model, which is the default loading path. The KV cache and Mamba state save/load are unaffected by the remaining four changes (prefetch + SSM conv SIMD), since state serialization writes raw tensor values, not memory layouts.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
Four targeted changes for Granite 4 Hybrid (Mamba2+Attention+MoE) Q4_K_M inference on AMD64 CPUs with AVX2:

1. Q4_K vec_dot prefetch (quants.c): prefetch weight+activation blocks 2 iterations ahead in ggml_vec_dot_q4_K_q8_K. This is the decode matmul path. Mirrors the existing Q4_0 prefetch pattern.
2. Repacked GEMV/GEMM prefetch (repack.cpp): prefetch the next Q4_Kx8 block (header + first 4 cache lines of qs) in both ggml_gemv_q4_K_8x8_q8_K and ggml_gemm_q4_K_8x8_q8_K. This is the prefill matmul path.
3. SSM scan state prefetch (ops.cpp): prefetch state arrays 4 rows ahead and B/C vectors at head boundaries in ssm_scan_f32. Targets the ~384 KB/layer state streaming that dominates Mamba2 decode.
4. SSM conv AVX2 vectorization (ops.cpp): replace the scalar d_inner loop with AVX2 FMA processing 8 rows at a time in ssm_conv_f32. The kernel was entirely unvectorized; a scalar remainder handles dimensions not divisible by 8.

Test results: SSM_CONV 27/27, SSM_SCAN 3/3, MUL_MAT 1009/1009 passed.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
Implement an nrc=2 path in ggml_vec_dot_q4_K_q8_K that processes two weight rows against the same activation vector simultaneously. This shares the Q8_K activation loads across both rows, reducing load-port pressure by ~33% (4 loads per 2 rows vs. 6 loads with separate calls). The inner j=0..3 loop is fully unrolled to eliminate branch overhead and allow better register scheduling across sub-blocks. Enable nrows=2 in type_traits_cpu for Q4_K on AVX2 (previously only ARM MATMUL_INT8 had multi-row support).

All tests pass: MUL_MAT 1009/1009, SSM_CONV 27/27, SSM_SCAN 3/3.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy
…e on AVX2

The nrc==2 calling convention produces a 2x2 output tile (4 dot products: 2 weight rows × 2 activation columns), as required by the ARM MMLA instruction. On x86 AVX2, with only 16 ymm registers, computing 4 simultaneous dot products causes massive register spills, making it slower than the baseline nrc==1 path.

Revert nrows to 1 for Q4_K on x86 and remove the incorrect nrc==2 kernel. Keep nrows=2 for ARM MMLA, where the hardware natively supports the 2x2 tile computation. All other enhancements (prefetch in vec_dot/GEMV/GEMM, SSM conv AVX2, SSM scan prefetch) remain unchanged.

Tests: 1009/1009 MUL_MAT, 30/30 SSM_CONV/SSM_SCAN pass.

https://claude.ai/code/session_01MQaNCwdTUz71XEjhJ51Fxy