ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860#18888
Merged
ggerganov merged 19 commits intoggml-org:masterfrom Jan 27, 2026
Merged
Conversation
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
69be98f to
8426ec2
Compare
ggerganov
approved these changes
Jan 26, 2026
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 task
Collaborator
Author
|
@ggerganov Comments addressed. The failing CI task seems unrelated, I've seen it around in other PRs and it's affecting x64, which is not affected by this PR |
shaofeiqi
pushed a commit
to qualcomm/llama.cpp
that referenced
this pull request
Feb 6, 2026
…ions (i8mm) ggml-org#18860 (ggml-org#18888) * Boilerplate for q6_K repack * q6_K repack to q6_Kx8 implementation Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q6_K generic gemv and gemm * wip, gemm_q6_K 8x8 * Still WIP: loading of q8s, q6h and q6l * first working version of q6_K gemm * Moved q6 loads outside of sb block, Unrolled inner loop * Replaced modulo with mask * First implementation of GEMV * ggml_vdotq_s32 -> vdotq_s32 * Reduce width of accumulators in q6_K gemv * Bsums instead of calc bias. Preload scales to use vget_lane. Unroll. * Reuse scales in GEMM (same GEMV opt) * Added todos for bsum and different qh repack * Arch fallback * VSLIQ for merging qh adn ql * Removed TODO, already tested * Apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Removed unused import --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Continuation of repack work for ARM, since
q4_K_Mandq5_K_Mquantizations spend ~%20 of compute time on q6_K layers.Same testing practices from the other repack implementations.
M4 (-DGGML_BLAS=OFF -DGGML_METAL=OFF)
Exynos 2400
Perplexity
llama-cli using repack
llama-cli using generic
diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp index e0f92798d..361692994 100644 --- a/ggml/src/ggml-cpu/arch/arm/repack.cpp +++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp @@ -1097,7 +1097,7 @@ void ggml_gemv_q6_K_8x8_q8_K(int n, UNUSED(ncols_interleaved); UNUSED(blocklen); -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD) +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD) constexpr int col_pairs = ncols_interleaved / 2; const uint8x16_t m4b = vdupq_n_u8(0x0f); const uint8x16_t mask_lo = vdupq_n_u8(0x03); @@ -3495,7 +3495,7 @@ void ggml_gemm_q6_K_8x8_q8_K(int n, UNUSED(ncols_interleaved); UNUSED(blocklen); -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8) +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8) constexpr int q8_k_blocklen = 4; const uint8x16_t m4b = vdupq_n_u8(0x0f); const uint8x16_t mask_lo = vdupq_n_u8(0x03);