ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 by Alcpz · Pull Request #18888 · ggml-org/llama.cpp

Alcpz · 2026-01-16T23:42:21Z

Continuation of the repack work for ARM, since q4_K_M and q5_K_M quantizations spend ~20% of compute time on q6_K layers.

This still needs to be rebased on top of "ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) #18860" if that gets merged.

Testing follows the same practices as the other repack implementations.

M4 (-DGGML_BLAS=OFF -DGGML_METAL=OFF)

| model | threads | test | master t/s | this PR t/s | Speedup |
| --- | --- | --- | --- | --- | --- |
| LFM2-1.2B Q4_K_M | 8 | pp512 | 589.99 | 685.00 | 1.16 |
| LFM2-1.2B Q4_K_M | 8 | tg128 | 230.97 | 238.18 | 1.03 |
| qwen3 8B Q4_K_M | 8 | pp512 | 88.34 | 106.44 | 1.20 |
| qwen3 8B Q4_K_M | 8 | tg128 | 38.69 | 40.75 | 1.05 |
| llama 8B Q4_K_M | 8 | pp512 | 89.19 | 106.74 | 1.20 |
| llama 8B Q4_K_M | 8 | tg128 | 39.59 | 41.45 | 1.05 |
| LFM2-1.2B Q6_K | 8 | pp512 | 282.56 | 591.21 | 2.09 |
| LFM2-1.2B Q6_K | 8 | tg128 | 186.54 | 192.96 | 1.03 |
| qwen3 8B Q6_K | 8 | pp512 | 42.93 | 94.97 | 2.21 |
| qwen3 8B Q6_K | 8 | tg128 | 26.59 | 30.48 | 1.15 |
| llama 8B Q6_K | 8 | pp512 | 43.01 | 91.11 | 2.12 |
| llama 8B Q6_K | 8 | tg128 | 26.97 | 30.78 | 1.14 |

Exynos 2400

| model | threads | test | master t/s | this PR t/s | Speedup |
| --- | --- | --- | --- | --- | --- |
| LFM2-350M Q4_K_M | 3 | pp512 | 275.44 | 303.76 | 1.10 |
| LFM2-350M Q4_K_M | 3 | tg128 | 75.02 | 93.92 | 1.25 |
| LFM2-1.2B Q4_K_M | 3 | pp512 | 76.89 | 96.84 | 1.26 |
| LFM2-1.2B Q4_K_M | 3 | tg128 | 27.12 | 33.33 | 1.23 |
| LFM2-350M Q6_K | 3 | pp512 | 124.72 | 225.80 | 1.81 |
| LFM2-350M Q6_K | 3 | tg128 | 54.13 | 61.33 | 1.13 |
| LFM2-1.2B Q6_K | 3 | pp512 | 28.78 | 70.39 | 2.45 |
| LFM2-1.2B Q6_K | 3 | tg128 | 20.99 | 21.77 | 1.04 |

Perplexity

| model | Repack OFF | Repack ON | Generic |
| --- | --- | --- | --- |
| LFM2 1.2B Q6_K | 16.7901 ± 0.96392 | 16.7901 ± 0.96392 | 16.8182 ± 0.96585 |
| Meta Llama 3.1 8B Instruct Q6_K | 8.7109 ± 0.42663 | 8.7109 ± 0.42663 | 8.6987 ± 0.42586 |
| Qwen3 8B 128K Q6_K | 11.1063 ± 0.67258 | 11.1063 ± 0.67258 | 11.1544 ± 0.67856 |

llama-cli using repack

build      : b7779-4d328492c
model      : LFM2-350M-Q6_K.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is Rome. It is a historic, vibrant city located on the northern peninsula of central Italy. Known for its rich history, art, architecture, and delicious food, Rome has been the political, cultural, and religious center of the country since the Roman Empire fell.

[ Prompt: 766.2 t/s | Generation: 417.9 t/s ]

> What would be great to eat there?

Rome is a culinary paradise with a plethora of delicious dishes to savor. Here are some great places to eat:

1. **Comforte**: This is a traditional Roman eatery serving classic Roman comfort food like ravioli, carbonara, and pizza romana (Pizza Romana). The platter is usually topped with prosciutto, fresh tomatoes, and basil.

2. **Osteria Morini**: Known for its elegant service and home-style tasting menu, Osteria Morini offers a wide range of Italian wines paired with dishes like ribollita, pappardelle al cinghiale (shepherd's pie), and panettone.

3. **Bistro**: For a more refined dining experience, Bistros like Giordano's and Le Pagliaccio provide high-quality, seasonal Italian cuisine, often with an Italian twist.

4. **Trattoria**: A casual eatery with friendly service, Trattorias like L'Artusi and Osteria Morini offer hearty dishes such as bistecca alla fiorentina (T-bone steak) and carbonara.

5. **Ristorante Carbone**: This upscale spot specializes in barbecued meats and French-inspired dishes, providing a sophisticated taste experience.

6. **Il Mulino**: A famous restaurant with a rich history, it serves authentic Neapolitan cuisine, including ravioli, arancini, and cicchetti (small plates).

7. **Luigi's Pizza**: For a quick and authentic pizza experience, visit Luigi's, which has been a staple in Rome since 1849.

These locations are just a few of the many excellent spots to enjoy Italian food in Rome. Enjoy exploring its many culinary delights!

[ Prompt: 588.5 t/s | Generation: 426.8 t/s ]

llama-cli using generic

  diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  index e0f92798d..361692994 100644
  --- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
  +++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  @@ -1097,7 +1097,7 @@ void ggml_gemv_q6_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
       constexpr int    col_pairs = ncols_interleaved / 2;
       const uint8x16_t m4b       = vdupq_n_u8(0x0f);
       const uint8x16_t mask_lo   = vdupq_n_u8(0x03);
  @@ -3495,7 +3495,7 @@ void ggml_gemm_q6_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
       constexpr int    q8_k_blocklen = 4;
       const uint8x16_t m4b           = vdupq_n_u8(0x0f);
       const uint8x16_t mask_lo       = vdupq_n_u8(0x03);
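The patch above forces the generic path by prefixing each feature guard with `0 &&`, so the preprocessor condition is always false and the NEON body is compiled out. A minimal sketch of that pattern follows. The names `dot_i8`, `dot_i8_generic`, `FORCE_GENERIC`, and `SKETCH_HAVE_DOTPROD` are hypothetical and do not come from repack.cpp; the kernel is a plain int8 dot product, not the q6_K kernel from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical switch for this sketch; the patch above hard-codes `0 &&` instead.
#ifndef FORCE_GENERIC
#define FORCE_GENERIC 0
#endif

#if !FORCE_GENERIC && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#define SKETCH_HAVE_DOTPROD 1
#else
#define SKETCH_HAVE_DOTPROD 0
#endif

// Portable reference path: scalar int8 dot product with int32 accumulation.
static int32_t dot_i8_generic(const int8_t * a, const int8_t * b, size_t n) {
    assert(a != nullptr && b != nullptr);
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}

// Dispatches at compile time: NEON dot-product path when available and not
// disabled, otherwise the generic path.
static int32_t dot_i8(const int8_t * a, const int8_t * b, size_t n) {
#if SKETCH_HAVE_DOTPROD
    int32x4_t acc = vdupq_n_s32(0);
    size_t    i   = 0;
    for (; i + 16 <= n; i += 16) {
        // 16 int8 products accumulated into 4 int32 lanes.
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    // Horizontal add of the lanes, then handle the tail with the scalar path.
    return vaddvq_s32(acc) + dot_i8_generic(a + i, b + i, n - i);
#else
    return dot_i8_generic(a, b, n);
#endif
}
```

Building with `-DFORCE_GENERIC=1` has the same effect as the `#if 0 &&` edit: both paths should return identical integer results, which makes the generic build a convenient reference when comparing output.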

build      : b7779-4d328492c
model      : LFM2-350M-Q6_K.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is Rome. It's the largest city in the country and serves as its political, economic, and cultural center. With a rich history dating back to the Etruscans, Romans, and later, influences from the Byzantine and Aragonese Empires, Rome has played a pivotal role in shaping Western civilization.

[ Prompt: 203.8 t/s | Generation: 209.2 t/s ]

> > What would be great to visit in there?

Italy is incredibly rich in history and culture, offering a wide range of attractions to explore. Here are some great things to visit:

1. **The Colosseum**: An ancient amphitheater that showcases the grandeur of Roman architecture and gladiatorial games.
2. **The Vatican City**: Home to numerous important religious sites including the Sistine Chapel, St. Peter's Basilica, and the Vatican Museums, which house some of the world's most famous artworks.
3. **Piazza Navona**: A charming square with beautiful baroque architecture, offering a mix of history, culture, and entertainment.
4. **Rome's Pantheon**: An ancient temple with a magnificently preserved dome, a symbol of ancient Rome and architectural prowess.
5. **Tiber Island (Aventine)**: A scenic, historic peninsula that's worth exploring for its ancient ruins, including the Roman Forum and the Basilica of San Clemente.
6. **Tuolino Lake and its villages**: A picturesque lake with charming medieval towns and beautiful scenery.
7. **Cinque Terre**: A string of five colorful coastal towns on the Ligurian Sea, known for their hiking trails, beaches, and scenic views.
8. **Accademia Gallery**: Home to Michelangelo's famous sculpture, "David."
9. **Castel Sant'Angelo**: A former fortress turned museum with stunning views of Rome, once the palace of Emperor Hadrian.
10. **Monti and Pienza**: Areas with rich Renaissance architecture and charming villages.

Each of these sites offers a unique glimpse into Italy's diverse history, art, and culture, making for a rich and enriching trip.

[ Prompt: 230.3 t/s | Generation: 203.5 t/s ]

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

ggml/src/ggml-cpu/arch/arm/repack.cpp

ggml/src/ggml-cpu/repack.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Alcpz · 2026-01-27T09:06:06Z

@ggerganov Comments addressed. The failing CI task seems unrelated: I've seen it in other PRs, and it affects x64, which is not affected by this PR.

…ions (i8mm) ggml-org#18860 (ggml-org#18888)

* Boilerplate for q6_K repack
* q6_K repack to q6_Kx8 implementation

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q6_K generic gemv and gemm
* wip, gemm_q6_K 8x8
* Still WIP: loading of q8s, q6h and q6l
* first working version of q6_K gemm
* Moved q6 loads outside of sb block, Unrolled inner loop
* Replaced modulo with mask
* First implementation of GEMV
* ggml_vdotq_s32 -> vdotq_s32
* Reduce width of accumulators in q6_K gemv
* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.
* Reuse scales in GEMM (same GEMV opt)
* Added todos for bsum and different qh repack
* Arch fallback
* VSLIQ for merging qh adn ql
* Removed TODO, already tested
* Apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Removed unused import

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
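The commit notes above mention using VSLIQ to merge qh and ql. A q6_K quant is a 6-bit value stored as a low nibble (ql) plus two high bits (qh), with an offset of 32 to make it signed. The NEON "shift left and insert" operation writes `src << shift` into the destination while keeping the destination's low `shift` bits, so it can combine both parts in one step. The scalar sketch below models that on a single byte. The helper names `sli_u8` and `q6_from_parts` are hypothetical, and the sketch assumes the two parts have already been extracted; it does not reproduce the bit layout of the repacked q6_Kx8 block or the PR's actual kernel.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of "shift left and insert" on one byte:
// keep the low `shift` bits of `dst`, fill the remaining bits from `src << shift`.
static uint8_t sli_u8(uint8_t dst, uint8_t src, int shift) {
    assert(shift >= 0 && shift < 8);
    const uint8_t keep = uint8_t((1u << shift) - 1u);
    return uint8_t((dst & keep) | uint8_t(src << shift));
}

// Rebuild one signed 6-bit quant from its low nibble and its two high bits.
// Result range is [-32, 31].
static int8_t q6_from_parts(uint8_t ql_nibble, uint8_t qh_2bits) {
    assert(ql_nibble < 16 && qh_2bits < 4);
    const uint8_t q = sli_u8(ql_nibble, qh_2bits, 4);  // unsigned value in [0, 63]
    return int8_t(int(q) - 32);                        // apply the q6_K offset
}
```

Because the insert keeps only the low four bits of the destination, a separate mask of the ql byte before the OR becomes unnecessary in this model; whether the PR's kernel saves exactly that instruction is not stated in the source.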

Alcpz requested a review from ggerganov as a code owner January 16, 2026 23:42

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jan 17, 2026

Alcpz added 17 commits January 23, 2026 10:54

Boilerplate for q6_K repack

36496f3

q6_K repack to q6_Kx8 implementation

c9be47c

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

q6_K generic gemv and gemm

e036b51

wip, gemm_q6_K 8x8

219f5f8

Still WIP: loading of q8s, q6h and q6l

aaf1664

first working version of q6_K gemm

7761455

Moved q6 loads outside of sb block, Unrolled inner loop

3a0f925

Replaced modulo with mask

f6500a6

First implementation of GEMV

393d145

ggml_vdotq_s32 -> vdotq_s32

5366aad

Reduce width of accumulators in q6_K gemv

9cdaaee

Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.

df0d15f

Reuse scales in GEMM (same GEMV opt)

6eb9f63

Added todos for bsum and different qh repack

64358c2

Arch fallback

6f542e3

VSLIQ for merging qh adn ql

a6d56c2

Removed TODO, already tested

8426ec2
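One commit in the list above reads "Bsums instead of calc bias". The underlying identity is that the q6_K offset can be pulled out of the dot product: sum((q - 32) * y) equals sum(q * y) - 32 * sum(y), so a precomputed sum of the activations replaces a per-element subtraction. The sketch below checks that identity on plain arrays. The function names are hypothetical, per-group scales are left out, and it is only an illustration of the arithmetic, not the PR's kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sum of int8 activations; stands in for a precomputed block sum in this sketch.
static int32_t sum_i8(const int8_t * y, size_t n) {
    int32_t s = 0;
    for (size_t i = 0; i < n; ++i) {
        s += y[i];
    }
    return s;
}

// Direct form: apply the -32 offset to every weight before multiplying.
static int32_t dot_with_offset(const uint8_t * q6u, const int8_t * y, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(q6u[i] < 64);  // unsigned 6-bit weights
        acc += (int32_t(q6u[i]) - 32) * int32_t(y[i]);
    }
    return acc;
}

// Block-sum form: multiply the unsigned weights, then correct once with 32 * sum(y).
static int32_t dot_with_bsum(const uint8_t * q6u, const int8_t * y, int32_t y_sum, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(q6u[i] < 64);
        acc += int32_t(q6u[i]) * int32_t(y[i]);
    }
    return acc - 32 * y_sum;
}
```

Both forms give the same integer result for any inputs in range, so the choice between them is purely about how much work stays in the inner loop.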

Alcpz force-pushed the Alcpz/arm_q6_K_repack branch from 69be98f to 8426ec2 Compare January 26, 2026 10:13

ggerganov approved these changes Jan 26, 2026


Alcpz and others added 2 commits January 26, 2026 10:22

Apply suggestions

1427160

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Removed unused import

4a81fb6

loci-dev mentioned this pull request Jan 26, 2026

UPSTREAM PR #18888: ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 auroralabs-loci/llama.cpp#1039

ggerganov merged commit be8890e into ggml-org:master Jan 27, 2026
77 of 78 checks passed

Alcpz deleted the Alcpz/arm_q6_K_repack branch February 10, 2026 17:05

ggerganov merged 19 commits into ggml-org:master from Alcpz:Alcpz/arm_q6_K_repack
