ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 #18888

Merged: ggerganov merged 19 commits into ggml-org:master from Alcpz:Alcpz/arm_q6_K_repack on Jan 27, 2026

Alcpz (Collaborator) commented Jan 16, 2026

Continuation of the repack work for ARM, since q4_K_M and q5_K_M quantizations spend ~20% of compute time on q6_K layers.
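For context on what a q6_K layer stores: each weight is a 6-bit quant kept in two parts, a low 4-bit part and a high 2-bit part, and is re-centered by subtracting 32. The sketch below is a simplified scalar illustration of that reconstruction only. The helper name `q6_unpack` is hypothetical, and neither the real block layout nor the interleaved layout used by the repacked kernels is shown.

```cpp
#include <cstdint>

// Hypothetical helper (not part of ggml): rebuild one signed q6_K quant.
//   low4  : the 4 low bits, taken from a nibble of a "ql" byte
//   high2 : the 2 high bits, taken from a 2-bit field of a "qh" byte
inline int8_t q6_unpack(uint8_t low4, uint8_t high2) {
    // Merge the two parts into an unsigned 6-bit value in [0, 63].
    const uint8_t q = static_cast<uint8_t>((low4 & 0x0F) | ((high2 & 0x03) << 4));
    // Re-center to the signed range [-32, 31].
    return static_cast<int8_t>(q - 32);
}
```

The `0x0F` and `0x03` constants play the same role as the `m4b` and `mask_lo` vectors visible in the diff further down in this description, although the NEON kernels apply them to whole vectors rather than to single values.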

Testing follows the same practices as the other repack implementations.

M4 (-DGGML_BLAS=OFF -DGGML_METAL=OFF)

| model | threads | test | master t/s | this PR t/s | Speedup |
| --- | --- | --- | --- | --- | --- |
| LFM2-1.2B Q4_K_M | 8 | pp512 | 589.99 | 685.00 | 1.16 |
| LFM2-1.2B Q4_K_M | 8 | tg128 | 230.97 | 238.18 | 1.03 |
| qwen3 8B Q4_K_M | 8 | pp512 | 88.34 | 106.44 | 1.20 |
| qwen3 8B Q4_K_M | 8 | tg128 | 38.69 | 40.75 | 1.05 |
| llama 8B Q4_K_M | 8 | pp512 | 89.19 | 106.74 | 1.20 |
| llama 8B Q4_K_M | 8 | tg128 | 39.59 | 41.45 | 1.05 |
| LFM2-1.2B Q6_K | 8 | pp512 | 282.56 | 591.21 | 2.09 |
| LFM2-1.2B Q6_K | 8 | tg128 | 186.54 | 192.96 | 1.03 |
| qwen3 8B Q6_K | 8 | pp512 | 42.93 | 94.97 | 2.21 |
| qwen3 8B Q6_K | 8 | tg128 | 26.59 | 30.48 | 1.15 |
| llama 8B Q6_K | 8 | pp512 | 43.01 | 91.11 | 2.12 |
| llama 8B Q6_K | 8 | tg128 | 26.97 | 30.78 | 1.14 |

Exynos 2400

| model | threads | test | master t/s | this PR t/s | Speedup |
| --- | --- | --- | --- | --- | --- |
| LFM2-350M Q4_K_M | 3 | pp512 | 275.44 | 303.76 | 1.10 |
| LFM2-350M Q4_K_M | 3 | tg128 | 75.02 | 93.92 | 1.25 |
| LFM2-1.2B Q4_K_M | 3 | pp512 | 76.89 | 96.84 | 1.26 |
| LFM2-1.2B Q4_K_M | 3 | tg128 | 27.12 | 33.33 | 1.23 |
| LFM2-350M Q6_K | 3 | pp512 | 124.72 | 225.80 | 1.81 |
| LFM2-350M Q6_K | 3 | tg128 | 54.13 | 61.33 | 1.13 |
| LFM2-1.2B Q6_K | 3 | pp512 | 28.78 | 70.39 | 2.45 |
| LFM2-1.2B Q6_K | 3 | tg128 | 20.99 | 21.77 | 1.04 |

Perplexity

| model | Repack OFF | Repack ON | Generic |
| --- | --- | --- | --- |
| LFM2 1.2B Q6_K | 16.7901 ± 0.96392 | 16.7901 ± 0.96392 | 16.8182 ± 0.96585 |
| Meta Llama 3.1 8B Instruct Q6_K | 8.7109 ± 0.42663 | 8.7109 ± 0.42663 | 8.6987 ± 0.42586 |
| Qwen3 8B 128K Q6_K | 11.1063 ± 0.67258 | 11.1063 ± 0.67258 | 11.1544 ± 0.67856 |

llama-cli using repack
build      : b7779-4d328492c
model      : LFM2-350M-Q6_K.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is Rome. It is a historic, vibrant city located on the northern peninsula of central Italy. Known for its rich history, art, architecture, and delicious food, Rome has been the political, cultural, and religious center of the country since the Roman Empire fell.

[ Prompt: 766.2 t/s | Generation: 417.9 t/s ]

> What would be great to eat there?

Rome is a culinary paradise with a plethora of delicious dishes to savor. Here are some great places to eat:

1. **Comforte**: This is a traditional Roman eatery serving classic Roman comfort food like ravioli, carbonara, and pizza romana (Pizza Romana). The platter is usually topped with prosciutto, fresh tomatoes, and basil.

2. **Osteria Morini**: Known for its elegant service and home-style tasting menu, Osteria Morini offers a wide range of Italian wines paired with dishes like ribollita, pappardelle al cinghiale (shepherd's pie), and panettone.

3. **Bistro**: For a more refined dining experience, Bistros like Giordano's and Le Pagliaccio provide high-quality, seasonal Italian cuisine, often with an Italian twist.

4. **Trattoria**: A casual eatery with friendly service, Trattorias like L'Artusi and Osteria Morini offer hearty dishes such as bistecca alla fiorentina (T-bone steak) and carbonara.

5. **Ristorante Carbone**: This upscale spot specializes in barbecued meats and French-inspired dishes, providing a sophisticated taste experience.

6. **Il Mulino**: A famous restaurant with a rich history, it serves authentic Neapolitan cuisine, including ravioli, arancini, and cicchetti (small plates).

7. **Luigi's Pizza**: For a quick and authentic pizza experience, visit Luigi's, which has been a staple in Rome since 1849.

These locations are just a few of the many excellent spots to enjoy Italian food in Rome. Enjoy exploring its many culinary delights!

[ Prompt: 588.5 t/s | Generation: 426.8 t/s ]
llama-cli using generic
  diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  index e0f92798d..361692994 100644
  --- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
  +++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  @@ -1097,7 +1097,7 @@ void ggml_gemv_q6_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
       constexpr int    col_pairs = ncols_interleaved / 2;
       const uint8x16_t m4b       = vdupq_n_u8(0x0f);
       const uint8x16_t mask_lo   = vdupq_n_u8(0x03);
  @@ -3495,7 +3495,7 @@ void ggml_gemm_q6_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
       constexpr int    q8_k_blocklen = 4;
       const uint8x16_t m4b           = vdupq_n_u8(0x0f);
       const uint8x16_t mask_lo       = vdupq_n_u8(0x03);
build      : b7779-4d328492c
model      : LFM2-350M-Q6_K.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is Rome. It's the largest city in the country and serves as its political, economic, and cultural center. With a rich history dating back to the Etruscans, Romans, and later, influences from the Byzantine and Aragonese Empires, Rome has played a pivotal role in shaping Western civilization.

[ Prompt: 203.8 t/s | Generation: 209.2 t/s ]

> > What would be great to visit in there?

Italy is incredibly rich in history and culture, offering a wide range of attractions to explore. Here are some great things to visit:

1. **The Colosseum**: An ancient amphitheater that showcases the grandeur of Roman architecture and gladiatorial games.
2. **The Vatican City**: Home to numerous important religious sites including the Sistine Chapel, St. Peter's Basilica, and the Vatican Museums, which house some of the world's most famous artworks.
3. **Piazza Navona**: A charming square with beautiful baroque architecture, offering a mix of history, culture, and entertainment.
4. **Rome's Pantheon**: An ancient temple with a magnificently preserved dome, a symbol of ancient Rome and architectural prowess.
5. **Tiber Island (Aventine)**: A scenic, historic peninsula that's worth exploring for its ancient ruins, including the Roman Forum and the Basilica of San Clemente.
6. **Tuolino Lake and its villages**: A picturesque lake with charming medieval towns and beautiful scenery.
7. **Cinque Terre**: A string of five colorful coastal towns on the Ligurian Sea, known for their hiking trails, beaches, and scenic views.
8. **Accademia Gallery**: Home to Michelangelo's famous sculpture, "David."
9. **Castel Sant'Angelo**: A former fortress turned museum with stunning views of Rome, once the palace of Emperor Hadrian.
10. **Monti and Pienza**: Areas with rich Renaissance architecture and charming villages.

Each of these sites offers a unique glimpse into Italy's diverse history, art, and culture, making for a rich and enriching trip.

[ Prompt: 230.3 t/s | Generation: 203.5 t/s ]

Alcpz requested a review from ggerganov as a code owner January 16, 2026 23:42
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jan 17, 2026
Alcpz force-pushed the Alcpz/arm_q6_K_repack branch from 69be98f to 8426ec2 January 26, 2026 10:13
Alcpz and others added 2 commits January 26, 2026 10:22
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Alcpz (Collaborator, Author) commented Jan 27, 2026

@ggerganov Comments addressed. The failing CI task seems unrelated: I've seen it in other PRs, and it affects x64, which this PR does not touch.

ggerganov merged commit be8890e into ggml-org:master Jan 27, 2026
77 of 78 checks passed
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
…ions (i8mm) ggml-org#18860 (ggml-org#18888)

* Boilerplate for q6_K repack

* q6_K repack to q6_Kx8 implementation

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q6_K generic gemv and gemm

* wip, gemm_q6_K 8x8

* Still WIP: loading of q8s, q6h and q6l

* first working version of q6_K gemm

* Moved q6 loads outside of sb block, Unrolled inner loop

* Replaced modulo with mask

* First implementation of GEMV

* ggml_vdotq_s32 -> vdotq_s32

* Reduce width of accumulators in q6_K gemv

* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.

* Reuse scales in GEMM (same GEMV opt)

* Added todos for bsum and different qh repack

* Arch fallback

* VSLIQ for merging qh and ql

* Removed TODO, already tested

* Apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Removed unused import

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
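
Two of the optimizations named in the commit list above rest on simple identities, sketched here in scalar C++. This is an illustration under assumptions, not the kernel code from this PR: the function names are hypothetical, the quants are assumed to be kept unsigned in [0, 63], and `y_sum` stands in for a precomputed sum of activations, which is the role the q8_K `bsums` field is understood to play. The first identity is `sum((q - 32) * y) == sum(q * y) - 32 * sum(y)`, which lets the bias be applied once per group instead of once per weight. The second is that for an unsigned index and a power-of-two modulus, `i % 8 == i & 7`.

```cpp
#include <cstddef>
#include <cstdint>

// Reference form: subtract the bias from every quant before multiplying.
inline int32_t dot_with_bias(const uint8_t * q, const int8_t * y, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (static_cast<int32_t>(q[i]) - 32) * y[i];
    }
    return acc;
}

// Equivalent form: multiply the unsigned quants directly, then correct the
// result once using a precomputed sum of the activations (assumed input).
inline int32_t dot_with_bsum(const uint8_t * q, const int8_t * y, size_t n, int32_t y_sum) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += static_cast<int32_t>(q[i]) * y[i];
    }
    return acc - 32 * y_sum;
}

// Modulo by a power of two written as a bit mask (valid for unsigned values).
inline uint32_t wrap8(uint32_t i) {
    return i & 7u;
}
```

Both forms of the dot product return the same integer for any input, so switching between them should not change results, only the amount of work in the inner loop.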
Alcpz deleted the Alcpz/arm_q6_K_repack branch February 10, 2026 17:05