
ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod)#19360

Merged
Alcpz merged 5 commits into ggml-org:master from Alcpz:Alcpz/arm_q6_K_dotprod
Feb 10, 2026
Conversation

@Alcpz
Collaborator

@Alcpz Alcpz commented Feb 5, 2026

Same as #19356, but for Q6_K.

PR contents:

Same testing methodology: llama-cli output, outputs of the GEMM and GEMV kernels, and perplexity to double-check prompt processing.

Performance

  • Apple M4 Max (-mcpu=cortex-a76+dotprod+noi8mm+nosve)
| Model | Test | Repack OFF (t/s) | Repack ON (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| lfm2 1.2B Q6_K | pp512 | 221.62 | 686.86 | 3.10 |
| lfm2 1.2B Q6_K | tg128 | 150.49 | 184.11 | 1.22 |
| qwen3 8B Q6_K | pp512 | 30.10 | 94.76 | 3.15 |
| qwen3 8B Q6_K | tg128 | 23.34 | 30.02 | 1.29 |
  • Rpi5
| Model | Test | Repack OFF (t/s) | Repack ON (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| lfm2 350M Q6_K | pp512 | 86.21 | 224.06 | 2.60 |
| lfm2 350M Q6_K | tg128 | 39.45 | 38.51 | 0.98 |
| lfm2 700M Q6_K | pp512 | 39.93 | 105.82 | 2.65 |
| lfm2 700M Q6_K | tg128 | 19.08 | 17.83 | 0.93 |

Perplexity

| Model | Repack ON | Generic | Repack OFF |
| --- | --- | --- | --- |
| LFM2-1.2B | 16.7641 +/- 0.96182 | 16.7624 +/- 0.96229 | 16.7641 +/- 0.96182 |
| Qwen3-8B | 11.2017 +/- 0.68155 | 11.2060 +/- 0.68156 | 11.2017 +/- 0.68155 |

llama-cli

llama-cli using repack
build      : b7849-68e615e51
model      : LFM2-1.2B-Q6_K.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Brasil?

The capital of Brazil is Brasília. Located in the central part of the country, Brasília serves as the political, economic, and cultural center of Brazil. It was designed by the Brazilian architect Oscar Niemeyer and his associate Lúcio Costa, and it was officially inaugurated as the capital in 1960.

Brasília was chosen to be the capital primarily for its central location within Brazil, as well as to serve as a symbol of modernity and progress during Brazil's economic and political development. The city was built to house government institutions and is notable for its futuristic architecture and wide avenues, which were part of an ambitious urban planning project.

Brasília is not only the capital but also a UNESCO World Heritage Site, recognized for its innovative design and its role in shaping urban planning in the 20th century. The city is surrounded by the Green Belt, a vast area of forest that helps preserve the environment around it.

[ Prompt: 509.5 t/s | Generation: 169.9 t/s ]

> I thought it was Sao Paulo!

I apologize for the mistake. The capital of Brazil is actually Brasília, not Sao Paulo. Brasília was chosen as the capital in 1960 as part of a national plan to move the center of government away from the densely populated coastal cities and to promote development in the interior of the country.

Brasília is known for its modernist architecture and is the political, economic, and cultural heart of Brazil. It was designed by architects Lúcio Costa and Oscar Niemeyer, and it features a unique urban layout with wide avenues and planned neighborhoods.

While Sao Paulo is indeed the largest city in Brazil and a major economic hub, it is not the capital. If you have any more questions about Brazil or any other topic, feel free to ask!

[ Prompt: 452.1 t/s | Generation: 169.7 t/s ]
llama-cli using generic
build      : b7849-68e615e51
model      : LFM2-1.2B-Q6_K.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Brasil?

The capital of Brazil is Brasília. It is a city known for its modernist architecture, expansive green spaces, and as the seat of the Federal District of Brazil, the country’s political center. Brasília was designed by urban planner Lúcio Costa and was inaugurated as the capital in 1960, replacing Rio de Janeiro. The city is a symbol of Brazil's rapid modernization and development in the mid-20th century.

[ Prompt: 36.8 t/s | Generation: 29.6 t/s ]

> I thought it was Sao Paulo!

You're correct that Sao Paulo is often mistaken for being the capital of Brazil, but that's not the case. The capital of Brazil is Brasília. Sao Paulo is the largest city in Brazil in terms of population and is also a major economic center, but it is not the capital. Brasília was chosen specifically to house the federal government and other key institutions to promote development in the northern part of the country and to decentralize political power from the coastal states, particularly Rio de Janeiro.

[ Prompt: 35.9 t/s | Generation: 29.6 t/s ]

>

@Alcpz Alcpz requested a review from ggerganov as a code owner February 5, 2026 12:17
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 5, 2026
@Alcpz
Collaborator Author

Alcpz commented Feb 5, 2026

@tdakhran

@Alcpz Alcpz force-pushed the Alcpz/arm_q6_K_dotprod branch from 68e615e to 8d1d4b3 on February 9, 2026 09:18
Member

@ggerganov ggerganov left a comment

You should be able to merge now (use squash + merge).

@Alcpz Alcpz merged commit c03a5a4 into ggml-org:master Feb 10, 2026
78 checks passed
@Alcpz Alcpz deleted the Alcpz/arm_q6_K_dotprod branch February 10, 2026 17:06
UNUSED(bs);
UNUSED(nr);

float sumf[8];
Contributor


@Alcpz, this should use the templated parameter N.

Collaborator Author


Will address in a different PR. Thanks for flagging.

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
…ons (dotprod) (ggml-org#19360)

* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
…ons (dotprod) (ggml-org#19360)

* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
@am17an
Contributor

am17an commented Mar 8, 2026

May I know why we have the same repack format for all ISAs? Wouldn't it be better to specialize it per ISA? I see that, at least on x86, it could be made much faster if we could avoid the permute and shuffle. cc @ggerganov

@ggerganov
Member

What do you mean? I think we select the repack type per ISA here:

static const ggml::cpu::tensor_traits * ggml_repack_get_optimal_repack_type(const struct ggml_tensor * cur) {
    // instance for Q4
    static const ggml::cpu::repack::tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 4, GGML_TYPE_Q8_0> q4_0_4x8_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 8, GGML_TYPE_Q8_0> q4_0_8x8_q8_0;
    // instance for Q4_K
    static const ggml::cpu::repack::tensor_traits<block_q4_K, 4, 8, GGML_TYPE_Q8_K> q4_K_8x4_q8_K;
    static const ggml::cpu::repack::tensor_traits<block_q4_K, 8, 8, GGML_TYPE_Q8_K> q4_K_8x8_q8_K;
    // instance for Q5_K
    static const ggml::cpu::repack::tensor_traits<block_q5_K, 4, 8, GGML_TYPE_Q8_K> q5_K_8x4_q8_K;
    static const ggml::cpu::repack::tensor_traits<block_q5_K, 8, 8, GGML_TYPE_Q8_K> q5_K_8x8_q8_K;
    // instance for Q6_K
    static const ggml::cpu::repack::tensor_traits<block_q6_K, 4, 8, GGML_TYPE_Q8_K> q6_K_8x4_q8_K;
    static const ggml::cpu::repack::tensor_traits<block_q6_K, 8, 8, GGML_TYPE_Q8_K> q6_K_8x8_q8_K;
    // instance for Q2
    static const ggml::cpu::repack::tensor_traits<block_q2_K, 8, 8, GGML_TYPE_Q8_K> q2_K_8x8_q8_K;
    // instance for IQ4
    static const ggml::cpu::repack::tensor_traits<block_iq4_nl, 4, 4, GGML_TYPE_Q8_0> iq4_nl_4x4_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_iq4_nl, 8, 8, GGML_TYPE_Q8_0> iq4_nl_8x8_q8_0;
    // instance for MXFP4
    static const ggml::cpu::repack::tensor_traits<block_mxfp4, 4, 4, GGML_TYPE_Q8_0> mxfp4_4x4_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_mxfp4, 8, 8, GGML_TYPE_Q8_0> mxfp4_8x8_q8_0;
    // instance for Q8_0
    static const ggml::cpu::repack::tensor_traits<block_q8_0, 4, 4, GGML_TYPE_Q8_0> q8_0_4x4_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_q8_0, 8, 4, GGML_TYPE_Q8_0> q8_0_4x8_q8_0;

    if (cur->type == GGML_TYPE_Q4_0) {
        if (ggml_cpu_has_avx2() || (ggml_cpu_has_sve() && ggml_cpu_has_matmul_int8() && ggml_cpu_get_sve_cnt() == QK8_0)
                || (ggml_cpu_has_riscv_v() && (ggml_cpu_get_rvv_vlen() >= QK4_0))) {
            if (cur->ne[1] % 8 == 0) {
                return &q4_0_8x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 4 == 0) {
                return &q4_0_4x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &q4_0_4x4_q8_0;
            }
        }
    } else if (cur->type == GGML_TYPE_Q4_K) {
        if (ggml_cpu_has_avx2()) {
            if (cur->ne[1] % 8 == 0) {
                return &q4_K_8x8_q8_K;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 8 == 0) {
                return &q4_K_8x8_q8_K;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 8 == 0) {
                return &q4_K_8x4_q8_K;
            }
        }
    } else if (cur->type == GGML_TYPE_Q2_K) {
        if (ggml_cpu_has_avx512()) {
            if (cur->ne[1] % 8 == 0) {
                return &q2_K_8x8_q8_K;
            }
        }
    } else if (cur->type == GGML_TYPE_Q5_K) {
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 8 == 0) {
                return &q5_K_8x8_q8_K;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 8 == 0) {
                return &q5_K_8x4_q8_K;
            }
        }
    } else if (cur->type == GGML_TYPE_Q6_K) {
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 8 == 0) {
                return &q6_K_8x8_q8_K;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 8 == 0) {
                return &q6_K_8x4_q8_K;
            }
        }
    } else if (cur->type == GGML_TYPE_IQ4_NL) {
        if (ggml_cpu_has_avx2()) {
            if (cur->ne[1] % 8 == 0) {
                return &iq4_nl_8x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &iq4_nl_4x4_q8_0;
            }
        }
    } else if (cur->type == GGML_TYPE_MXFP4) {
        if (ggml_cpu_has_avx2()) {
            if (cur->ne[1] % 8 == 0) {
                return &mxfp4_8x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &mxfp4_4x4_q8_0;
            }
        }
    } else if (cur->type == GGML_TYPE_Q8_0) {
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 4 == 0) {
                return &q8_0_4x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &q8_0_4x4_q8_0;
            }
        }
    }
    return nullptr;
}

@am17an
Contributor

am17an commented Mar 9, 2026

Oh right, then technically I can create a separate repack for x86 and see if it helps. Thanks!

@am17an
Contributor

am17an commented Mar 9, 2026

Basically, the idea is that we currently do a fixup with shuffle and blend to get maddubs to work; we could instead use a 4-byte interleave (instead of 8) to feed maddubs directly.

@ggerganov
Member

Yes, it should be possible to specialize the repacks any way you need. It's just a balance of code complexity, and the lack of testing infrastructure also makes it a bit difficult to validate the repack implementations.

@am17an
Contributor

am17an commented Mar 9, 2026

Also, while we're on the topic: do you have any ideas for optimizing the mul-mat-id vec implementation used in hybrid inference? That path doesn't use repack, and it could benefit from the generally faster mul-mat-vec impl. Two ideas I have: one is to accumulate multiple rows together to reuse the src1 activation, and the other is to fuse the mul-mat with the gate (like we do in CUDA).

@ggerganov
Member

I did some work on this in #14918. It can be improved for sure, but offloading to the GPU is pretty much always better. I just don't see any interesting use cases for CPU inference, so it hasn't been a priority for me.

@am17an
Contributor

am17an commented Mar 9, 2026

I think the very relevant use-case is still --n-cpu-moe, which a lot of people use to run on consumer non-Apple hardware. (Speaking from the anecdotal evidence at r/localllama). So I will try to squeeze out something there (though pretty sure the data-transfer dominates)

@ggerganov
Member

I see - yes, the vec path for mmid is useful to optimize. But I haven't looked into that.
