
ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod)#19360

Merged
Alcpz merged 5 commits into ggml-org:master from Alcpz:Alcpz/arm_q6_K_dotprod
Feb 10, 2026
Conversation

@Alcpz
Collaborator

@Alcpz Alcpz commented Feb 5, 2026

Same as #19356, but for Q6_K.

PR contents:

Same testing methodology: llama-cli output, outputs of the GEMM and GEMV kernels, and perplexity to double-check prompt processing.

Performance

  • Apple M4 Max (-mcpu=cortex-a76+dotprod+noi8mm+nosve)
| Model | Test | Repack OFF (t/s) | Repack ON (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| lfm2 1.2B Q6_K | pp512 | 221.62 | 686.86 | 3.10 |
| lfm2 1.2B Q6_K | tg128 | 150.49 | 184.11 | 1.22 |
| qwen3 8B Q6_K | pp512 | 30.10 | 94.76 | 3.15 |
| qwen3 8B Q6_K | tg128 | 23.34 | 30.02 | 1.29 |
  • Rpi5
| Model | Test | Repack OFF (t/s) | Repack ON (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| lfm2 350M Q6_K | pp512 | 86.21 | 224.06 | 2.60 |
| lfm2 350M Q6_K | tg128 | 39.45 | 38.51 | 0.98 |
| lfm2 700M Q6_K | pp512 | 39.93 | 105.82 | 2.65 |
| lfm2 700M Q6_K | tg128 | 19.08 | 17.83 | 0.93 |

Perplexity

| Model | Repack ON | Generic | Repack OFF |
| --- | --- | --- | --- |
| LFM2-1.2B | 16.7641 +/- 0.96182 | 16.7624 +/- 0.96229 | 16.7641 +/- 0.96182 |
| Qwen3-8B | 11.2017 +/- 0.68155 | 11.2060 +/- 0.68156 | 11.2017 +/- 0.68155 |

llama-cli

llama-cli using repack
build      : b7849-68e615e51
model      : LFM2-1.2B-Q6_K.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Brasil?

The capital of Brazil is Brasília. Located in the central part of the country, Brasília serves as the political, economic, and cultural center of Brazil. It was designed by the Brazilian architect Oscar Niemeyer and his associate Lúcio Costa, and it was officially inaugurated as the capital in 1960.

Brasília was chosen to be the capital primarily for its central location within Brazil, as well as to serve as a symbol of modernity and progress during Brazil's economic and political development. The city was built to house government institutions and is notable for its futuristic architecture and wide avenues, which were part of an ambitious urban planning project.

Brasília is not only the capital but also a UNESCO World Heritage Site, recognized for its innovative design and its role in shaping urban planning in the 20th century. The city is surrounded by the Green Belt, a vast area of forest that helps preserve the environment around it.

[ Prompt: 509.5 t/s | Generation: 169.9 t/s ]

> I thought it was Sao Paulo!

I apologize for the mistake. The capital of Brazil is actually Brasília, not Sao Paulo. Brasília was chosen as the capital in 1960 as part of a national plan to move the center of government away from the densely populated coastal cities and to promote development in the interior of the country.

Brasília is known for its modernist architecture and is the political, economic, and cultural heart of Brazil. It was designed by architects Lúcio Costa and Oscar Niemeyer, and it features a unique urban layout with wide avenues and planned neighborhoods.

While Sao Paulo is indeed the largest city in Brazil and a major economic hub, it is not the capital. If you have any more questions about Brazil or any other topic, feel free to ask!

[ Prompt: 452.1 t/s | Generation: 169.7 t/s ]
llama-cli using generic
build      : b7849-68e615e51
model      : LFM2-1.2B-Q6_K.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Brasil?

The capital of Brazil is Brasília. It is a city known for its modernist architecture, expansive green spaces, and as the seat of the Federal District of Brazil, the country’s political center. Brasília was designed by urban planner Lúcio Costa and was inaugurated as the capital in 1960, replacing Rio de Janeiro. The city is a symbol of Brazil's rapid modernization and development in the mid-20th century.

[ Prompt: 36.8 t/s | Generation: 29.6 t/s ]

> I thought it was Sao Paulo!

You're correct that Sao Paulo is often mistaken for being the capital of Brazil, but that's not the case. The capital of Brazil is Brasília. Sao Paulo is the largest city in Brazil in terms of population and is also a major economic center, but it is not the capital. Brasília was chosen specifically to house the federal government and other key institutions to promote development in the northern part of the country and to decentralize political power from the coastal states, particularly Rio de Janeiro.

[ Prompt: 35.9 t/s | Generation: 29.6 t/s ]

>

@Alcpz Alcpz requested a review from ggerganov as a code owner February 5, 2026 12:17
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 5, 2026
@Alcpz
Collaborator Author

Alcpz commented Feb 5, 2026

@tdakhran

@Alcpz Alcpz force-pushed the Alcpz/arm_q6_K_dotprod branch from 68e615e to 8d1d4b3 on February 9, 2026 09:18
Member

@ggerganov ggerganov left a comment

You should be able to merge now (use squash + merge).

@Alcpz Alcpz merged commit c03a5a4 into ggml-org:master Feb 10, 2026
78 checks passed
@Alcpz Alcpz deleted the Alcpz/arm_q6_K_dotprod branch February 10, 2026 17:06
UNUSED(bs);
UNUSED(nr);

float sumf[8];
Contributor


@Alcpz, this should use the templated parameter N.

Collaborator Author


Will address in a different PR. Thanks for flagging.

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
…ons (dotprod) (ggml-org#19360)

* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
…ons (dotprod) (ggml-org#19360)

* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
@am17an
Contributor

am17an commented Mar 8, 2026

May I know why we have the same repack format for all ISAs? Wouldn't it be better to specialize it per ISA? I see that, at least on x86, it could be made much faster if we could avoid the permute and shuffle. cc @ggerganov

@ggerganov
Member

What do you mean? I think we select the repack type per ISA here:

static const ggml::cpu::tensor_traits * ggml_repack_get_optimal_repack_type(const struct ggml_tensor * cur) {
    // instance for Q4
    static const ggml::cpu::repack::tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 4, GGML_TYPE_Q8_0> q4_0_4x8_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 8, GGML_TYPE_Q8_0> q4_0_8x8_q8_0;
    // instance for Q4_K
    static const ggml::cpu::repack::tensor_traits<block_q4_K, 4, 8, GGML_TYPE_Q8_K> q4_K_8x4_q8_K;
    static const ggml::cpu::repack::tensor_traits<block_q4_K, 8, 8, GGML_TYPE_Q8_K> q4_K_8x8_q8_K;
    // instance for Q5_K
    static const ggml::cpu::repack::tensor_traits<block_q5_K, 4, 8, GGML_TYPE_Q8_K> q5_K_8x4_q8_K;
    static const ggml::cpu::repack::tensor_traits<block_q5_K, 8, 8, GGML_TYPE_Q8_K> q5_K_8x8_q8_K;
    // instance for Q6_K
    static const ggml::cpu::repack::tensor_traits<block_q6_K, 4, 8, GGML_TYPE_Q8_K> q6_K_8x4_q8_K;
    static const ggml::cpu::repack::tensor_traits<block_q6_K, 8, 8, GGML_TYPE_Q8_K> q6_K_8x8_q8_K;
    // instance for Q2
    static const ggml::cpu::repack::tensor_traits<block_q2_K, 8, 8, GGML_TYPE_Q8_K> q2_K_8x8_q8_K;
    // instance for IQ4
    static const ggml::cpu::repack::tensor_traits<block_iq4_nl, 4, 4, GGML_TYPE_Q8_0> iq4_nl_4x4_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_iq4_nl, 8, 8, GGML_TYPE_Q8_0> iq4_nl_8x8_q8_0;
    // instance for MXFP4
    static const ggml::cpu::repack::tensor_traits<block_mxfp4, 4, 4, GGML_TYPE_Q8_0> mxfp4_4x4_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_mxfp4, 8, 8, GGML_TYPE_Q8_0> mxfp4_8x8_q8_0;
    // instance for Q8_0
    static const ggml::cpu::repack::tensor_traits<block_q8_0, 4, 4, GGML_TYPE_Q8_0> q8_0_4x4_q8_0;
    static const ggml::cpu::repack::tensor_traits<block_q8_0, 8, 4, GGML_TYPE_Q8_0> q8_0_4x8_q8_0;

    if (cur->type == GGML_TYPE_Q4_0) {
        if (ggml_cpu_has_avx2() || (ggml_cpu_has_sve() && ggml_cpu_has_matmul_int8() && ggml_cpu_get_sve_cnt() == QK8_0)
                || (ggml_cpu_has_riscv_v() && (ggml_cpu_get_rvv_vlen() >= QK4_0))) {
            if (cur->ne[1] % 8 == 0) {
                return &q4_0_8x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 4 == 0) {
                return &q4_0_4x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &q4_0_4x4_q8_0;
            }
        }
    } else if (cur->type == GGML_TYPE_Q4_K) {
        if (ggml_cpu_has_avx2()) {
            if (cur->ne[1] % 8 == 0) {
                return &q4_K_8x8_q8_K;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 8 == 0) {
                return &q4_K_8x8_q8_K;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 8 == 0) {
                return &q4_K_8x4_q8_K;
            }
        }
    } else if (cur->type == GGML_TYPE_Q2_K) {
        if (ggml_cpu_has_avx512()) {
            if (cur->ne[1] % 8 == 0) {
                return &q2_K_8x8_q8_K;
            }
        }
    } else if (cur->type == GGML_TYPE_Q5_K) {
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 8 == 0) {
                return &q5_K_8x8_q8_K;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 8 == 0) {
                return &q5_K_8x4_q8_K;
            }
        }
    } else if (cur->type == GGML_TYPE_Q6_K) {
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 8 == 0) {
                return &q6_K_8x8_q8_K;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 8 == 0) {
                return &q6_K_8x4_q8_K;
            }
        }
    } else if (cur->type == GGML_TYPE_IQ4_NL) {
        if (ggml_cpu_has_avx2()) {
            if (cur->ne[1] % 8 == 0) {
                return &iq4_nl_8x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &iq4_nl_4x4_q8_0;
            }
        }
    } else if (cur->type == GGML_TYPE_MXFP4) {
        if (ggml_cpu_has_avx2()) {
            if (cur->ne[1] % 8 == 0) {
                return &mxfp4_8x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &mxfp4_4x4_q8_0;
            }
        }
    } else if (cur->type == GGML_TYPE_Q8_0) {
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 4 == 0) {
                return &q8_0_4x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &q8_0_4x4_q8_0;
            }
        }
    }
    return nullptr;
}

@am17an
Contributor

am17an commented Mar 9, 2026

Oh right, then technically I can create a separate repack for x86 and see if it helps. Thanks!

@am17an
Contributor

am17an commented Mar 9, 2026

Basically, the idea is that we currently do a fixup with shuffle and blend to get maddubs to work; we could instead use a 4-byte interleave (instead of 8) to feed maddubs directly.

@ggerganov
Member

Yes, it should be possible to specialize the repacks any way you need. It's just a balance of code complexity, and the lack of testing infrastructure also makes it a bit difficult to validate the repack implementations.

@am17an
Contributor

am17an commented Mar 9, 2026

Also, while we're on the topic: do you have any ideas for optimizing the mul-mat-id vec implementation used in hybrid inference? That path doesn't use repack, and it could benefit from the generally faster mul-mat-vec impl. Two ideas I have: one is to accumulate multiple rows together to reuse the src1 activation, and the other is to fuse the mul-mat with the gate (like we do in CUDA).

@ggerganov
Member

I did some work on this in #14918. It can be improved for sure, but offloading to the GPU is pretty much always better. I just don't see any interesting use cases for CPU inference, so it hasn't been a priority for me.

@am17an
Contributor

am17an commented Mar 9, 2026

I think the very relevant use-case is still --n-cpu-moe, which a lot of people use to run on consumer non-Apple hardware. (Speaking from the anecdotal evidence at r/localllama). So I will try to squeeze out something there (though pretty sure the data-transfer dominates)

@ggerganov
Member

I see - yes, the vec path for mmid is useful to optimize. But I haven't looked into that.
