UPSTREAM PR #17241: ggml-cpu: handle 3d tensors in repack mat_mul by DajanaV · Pull Request #191 · auroralabs-loci/llama.cpp

DajanaV · 2025-11-13T15:36:06Z

This is a continuation of #17030 after a performance regression was reported.

Perplexity Comparison (Repack vs Non-Repack)

Command:

MODELS="unsloth/Qwen3-8B-128K-GGUF:Q4_0 ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF:Q4_0 LiquidAI/LFM2-700M-GGUF:Q4_0 LiquidAI/LFM2-1.2B-GGUF:Q4_0"
for d in build-cpu-aarm64 build-cpu-aarm64-norepack; do
    for model in $MODELS; do
        ${d}/bin/llama-perplexity -hf "$model" -f ./wikitext-2-raw/wiki.test.raw --chunks 20 -dev none
    done
done

Model	Repack PPL	Non-Repack PPL
LFM2-700M Q4_0	20.3324 ± 0.87133	20.3324 ± 0.87133
LFM2-1.2B Q4_0	15.7524 ± 0.63304	15.7524 ± 0.63304
Meta-Llama-3.1-8B-Instruct Q4_0	8.6578 ± 0.30323	8.6578 ± 0.30323
Qwen3-8B-128K Q4_0	11.1735 ± 0.48175	11.1735 ± 0.48175

Llama-bench

model	size	params	backend	threads	fa	test	t/s
qwen3 8B Q4_0	4.45 GiB	8.19 B	CPU	8	1	pp256	148.88 ± 0.60
qwen3 8B Q4_0	4.45 GiB	8.19 B	CPU	8	1	tg128	47.71 ± 0.35
llama 8B Q4_0	5.61 GiB	8.03 B	CPU	8	1	pp256	151.26 ± 1.94
llama 8B Q4_0	5.61 GiB	8.03 B	CPU	8	1	tg128	43.47 ± 0.78
lfm2 350M Q4_0	206.87 MiB	354.48 M	CPU	8	1	pp256	3248.97 ± 32.82
lfm2 350M Q4_0	206.87 MiB	354.48 M	CPU	8	1	tg128	562.68 ± 7.35
lfm2 700M Q4_0	423.37 MiB	742.49 M	CPU	8	1	pp256	1585.66 ± 13.60
lfm2 700M Q4_0	423.37 MiB	742.49 M	CPU	8	1	tg128	349.23 ± 2.42

build: c77bafd (6967) THIS PR

model	size	params	backend	threads	fa	test	t/s
qwen3 8B Q4_0	4.45 GiB	8.19 B	CPU	8	1	pp256	148.80 ± 0.18
qwen3 8B Q4_0	4.45 GiB	8.19 B	CPU	8	1	tg128	48.50 ± 0.81
llama 8B Q4_0	5.61 GiB	8.03 B	CPU	8	1	pp256	160.24 ± 0.76
llama 8B Q4_0	5.61 GiB	8.03 B	CPU	8	1	tg128	45.60 ± 0.17
lfm2 350M Q4_0	206.87 MiB	354.48 M	CPU	8	1	pp256	3269.37 ± 22.99
lfm2 350M Q4_0	206.87 MiB	354.48 M	CPU	8	1	tg128	595.18 ± 3.34
lfm2 700M Q4_0	423.37 MiB	742.49 M	CPU	8	1	pp256	1606.13 ± 8.51
lfm2 700M Q4_0	423.37 MiB	742.49 M	CPU	8	1	tg128	362.24 ± 3.19

build: 2776db6 (7047) MASTER
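
For readers comparing the two llama-bench tables, here is a rough sketch that flags which differences exceed the reported ± ranges. It assumes each "build:" line labels the table directly above it, so the first table is this PR and the second is master. The function name and the "sum of both ± values" threshold are illustrative assumptions, not a statistical test and not part of llama-bench.

```python
# Values copied from the two llama-bench tables in this comment.
# (model, test) -> ((pr_mean, pr_err), (master_mean, master_err)), in tokens/s.
RESULTS = {
    ("qwen3 8B Q4_0", "pp256"): ((148.88, 0.60), (148.80, 0.18)),
    ("qwen3 8B Q4_0", "tg128"): ((47.71, 0.35), (48.50, 0.81)),
    ("llama 8B Q4_0", "pp256"): ((151.26, 1.94), (160.24, 0.76)),
    ("llama 8B Q4_0", "tg128"): ((43.47, 0.78), (45.60, 0.17)),
    ("lfm2 350M Q4_0", "pp256"): ((3248.97, 32.82), (3269.37, 22.99)),
    ("lfm2 350M Q4_0", "tg128"): ((562.68, 7.35), (595.18, 3.34)),
    ("lfm2 700M Q4_0", "pp256"): ((1585.66, 13.60), (1606.13, 8.51)),
    ("lfm2 700M Q4_0", "tg128"): ((349.23, 2.42), (362.24, 3.19)),
}


def compare(pr, master):
    """Return (percent change of PR vs master, True if the gap exceeds both ± values summed)."""
    (pr_mean, pr_err), (master_mean, master_err) = pr, master
    delta_pct = (pr_mean - master_mean) / master_mean * 100.0
    exceeds_noise = abs(pr_mean - master_mean) > pr_err + master_err
    return delta_pct, exceeds_noise


if __name__ == "__main__":
    for (model, test), (pr, master) in RESULTS.items():
        delta_pct, exceeds_noise = compare(pr, master)
        note = "outside reported spread" if exceeds_noise else "within reported spread"
        print(f"{model:16s} {test:6s} {delta_pct:+6.2f}%  {note}")
```

A negative percentage means the PR build produced fewer tokens per second than master for that row.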

loci-review · 2025-11-13T16:14:39Z


Performance Analysis Summary: PR #191 - 3D Tensor Support in GGML CPU Repack

Overview

Pull Request #191 introduces 3D tensor support to the GGML CPU repack matrix multiplication system. The changes enable processing of transformer models with batch dimensions while maintaining numerical accuracy, but introduce measurable performance overhead in quantized operations.

Key Findings

Performance Impact:

Largest throughput degradation: the forward_mul_mat function shows a +37.52% increase in the throughput metric (2489 ns → 3423 ns). The metric is reported as time in nanoseconds, so the increase represents a performance regression in the core matrix multiplication path
Response time improvement: the quantize_row_iq4_nl function shows a 25.31% reduction in response time (98 ns → 73 ns), indicating optimization in quantization operations
Power consumption: build.bin.libggml-cpu.so shows an increase of +1.275%, adding approximately 1936 nanojoules
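
The percentages above are relative changes of the "after" timing against the "before" timing. A minimal sketch of that arithmetic follows, using only the rounded nanosecond values quoted in this comment. The helper name is hypothetical, and because the inputs are rounded, the results match the reported figures only approximately.

```python
def percent_change(before_ns, after_ns):
    """Relative change of a timing, in percent (positive = slower)."""
    return (after_ns - before_ns) / before_ns * 100.0


# Rounded timings quoted in the report.
forward_mul_mat_change = percent_change(2489, 3423)
quantize_row_iq4_nl_change = percent_change(98, 73)
forward_mul_mat_delta_ns = 3423 - 2489

if __name__ == "__main__":
    print(f"forward_mul_mat:     {forward_mul_mat_change:+.2f}% ({forward_mul_mat_delta_ns} ns)")
    print(f"quantize_row_iq4_nl: {quantize_row_iq4_nl_change:+.2f}%")
```

The sign convention matters here: both metrics are times, so a positive change is a slowdown and a negative change is a speedup.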

Core Function Impact:
The changes affect critical inference components within the GGML backend system. The forward_mul_mat function is part of the high-performance inference pipeline for quantized matrix operations, directly impacting computational efficiency for IQ4_NL quantized models.

Inference Performance Impact:
Based on the reference model performance (ollama://smollm:135m on 12th Gen Intel i7-1255U), the 934 ns increase in the forward_mul_mat throughput metric (2489 ns → 3423 ns) may reduce tokens per second for workloads that rely heavily on IQ4_NL quantization, though the impact varies by model architecture and batch size.

Technical Analysis:

Flame Graph: Shows shallow execution structure with 80.6% time in main function body, indicating efficient core logic despite added complexity
CFG Comparison: Reveals identical control flow structure with performance improvements from compiler optimizations and better memory layout (50.6% improvement in assert path timing)
Code Review: Identifies increased computational complexity from 3D tensor indexing, nested loop overhead, and more complex pointer arithmetic as primary sources of throughput degradation
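
To illustrate what the code-review item means by 3D indexing and nested-loop overhead, here is a generic sketch of a batched matrix multiplication over flat row-major buffers. It is not the ggml implementation and does not model repacked quantized blocks, chunking, or threading. The function name and the layout are assumptions chosen for illustration.

```python
def mul_mat_3d(a, b, batch, m, k, n):
    """Multiply `batch` pairs of (m x k) and (k x n) matrices stored in flat row-major lists."""
    out = [0.0] * (batch * m * n)
    for i3 in range(batch):  # extra outer loop added by the third dimension
        # Per-batch base offsets: the additional pointer arithmetic of the 3D case.
        a_off = i3 * m * k
        b_off = i3 * k * n
        o_off = i3 * m * n
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[a_off + i * k + p] * b[b_off + p * n + j]
                out[o_off + i * n + j] = acc
    return out


if __name__ == "__main__":
    # Two batches, each a (1 x 2) times (2 x 1) product.
    print(mul_mat_3d([1, 2, 3, 4], [5, 6, 7, 8], batch=2, m=1, k=2, n=1))
```

With batch equal to 1 the offsets are all zero and the loop nest reduces to the ordinary 2D case, which is why the added cost shows up only as extra index computation and one more loop level.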

Affected Binaries:

build.bin.libggml-cpu.so: Primary impact with measurable power consumption increase
All other binaries show no performance changes

The implementation successfully enables 3D tensor processing while maintaining backward compatibility, with performance trade-offs concentrated in specific quantized operation paths.

Alcpz added 7 commits November 5, 2025 18:03

ggml-cpu: handle 3d tensors in repack mul_mat

950671d

Removed unnecessary branch, removed need for <algorithm>

0b86651

Fixed dst_ptr pointer in chunk + clang_format

75c7fd5

GGML_ASSERT to check wdata within bounds

edb7f63

Accidental ggml.h inclusion

b56d0ac

Improved GGML_ASSERT on wdata boundaries

d1938ad

Address performance regression in Qwen and llama.cpp due to chunking

c77bafd



DajanaV wants to merge 7 commits into main from upstream-PR17241-branch_Alcpz-Alcpz/batched_repack_mul_mat