UPSTREAM PR #17030: ggml-cpu: handle 3d tensors in repack mat_mul (#94)
Conversation
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: PR #94 - 3D Tensor Support in Matrix Multiplication

Overview
PR #94 introduces 3D tensor support for batched matrix multiplication operations in the GGML CPU backend, specifically targeting models like LFM2 that require batched operations.

Key Findings
Performance Impact:
Core Function Impact:
Power Consumption Analysis:
Technical Analysis:
Implementation Details:
Actionable Recommendations:

The modifications successfully enable batched matrix operations for advanced model architectures while introducing acceptable performance overhead in specialized code paths.
Force-pushed eadb483 to 0b86651
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: PR #94 - 3D Tensor Support

Overview
PR #94 introduces 3D tensor support for repack matrix multiplication operations, specifically targeting models like LFM2 that require batched operations.

Key Findings
Performance Impact:
Core Function Impact:
Power Consumption Analysis:
Technical Analysis:
Implementation Changes:
Scope Assessment:
Force-pushed b1ace60 to bff7103

Force-pushed 733e776 to 2c7fec2
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary

Overview
Analysis of PR #94 implementing 3D tensor support in repack matrix multiplication reveals measurable performance impacts in the CPU backend, with the primary changes affecting quantized matrix operations rather than core inference functions.

Key Findings
Performance Metrics:
Core Function Impact:
Power Consumption:
Technical Analysis:
Affected Components:
Actionable Recommendations:

The changes successfully address correctness for 3D tensor operations while maintaining inference performance for standard workloads.
Force-pushed 9ea0205 to 1308d3f
Mirrored from ggml-org/llama.cpp#17030
While testing #16739, perplexities for LFM2 skyrocketed. @ggerganov pointed out that some matrix shapes would probably not be supported.
LFM2 has some layers that have two batches, so MAT_MULs were only done partially, leading to incorrect results. See ggml-org/llama.cpp#16739 (comment)
This patch adds basic support for tensors with ne2 > 1, using very naive chunking based on the non-repack MUL_MAT. Perplexities using this patch:
I can provide logs for other models if needed.