Skip to content

UPSTREAM PR #17342: Throughput improvement for small batch sizes#1279

Open
loci-dev wants to merge 1 commit intomainfrom
loci/pr-17342-core-scaling-opt
Open

UPSTREAM PR #17342: Throughput improvement for small batch sizes#1279
loci-dev wants to merge 1 commit intomainfrom
loci/pr-17342-core-scaling-opt

Conversation

@loci-dev
Copy link

Note

Source pull request: ggml-org/llama.cpp#17342

I came across a core scaling issue while running llama-bench on large core count machine and small batch sizes. During the investigation I found cache-line contention causing scaling issue. This patch fixes the contention.

With this patch I've seen throughput improvement ranging from 2% to 44% while running with Qwen3 30B parameter model.

Results are obtained with following command,
$ llama-bench –m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128
Where "n" is number of threads.

Here are the results:

  1. Constant batch Size=16 with varying number of threads
Patched/Baseline
Threads TPS
1 1.00
2 1.00
4 1.00
8 1.00
16 1.03
32 1.09
64 1.16
96 1.20
  1. Constant number of threads=96 with varying batch size
Patched/Baseline
Batch Size TPS
1 1.00
2 1.44
4 1.34
8 1.27
16 1.20
32 1.16
64 1.11
96 1.07
128 1.05
512 1.02
1024 1.02

==== Test Results =====

100% tests passed, 0 tests failed out of 35

Label Time Summary:
main = 46.75 sec*proc (35 tests)

I greatly appreciate any feedback anyone has to get this patch accepted.
Thanks.

…-line contention (cache HITM)

This improves throughput for cases where threads have to wait due to lack work and causing process
to spend many cycles in a spin loop. This enables to update dynamic chunk counter with static stride
partitioning which further helps to eliminate shared counter.

* remove one barrier in sgemm()

* static stride partitioning
@loci-review
Copy link

loci-review bot commented Mar 21, 2026

Overview

This PR introduces a 10.6-13.9% performance improvement in CPU-based GEMM operations through thread synchronization refactoring. A single commit by Uttam Pawar replaces dynamic work-stealing with static stride partitioning, eliminating cache-line contention (HITM) and reducing barrier synchronization overhead.

Function counts: 110,075 total | 37 modified | 0 new | 0 removed

Power consumption changes:

  • build.bin.libggml-cpu.so: -0.856% (143,715.22 → 142,484.82 nJ)
  • build.bin.llama-run: 0.0% (220,177.33 nJ)
  • build.bin.libllama.so: 0.0% (238,701.46 nJ)
  • build.bin.llama-cvector-generator: 0.0% (260,184.15 nJ)
  • build.bin.llama-tts: 0.0% (265,737.27 nJ)
  • build.bin.llama-bench: 0.0% (52,095.81 nJ)
  • build.bin.llama-gguf-split: 0.0% (32,187.78 nJ)
  • build.bin.llama-llava-cli: 0.0% (277.24 nJ)
  • build.bin.llama-minicpmv-cli: 0.0% (277.24 nJ)
  • build.bin.llama-quantize: 0.0% (35,720.97 nJ)
  • build.bin.llama-qwen2vl-cli: 0.0% (277.24 nJ)
  • build.bin.llama-gemma3-cli: 0.0% (277.24 nJ)
  • build.bin.llama-tokenize: 0.0% (30,651.30 nJ)
  • build.bin.libggml-base.so: 0.0% (73,344.48 nJ)
  • build.bin.libggml.so: 0.0% (5,048.69 nJ)
  • build.bin.libmtmd.so: 0.0% (166,961.71 nJ)

Function Analysis

All 37 modified functions are template instantiations of tinyBLAS::gemm in sgemm.cpp, differing only in tile size parameters. These are performance-critical functions in the inference hot path, consuming 70-90% of LLM inference time.

Top impacted functions (all in build.bin.libggml-cpu.so):

  • gemm<4,1,2>: Response time 1,821.60 → 1,567.61 ns (-13.94%), Throughput 570.70 → 561.12 ns (-1.68%)
  • gemm<4,1,1>: Response time 1,823.95 → 1,569.96 ns (-13.93%), Throughput 573.05 → 563.47 ns (-1.67%)
  • gemm<4,1,4>: Response time 1,826.30 → 1,572.31 ns (-13.91%), Throughput 575.40 → 565.82 ns (-1.66%)
  • gemm<4,4,2>: Response time 2,451.70 → 2,190.30 ns (-10.66%), Throughput 630.35 → 620.77 ns (-1.52%)
  • gemm<4,4,1>: Response time 2,454.81 → 2,193.42 ns (-10.65%), Throughput 633.47 → 623.89 ns (-1.51%)

Source code changes: The optimization replaces while (job < nb_job) { ggml_barrier(); job = ggml_threadpool_chunk_add(); } with for (job = ith; job < nb_job; job += nth) { ggml_barrier(); }. This eliminates atomic counter updates (61 ns), removes ARM atomic operations (44 ns), and reduces barrier calls from 2 to 1 per iteration (saves 190 ns), totaling ~295 ns overhead elimination. The 254-261 ns measured improvements align with eliminated overhead. Core SIMD computation remains unchanged.

Flame Graph Comparison

Selected function: gemm<4,1,2> (largest response time improvement, clearest structural changes)

Base version:

Base version flame graph

Target version:

Target version flame graph

The base version shows ggml_threadpool_chunk_add (61 ns) and __aarch64_ldadd4_relax atomic operation (44 ns) in the execution path, plus an early ggml_barrier call (190 ns). The target version eliminates all three, retaining only the final barrier synchronization. This structural simplification accounts for the 254 ns (13.94%) improvement.

Additional Findings

ML inference impact: This optimization directly benefits CPU-based LLM inference workloads. Expected end-to-end improvements: 7-13% faster time-to-first-token and per-token latency, with greater benefits on high core-count systems (16+ cores) where cache contention is more severe. GPU inference is unaffected as it uses separate GEMM implementations.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from d997939 to 8527fd7 Compare March 27, 2026 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants