UPSTREAM PR #17342: Throughput improvement for small batch sizes by loci-dev · Pull Request #1279 · auroralabs-loci/llama.cpp

loci-dev · 2026-03-21T02:59:33Z

Note

Source pull request: ggml-org/llama.cpp#17342

I came across a core scaling issue while running llama-bench on large core count machine and small batch sizes. During the investigation I found cache-line contention causing scaling issue. This patch fixes the contention.

With this patch I've seen throughput improvement ranging from 2% to 44% while running with Qwen3 30B parameter model.

Results are obtained with following command,
$ llama-bench –m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128
Where "n" is number of threads.

Here are the results:

Constant batch Size=16 with varying number of threads

	Patched/Baseline
Threads	TPS
1	1.00
2	1.00
4	1.00
8	1.00
16	`1.03`
32	`1.09`
64	`1.16`
96	`1.20`

Constant number of threads=96 with varying batch size

	Patched/Baseline
Batch Size	TPS
1	1.00
2	`1.44`
4	`1.34`
8	`1.27`
16	`1.20`
32	`1.16`
64	`1.11`
96	`1.07`
128	`1.05`
512	`1.02`
1024	`1.02`

==== Test Results =====

tee -a /tmp/results/ctest_debug-ctest.log
ctest --output-on-failure -L main -E 'test-opt|test-backend-ops'
Test project llama.cpp/build-ci-debug
Start 1: test-tokenizer-0-bert-bge
1/35 Test UPSTREAM PR #16634: metal : initial Metal4 tensor API support #1: test-tokenizer-0-bert-bge ......... Passed 0.11 sec
Start 2: test-tokenizer-0-command-r
2/35 Test UPSTREAM PR #16816: [bug fix] initialise buffer.device in ggml_hexagon_session #2: test-tokenizer-0-command-r ........ Passed 1.37 sec
Start 3: test-tokenizer-0-deepseek-coder
3/35 Test UPSTREAM PR #15805: Add conv2d Implicit GEMM #3: test-tokenizer-0-deepseek-coder ... Passed 0.24 sec
Start 4: test-tokenizer-0-deepseek-llm
4/35 Test UPSTREAM PR #16817: Implement SparseK Attention mechanism — new GGML operator with CPU backend (GPU planned next) #4: test-tokenizer-0-deepseek-llm ..... Passed 0.60 sec
Start 5: test-tokenizer-0-falcon
5/35 Test UPSTREAM PR #16536: Vulkan MMQ Integer Dot Refactor and K-Quant support #5: test-tokenizer-0-falcon ........... Passed 0.35 sec
Start 6: test-tokenizer-0-gpt-2
6/35 Test UPSTREAM PR #16829: cpu: introduce chunking for flash attention #6: test-tokenizer-0-gpt-2 ............ Passed 0.27 sec
Start 7: test-tokenizer-0-llama-bpe
7/35 Test UPSTREAM PR #16828: CUDA: Conv2d tensor core #7: test-tokenizer-0-llama-bpe ........ Passed 0.90 sec
Start 8: test-tokenizer-0-llama-spm
8/35 Test UPSTREAM PR #15277: arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… #8: test-tokenizer-0-llama-spm ........ Passed 0.10 sec
Start 9: test-tokenizer-0-mpt
9/35 Test UPSTREAM PR #16831: Model: Minimax M2 #9: test-tokenizer-0-mpt .............. Passed 0.27 sec
Start 10: test-tokenizer-0-phi-3
10/35 Test UPSTREAM PR #16796: vulkan: Fix crash when FP16 mul_mat accumulation is not supported #10: test-tokenizer-0-phi-3 ............ Passed 0.10 sec
Start 11: test-tokenizer-0-qwen2
11/35 Test UPSTREAM PR #16833: cpu: introduce chunking for repack matmuls and enable matmul-id chunking #11: test-tokenizer-0-qwen2 ............ Passed 0.95 sec
Start 12: test-tokenizer-0-refact
12/35 Test UPSTREAM PR #16827: Massively Improved ROCm/HIP rocWMMA Performance (pp and tg) #12: test-tokenizer-0-refact ........... Passed 0.27 sec
Start 13: test-tokenizer-0-starcoder
13/35 Test UPSTREAM PR #16826: vulkan: remove the need for the dryrun #13: test-tokenizer-0-starcoder ........ Passed 0.27 sec
Start 14: test-tokenizers-ggml-vocabs
14/35 Test UPSTREAM PR #16872: vulkan: disable spirv-opt for rope shaders #14: test-tokenizers-ggml-vocabs ....... Passed 6.92 sec
Start 15: test-sampling
15/35 Test UPSTREAM PR #16868: vulkan: fuse mul_mat+add and mul_mat_id+add_id #15: test-sampling ..................... Passed 3.66 sec
Start 16: test-grammar-parser
16/35 Test UPSTREAM PR #16574: mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter) #16: test-grammar-parser ............... Passed 0.00 sec
Start 17: test-grammar-integration
17/35 Test UPSTREAM PR #16858: CUDA: Remove unneded bias/gate dims in fused mmvq #17: test-grammar-integration .......... Passed 0.02 sec
Start 18: test-llama-grammar
18/35 Test UPSTREAM PR #16891: ggml-cpu : bicubic interpolation #18: test-llama-grammar ................ Passed 0.00 sec
Start 19: test-chat
19/35 Test UPSTREAM PR #16252: Refactor llama-model.cpp #19: test-chat ......................... Passed 7.52 sec
Start 20: test-json-schema-to-grammar
20/35 Test UPSTREAM PR #16896: model: add Granite Hybrid nano #20: test-json-schema-to-grammar ....... Passed 1.50 sec
Start 21: test-tokenizer-1-llama-spm
21/35 Test UPSTREAM PR #16884: CUDA: fuse rope + set_rows #21: test-tokenizer-1-llama-spm ........ Passed 0.44 sec
Start 22: test-chat-parser
22/35 Test UPSTREAM PR #16784: webui: auto-refresh /props on inference start to resync model metadata #22: test-chat-parser .................. Passed 0.01 sec
Start 23: test-chat-template
23/35 Test UPSTREAM PR #16757: webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe #23: test-chat-template ................ Passed 0.70 sec
Start 24: test-json-partial
24/35 Test UPSTREAM PR #16618: webui: add OAI-Compat Harmony tool-call streaming visualization and persistence in chat UI #24: test-json-partial ................. Passed 0.01 sec
Start 25: test-log
25/35 Test UPSTREAM PR #16900: Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support #25: test-log .......................... Passed 0.02 sec
Start 26: test-regex-partial
26/35 Test UPSTREAM PR #16899: vulkan: Fix multi_add invalid descriptor usage #26: test-regex-partial ................ Passed 0.01 sec
Start 27: test-thread-safety
27/35 Test UPSTREAM PR #16901: Add a setting to display message generation statistics #27: test-thread-safety ................ Passed 1.03 sec
Start 28: test-arg-parser
28/35 Test UPSTREAM PR #16805: ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 #28: test-arg-parser ................... Passed 0.25 sec
Start 29: test-gguf
29/35 Test UPSTREAM PR #16907: Vulkan: improve mul_mat_vec_iq1_m #29: test-gguf ......................... Passed 0.14 sec
Start 32: test-barrier
30/35 Test UPSTREAM PR #16906: model: add Janus Pro for image understanding #32: test-barrier ...................... Passed 1.44 sec
Start 33: test-quantize-fns
31/35 Test UPSTREAM PR #14891: imatrix: calculate activation-based statistics for new format (GGUF) imatrices #33: test-quantize-fns ................. Passed 16.97 sec
Start 34: test-quantize-perf
32/35 Test UPSTREAM PR #15550: quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest error #34: test-quantize-perf ................ Passed 0.22 sec
Start 35: test-rope
33/35 Test UPSTREAM PR #16917: CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops #35: test-rope ......................... Passed 0.07 sec
Start 36: test-mtmd-c-api
34/35 Test UPSTREAM PR #16923: Add e2e tests for embedding raw flag #36: test-mtmd-c-api ................... Passed 0.00 sec
Start 37: test-alloc
35/35 Test UPSTREAM PR #16922: disable D_TYPE same as A_TYPE cast for rope shaders #37: test-alloc ........................ Passed 0.00 sec

100% tests passed, 0 tests failed out of 35

Label Time Summary:
main = 46.75 sec*proc (35 tests)

I greatly appreciate any feedback anyone has to get this patch accepted.
Thanks.

…-line contention (cache HITM) This improves throughput for cases where threads have to wait due to lack work and causing process to spend many cycles in a spin loop. This enables to update dynamic chunk counter with static stride partitioning which further helps to eliminate shared counter. * remove one barrier in sgemm() * static stride partitioning

loci-review · 2026-03-21T03:55:11Z

Overview

This PR introduces a 10.6-13.9% performance improvement in CPU-based GEMM operations through thread synchronization refactoring. A single commit by Uttam Pawar replaces dynamic work-stealing with static stride partitioning, eliminating cache-line contention (HITM) and reducing barrier synchronization overhead.

Function counts: 110,075 total | 37 modified | 0 new | 0 removed

Power consumption changes:

build.bin.libggml-cpu.so: -0.856% (143,715.22 → 142,484.82 nJ)
build.bin.llama-run: 0.0% (220,177.33 nJ)
build.bin.libllama.so: 0.0% (238,701.46 nJ)
build.bin.llama-cvector-generator: 0.0% (260,184.15 nJ)
build.bin.llama-tts: 0.0% (265,737.27 nJ)
build.bin.llama-bench: 0.0% (52,095.81 nJ)
build.bin.llama-gguf-split: 0.0% (32,187.78 nJ)
build.bin.llama-llava-cli: 0.0% (277.24 nJ)
build.bin.llama-minicpmv-cli: 0.0% (277.24 nJ)
build.bin.llama-quantize: 0.0% (35,720.97 nJ)
build.bin.llama-qwen2vl-cli: 0.0% (277.24 nJ)
build.bin.llama-gemma3-cli: 0.0% (277.24 nJ)
build.bin.llama-tokenize: 0.0% (30,651.30 nJ)
build.bin.libggml-base.so: 0.0% (73,344.48 nJ)
build.bin.libggml.so: 0.0% (5,048.69 nJ)
build.bin.libmtmd.so: 0.0% (166,961.71 nJ)

Function Analysis

All 37 modified functions are template instantiations of tinyBLAS::gemm in sgemm.cpp, differing only in tile size parameters. These are performance-critical functions in the inference hot path, consuming 70-90% of LLM inference time.

Top impacted functions (all in build.bin.libggml-cpu.so):

gemm<4,1,2>: Response time 1,821.60 → 1,567.61 ns (-13.94%), Throughput 570.70 → 561.12 ns (-1.68%)
gemm<4,1,1>: Response time 1,823.95 → 1,569.96 ns (-13.93%), Throughput 573.05 → 563.47 ns (-1.67%)
gemm<4,1,4>: Response time 1,826.30 → 1,572.31 ns (-13.91%), Throughput 575.40 → 565.82 ns (-1.66%)
gemm<4,4,2>: Response time 2,451.70 → 2,190.30 ns (-10.66%), Throughput 630.35 → 620.77 ns (-1.52%)
gemm<4,4,1>: Response time 2,454.81 → 2,193.42 ns (-10.65%), Throughput 633.47 → 623.89 ns (-1.51%)

Source code changes: The optimization replaces while (job < nb_job) { ggml_barrier(); job = ggml_threadpool_chunk_add(); } with for (job = ith; job < nb_job; job += nth) { ggml_barrier(); }. This eliminates atomic counter updates (61 ns), removes ARM atomic operations (44 ns), and reduces barrier calls from 2 to 1 per iteration (saves 190 ns), totaling ~295 ns overhead elimination. The 254-261 ns measured improvements align with eliminated overhead. Core SIMD computation remains unchanged.

Flame Graph Comparison

Selected function: gemm<4,1,2> (largest response time improvement, clearest structural changes)

Base version:

Target version:

The base version shows ggml_threadpool_chunk_add (61 ns) and __aarch64_ldadd4_relax atomic operation (44 ns) in the execution path, plus an early ggml_barrier call (190 ns). The target version eliminates all three, retaining only the final barrier synchronization. This structural simplification accounts for the 254 ns (13.94%) improvement.

Additional Findings

ML inference impact: This optimization directly benefits CPU-based LLM inference workloads. Expected end-to-end improvements: 7-13% faster time-to-first-token and per-token latency, with greater benefits on high core-count systems (16+ cores) where cache contention is more severe. GPU inference is unaffected as it uses separate GEMM implementations.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

loci-dev temporarily deployed to PROD__AL_DEMO March 21, 2026 02:59 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 9 times, most recently from d997939 to 8527fd7 Compare March 27, 2026 02:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #17342: Throughput improvement for small batch sizes#1279

UPSTREAM PR #17342: Throughput improvement for small batch sizes#1279
loci-dev wants to merge 1 commit intomainfrom
loci/pr-17342-core-scaling-opt

loci-dev commented Mar 21, 2026

Uh oh!

loci-review bot commented Mar 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Mar 21, 2026

Uh oh!

loci-review bot commented Mar 21, 2026

Overview

Function Analysis

Flame Graph Comparison

Additional Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants