
Throughput improvement for small batch sizes #17342

Open
uttampc1 wants to merge 1 commit into ggml-org:master from uttampc1:core-scaling-opt

Conversation

@uttampc1

I came across a core-scaling issue while running llama-bench on a large core-count machine with small batch sizes. During the investigation I found cache-line contention causing the scaling problem. This patch fixes the contention.

With this patch I've seen throughput improvements ranging from 2% to 44% when running the Qwen3 30B parameter model.

Results were obtained with the following command, where "n" is the number of threads:
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128

Here are the results:

  1. Constant batch size = 16 with varying number of threads

     Threads   Patched/Baseline TPS
     1         1.00
     2         1.00
     4         1.00
     8         1.00
     16        1.03
     32        1.09
     64        1.16
     96        1.20

  2. Constant number of threads = 96 with varying batch size

     Batch size   Patched/Baseline TPS
     1            1.00
     2            1.44
     4            1.34
     8            1.27
     16           1.20
     32           1.16
     64           1.11
     96           1.07
     128          1.05
     512          1.02
     1024         1.02

===== Test Results =====

100% tests passed, 0 tests failed out of 35

Label Time Summary:
main = 46.75 sec*proc (35 tests)

I'd greatly appreciate any feedback to help get this patch accepted.
Thanks.

…-line contention (cache HITM)

This improves throughput for cases where threads have to wait due to lack of work, causing the process
to spend many cycles in a spin loop. It replaces the dynamic chunk counter with static stride
partitioning, which eliminates the shared counter.

* remove one barrier in sgemm()

* static stride partitioning
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 18, 2025
@uttampc1
Author

uttampc1 commented Dec 5, 2025

  • Commenting to get attention on this PR.

@uttampc1
Author

Friendly bump: could someone take a look when you have a moment? This PR passes CI, as you can see from the nightly tests. Happy to adjust or rework anything. Thanks!

@ddh0
Contributor

ddh0 commented Jan 22, 2026

cc @ggerganov

@ggerganov
Member

This loop is intentionally like this in order to work better with heterogeneous cores.

Also, we see only relative results, so we cannot conclude if this is better or not.

@uttampc1
Author

uttampc1 commented Feb 5, 2026

@ggerganov Thanks — agreed the loop was designed for heterogeneous cores.

Clarification:
The regression on 96C at small batches is driven by OpenMP barrier contention (libgomp), specifically gomp_barrier_wait_start/end. Perf shows these dominating due to cache-line bouncing at high thread counts.
The patch removes one synchronization point and converts the tight while-loop into a deterministic for-loop that partitions work per thread inside the parallel region. This keeps the scheduling behavior intact but avoids hitting the hot libgomp barrier path in the inner loop.

On the “relative only” concern:
I’ll attach perf snapshots with environment details showing a clear drop in gomp_barrier_wait_* samples after the patch.
If that direction sounds acceptable, I’ll update the PR with the perf report. A sanity check on a hybrid client CPU from the community would help confirm no regressions.

@uttampc1
Author

uttampc1 commented Mar 21, 2026

Additional information:
The following snippet from "perf c2c report" shows the false cache-line-sharing contention before and after the patch.

  • Baseline

       Cache line 0:
       0       53       93        1        0        2      0x583d269c5ec0
       
       Cache line 0 details:
       100.00%  100.00%    0.00%    0.00%    0.00%                 0x0     0       1      0x736978b8bb3e      4579      4271         0      146        79  [.] gomp_barrier_wait_start     libgomp.so.1.0.0  bar.h:98      0
         0.00%    0.00%  100.00%    0.00%  100.00%                 0x0     0       1      0x736978b8befd        20         4         0        3         3  [.] gomp_team_barrier_wait_end  libgomp.so.1.0.0  bar.c:91      0
    
  • With the patched version:

      Cache line 0:
      0       53       57        0        0        0      0x596be90f6580  
       
     Cache line 0 details:
     100.00%  100.00%    0.00%    0.00%    0.00%                 0x0     0       1      0x73482c7c6b3e      4267      3904         0      110        60  [.] gomp_barrier_wait_start  libgomp.so.1.0.0  bar.h:98      0
    

As the report shows, with the patched version the contended gomp_team_barrier_wait_end entry on that cache line is gone.
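For anyone wanting to reproduce this kind of report: perf c2c record / perf c2c report are real perf subcommands, while the benchmark binary and flags below are illustrative and would need adjusting to the local setup (kernel perf support and sufficient permissions are assumed).

```shell
# Record cache-to-cache (HITM) events while the benchmark runs.
perf c2c record -- ./llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t 96 -b 16

# Summarize contended cache lines and the symbols touching them.
perf c2c report
```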
