
Throughput improvement for small batch sizes #17342

Open
uttampc1 wants to merge 1 commit into ggml-org:master from uttampc1:core-scaling-opt

Conversation

@uttampc1

I came across a core-scaling issue while running llama-bench on a large core-count machine with small batch sizes. During the investigation I found cache-line contention causing the scaling problem. This patch fixes the contention.

With this patch I've seen throughput improvements ranging from 2% to 44% when running the Qwen3 30B parameter model.

Results were obtained with the following command, where "n" is the number of threads:
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128

Here are the results:

  1. Constant batch size = 16 with varying number of threads

     Threads   Patched/Baseline TPS
     1         1.00
     2         1.00
     4         1.00
     8         1.00
     16        1.03
     32        1.09
     64        1.16
     96        1.20

  2. Constant number of threads = 96 with varying batch size

     Batch size   Patched/Baseline TPS
     1            1.00
     2            1.44
     4            1.34
     8            1.27
     16           1.20
     32           1.16
     64           1.11
     96           1.07
     128          1.05
     512          1.02
     1024         1.02

===== Test Results =====

100% tests passed, 0 tests failed out of 35

Label Time Summary:
main = 46.75 sec*proc (35 tests)

I'd greatly appreciate any feedback to help get this patch accepted.
Thanks.

…-line contention (cache HITM)

This improves throughput for cases where threads have to wait due to lack of work, causing the process
to spend many cycles in a spin loop. It replaces the dynamic chunk counter with static stride
partitioning, which eliminates the shared counter.

* remove one barrier in sgemm()

* static stride partitioning
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 18, 2025
@uttampc1
Author

uttampc1 commented Dec 5, 2025

  • Commenting to get attention on this PR.

@uttampc1
Author

Friendly bump: could someone take a look when you have a moment? This PR passes CI, as you can see from the nightly tests. Happy to adjust or rework anything. Thanks!

@ddh0
Contributor

ddh0 commented Jan 22, 2026

cc @ggerganov

@ggerganov
Member

This loop is intentionally like this in order to work better with heterogeneous cores.

Also, we see only relative results, so we cannot conclude if this is better or not.

@uttampc1
Author

uttampc1 commented Feb 5, 2026

@ggerganov Thanks — agreed the loop was designed for heterogeneous cores.

Clarification:
The regression on 96C at small batches is driven by OpenMP barrier contention (libgomp), specifically gomp_barrier_wait_start/end. Perf shows these dominating due to cache-line bouncing at high thread counts.
The patch removes one synchronization point and converts the tight while-loop into a deterministic for-loop that partitions work per thread inside the parallel region. This keeps the scheduling behavior intact but avoids hitting the hot libgomp barrier path in the inner loop.

On the “relative only” concern:
I’ll attach perf snapshots with environment details showing a clear drop in gomp_barrier_wait_* samples after the patch.
If that direction sounds acceptable, I’ll update the PR with the perf report. A sanity check on a hybrid client CPU from the community would help confirm no regressions.

@uttampc1
Author

uttampc1 commented Mar 21, 2026

Additional information:
The following snippet from "perf c2c report" shows the false cache-line-sharing contention before and after the patch.

  • Baseline

       Cache line 0:
       0       53       93        1        0        2      0x583d269c5ec0
       
       Cache line 0 details:
       100.00%  100.00%    0.00%    0.00%    0.00%                 0x0     0       1      0x736978b8bb3e      4579      4271         0      146        79  [.] gomp_barrier_wait_start     libgomp.so.1.0.0  bar.h:98      0
         0.00%    0.00%  100.00%    0.00%  100.00%                 0x0     0       1      0x736978b8befd        20         4         0        3         3  [.] gomp_team_barrier_wait_end  libgomp.so.1.0.0  bar.c:91      0
    
  • With the patched version:

      Cache line 0:
      0       53       57        0        0        0      0x596be90f6580  
       
     Cache line 0 details:
     100.00%  100.00%    0.00%    0.00%    0.00%                 0x0     0       1      0x73482c7c6b3e      4267      3904         0      110        60  [.] gomp_barrier_wait_start  libgomp.so.1.0.0  bar.h:98      0
    

As the report shows, with the patched version the contended gomp_team_barrier_wait_end entry on that cache line is gone.
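For anyone wanting to reproduce this kind of report: perf c2c record / perf c2c report are real perf subcommands, while the benchmark binary and flags below are illustrative and would need adjusting to the local setup (kernel perf support and sufficient permissions are assumed).

```shell
# Record cache-to-cache (HITM) events while the benchmark runs.
perf c2c record -- ./llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t 96 -b 16

# Summarize contended cache lines and the symbols touching them.
perf c2c report
```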
