
Add batch warmup to sweep-bench#375

Merged
ikawrakow merged 1 commit into main from ik/sweep_bench_warmup on May 12, 2025
Conversation

@ikawrakow
Owner

When using sweep-bench on CUDA, the PP performance for N_KV = 0 (i.e., the first PP run) is often lower than the measured PP performance for N_KV > 0. My guess is that this is due to having to find and load the required kernels from the cache of pre-compiled kernels, which can take time that is not negligible compared to the time it takes to compute the batch. For an example, see the graph in PR #374.

To prevent this misleading result, this PR adds the ability to also use a warm-up run with n_ubatch tokens. The option is off by default as computing a batch on the CPU for a large model can take a significant amount of time (but the measured performance is not affected by having done a batch warmup run). To turn it on, use

./bin/llama-sweep-bench --warmup-batch (or -wb) other_arguments
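For reference, a minimal sketch of what such a batch warm-up pass could look like (hypothetical code, not necessarily the PR's exact implementation; the helper name `warmup_batch` is made up, and only llama.cpp API calls that also appear in the common.cpp warmup quoted further down are used):

```cpp
#include <vector>
#include "llama.h"

// Hypothetical sketch: decode one full u-batch of BOS tokens once, then clear
// the KV cache and reset timings so the measured sweep is unaffected.
static void warmup_batch(llama_context * ctx, const llama_model * model, int n_ubatch) {
    std::vector<llama_token> tokens(n_ubatch, llama_token_bos(model));
    llama_decode(ctx, llama_batch_get_one(tokens.data(), n_ubatch, 0, 0));
    llama_synchronize(ctx);    // make sure all kernels have actually been loaded and run
    llama_kv_cache_clear(ctx); // discard the warm-up tokens from the KV cache
    llama_reset_timings(ctx);  // don't let the warm-up pollute the reported timings
}
```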

@ikawrakow ikawrakow requested a review from saood06 May 4, 2025 08:30
@saood06
Collaborator

saood06 commented May 4, 2025

Wouldn't it make sense to make this a global warmup option across bench and common (see this commit for when I affected all of them: 3702743)? The only other thing is that if you want the warmup MoE optimization of loading in all experts, we would need to make the way that happens more robust, as it is hacky and checks that the batch is exactly one token and that the token is the BOS (which would never happen normally), whereas a full batch is a normal occurrence.
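For illustration, the fragile check being described might look roughly like this (a hypothetical sketch, not the actual ik_llama.cpp code; `is_warmup` and the surrounding context are invented):

```cpp
// Hypothetical sketch of the heuristic described above: treat a batch as the
// warm-up pass only if it contains exactly one token and that token is BOS.
// An n_ubatch-sized warm-up batch would no longer match this check, so the
// "load all MoE experts" path would not trigger.
const bool is_warmup = batch.n_tokens == 1 && batch.token[0] == llama_token_bos(model);
if (is_warmup) {
    // load all MoE experts instead of only the ones routed to this single token
}
```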

@ikawrakow
Owner Author

Wouldn't it make sense to make this a global warmup option across bench and common

It would. The command line option is added to common, so the parameter is theoretically available to all examples using common. But I think improving warm-up in general could use a separate PR. Here I'm just addressing the need to have better benchmark results on CUDA (as I intend to add MMQ for all IQK quants).
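For example, wiring the flag into common's argument parsing could look roughly like this (a hypothetical sketch; the parameter name `batch_warmup` and the surrounding parsing code are assumptions, not the PR's actual code):

```cpp
// Inside common's argument-parsing code (hypothetical):
if (arg == "--warmup-batch" || arg == "-wb") {
    params.batch_warmup = true; // assumed field on gpt_params; each example decides whether to act on it
    return true;
}
```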

@saood06
Collaborator

saood06 commented May 4, 2025

Wouldn't it make sense to make this a global warmup option across bench and common

It would. The command line option is added to common, so the parameter is theoretically available to all examples using common.

Yes, but the implementation is done in sweep-bench.cpp, not in common.cpp; you just added the command line option there, not the implementation (see the warmup implementation in common.cpp here:

ik_llama.cpp/common/common.cpp

Lines 2271 to 2305 in 1328128

    if (params.warmup) {
        LOG("warming up the model with an empty run\n");
        std::vector<llama_token> tmp;
        llama_token bos = llama_token_bos(model);
        llama_token eos = llama_token_eos(model);
        // some models (e.g. T5) don't have a BOS token
        if (bos != -1) {
            tmp.push_back(bos);
        } else {
            tmp.push_back(eos);
        }
        if (llama_model_has_encoder(model)) {
            llama_encode(lctx, llama_batch_get_one(tmp.data(), tmp.size(), 0, 0));
            llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
            if (decoder_start_token_id == -1) {
                decoder_start_token_id = bos;
            }
            tmp.clear();
            tmp.push_back(decoder_start_token_id);
        }
        if (llama_model_has_decoder(model)) {
            llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
        }
        llama_kv_cache_clear(lctx);
        llama_synchronize(lctx);
        llama_reset_timings(lctx);
    }
    iparams.model   = model;
    iparams.context = lctx;
    return iparams;
}
)

Also you may as well address it in bench which does not use common.cpp (or I can if you want), as it should be simple and meaningful to address there.

But I think improving warm-up in general could use a separate PR. Here I'm just addressing the need to have better benchmark results on CUDA (as I intend to add MMQ for all IQK quants).

Yes I agree.

@ikawrakow
Owner Author

Yes, but the implementation is done in sweep-bench.cpp, not in common.cpp; you just added the command line option there, not the implementation (see the warmup implementation in common.cpp here:

Yes, because I'm not sure what this unified warmup is going to be. If it ends up being the same or similar enough, one can reuse it in sweep-bench. But for now it is best if we don't touch the common warmup, thus affecting all examples.

Also you may as well address it in bench which does not use common.cpp (or I can if you want), as it should be simple and meaningful to address there.

llama-bench is a different animal. It uses a warmup that depends on the test being run. For PP it runs a batch, for TG it runs a single token, etc. Apart from this there are repetitions, so one does not rely on a single measurement as sweep-bench does. And, if that's not enough, I can always do llama-bench -p 512,512 and discard the first result.

@saood06
Collaborator

saood06 commented May 4, 2025

Yes, because I'm not sure what this unified warmup is going to be. If it ends up being the same or similar enough, one can reuse it in sweep-bench. But for now it is best if we don't touch the common warmup, thus affecting all examples.

I was just using that as an example; it would be a separate batch_warmup. If you found something that solves the problem, then it makes sense to be able to use it for all things that support common. There are times I would want it when launching a fully CUDA-offloaded llama-server, which uses common.

Also you may as well address it in bench which does not use common.cpp (or I can if you want), as it should be simple and meaningful to address there.

llama-bench is a different animal. It uses a warmup that depends on the test being run. For PP it runs a batch, for TG it runs a single token, etc. Apart from this there are repetitions, so one does not rely on a single measurement as sweep-bench does. And, if that's not enough, I can always do llama-bench -p 512,512 and discard the first result.

Yes, I often output the JSON because you can see all the results (and I am familiar with -r, and was thinking of adding that to sweep-bench eventually). But if it affects results here, wouldn't it affect things there? I was going to try to reproduce but I got sidetracked porting Deci.

@ubergarm
Contributor

ubergarm commented May 7, 2025

tl;dr:

👍

Just tested this and also made a quick-n-dirty adaptation which works on mainline as well.

main

ik_llama.cpp/main@4084ca73

model=/mnt/astrodata/llm/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fmoe \
  -fa \
  -ctk f16 -ctv f16 \
  -c 32768 \
  -ngl 99 \
  --threads 1

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    0.333 |  1538.11 |    1.228 |   104.21 |
|   512 |    128 |    512 |    0.303 |  1691.86 |    1.253 |   102.19 |
|   512 |    128 |   1024 |    0.308 |  1661.26 |    1.247 |   102.67 |
|   512 |    128 |   1536 |    0.309 |  1658.42 |    1.257 |   101.85 |
|   512 |    128 |   2048 |    0.322 |  1591.58 |    1.290 |    99.26 |
|   512 |    128 |   2560 |    0.313 |  1637.87 |    1.289 |    99.27 |
|   512 |    128 |   3072 |    0.321 |  1596.37 |    1.294 |    98.90 |
|   512 |    128 |   3584 |    0.319 |  1606.05 |    1.301 |    98.41 |

PR375

ik_llama.cpp/sweep_bench_warmup@a3975acd

model=/mnt/astrodata/llm/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fmoe \
  -fa \
  -ctk f16 -ctv f16 \
  -c 32768 \
  -ngl 99 \
  --threads 1 \
  --warmup-batch

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    0.313 |  1635.74 |    1.235 |   103.67 |
|   512 |    128 |    512 |    0.306 |  1674.18 |    1.259 |   101.64 |
|   512 |    128 |   1024 |    0.306 |  1673.91 |    1.253 |   102.15 |
|   512 |    128 |   1536 |    0.317 |  1615.14 |    1.270 |   100.81 |
|   512 |    128 |   2048 |    0.310 |  1653.47 |    1.287 |    99.48 |
|   512 |    128 |   2560 |    0.314 |  1630.52 |    1.287 |    99.45 |
|   512 |    128 |   3072 |    0.316 |  1619.71 |    1.291 |    99.16 |
|   512 |    128 |   3584 |    0.318 |  1608.00 |    1.302 |    98.32 |

@ikawrakow ikawrakow merged commit 1d2da7f into main May 12, 2025
ubergarm added a commit to ubergarm/llama.cpp that referenced this pull request Aug 14, 2025
From ikawrakow/ik_llama.cpp#375

Hardcoded to true to always run to avoid adding more arguments.
@ubergarm
Contributor

ubergarm commented Dec 4, 2025

oooh jeeze... i gotta fix this, sorry for all the force push spam, not sure why GH is like this 😅 💀

