bench : cache the llama_context state at computed depth#16944
Conversation
I should clarify: for a MoE model this is not going to work correctly. Because the expert selection depends on the numerical input values, leaving the memory of the KV cache uninitialized is going to bias the results. My understanding is that functionality was recently added to the server which swaps the on-device KV cache with RAM. If that could be repurposed for an implementation like this, it would I think already be fast enough:
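The bookkeeping for such a host-RAM cache could be sketched as follows. This is a self-contained illustration only: the `state_blob`, `state_cache`, `store`, and `lookup` names are hypothetical, and a real implementation would fill the blobs via `llama_state_seq_get_data` and restore them via `llama_state_seq_set_data` rather than the plain byte vectors used here.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Stand-in for a serialized sequence state as produced by
// llama_state_seq_get_data (hypothetical simplification).
using state_blob = std::vector<uint8_t>;

struct state_cache {
    // One cached state blob per computed depth, kept in regular RAM.
    std::map<int, state_blob> by_depth;

    void store(int depth, state_blob blob) {
        by_depth[depth] = std::move(blob);
    }

    // Returns nullptr on a cache miss, i.e. the depth has to be
    // recomputed from scratch.
    const state_blob * lookup(int depth) const {
        auto it = by_depth.find(depth);
        return it == by_depth.end() ? nullptr : &it->second;
    }
};
```

The point of keeping the blobs in host RAM is that restoring a previously computed state avoids re-running prompt processing for every repetition at the same depth, without leaving the KV cache memory uninitialized.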
@JohannesGaessler Good idea - pushed a version that I think does what you described.
JohannesGaessler left a comment:
Thank you, this version seems to be working correctly. One caveat is that the order in which you do the batch sizes matters. Preferably you would run the large batch sizes first, as they are going to process the first, uncached KV context much faster. However, the syntax -ub "1-512*2" will do the batch sizes in a suboptimal order (for my purposes this doesn't matter because I'm going to script it anyway). One solution would be to always do the depth run with a constant batch size, but previously when we tried that it was causing issues (and I'm not sure it would be worth the opportunity cost to fix).
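The ordering concern above can be sketched in a few lines. The helpers below are hypothetical (this is not the actual llama-bench range parser): they expand a `lo-hi*mult` style range such as `1-512*2` and then reorder it largest-first, so the initial uncached KV context is processed with the fastest (largest-batch) configuration.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Expand a range like "1-512*2": start at lo, multiply by mult each
// step, stop after hi. Hypothetical helper for illustration.
static std::vector<int> expand_range(int lo, int hi, int mult) {
    std::vector<int> out;
    for (int v = lo; v <= hi; v *= mult) {
        out.push_back(v);
    }
    return out;
}

// Reorder the batch sizes largest-first so the expensive, uncached
// prefill happens at the largest batch size.
static std::vector<int> order_for_bench(std::vector<int> sizes) {
    std::sort(sizes.begin(), sizes.end(), std::greater<int>());
    return sizes;
}
```

With this ordering, `1-512*2` would run as 512, 256, ..., 1 instead of ascending, which is what the comment above suggests scripting around.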
Can you point me to this previous attempt?
Hm yeah, not sure what's the best way to do that.
```cpp
if (params.progress) {
    fprintf(stderr, "llama-bench: benchmark %d/%zu: depth run %d/%d\n", params_idx, params_count,
            i + 1, params.reps);
}
// ...
bool is_cached = t.n_depth == cstate.depth;
```
Wouldn't this also need to check that the model is the same, that it is using the same KV type, or other parameters that may make the cache incompatible with the current test?
llama_state_seq_set_data should (in theory) return an error (i.e. 0) when the state is incompatible with the current llama_context. I did try a few cases (different models, different KV cache types) and it seems to work as expected.
But it is a bit risky if somehow its internal checks fail to detect an incompatibility, which can lead to invalid benches. So not sure - we could simplify the logic to just reuse the state for the repetitions of the same test?
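The fallback described above can be sketched like this. It is a self-contained simulation: `restore_state` stands in for `llama_state_seq_set_data` (which returns the number of bytes read on success and 0 on incompatibility), and the magic-tag check models the internal model/KV-type compatibility validation; all names here are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Build a fake serialized state: a compatibility tag followed by payload.
static std::vector<uint8_t> make_blob(uint32_t tag, size_t payload) {
    std::vector<uint8_t> blob(sizeof(tag) + payload, 0);
    std::memcpy(blob.data(), &tag, sizeof(tag));
    return blob;
}

// Stand-in for llama_state_seq_set_data: returns the number of bytes
// consumed on success, 0 when the state is incompatible.
static size_t restore_state(const std::vector<uint8_t> & blob, uint32_t expected_tag) {
    if (blob.size() < sizeof(uint32_t)) {
        return 0;
    }
    uint32_t tag;
    std::memcpy(&tag, blob.data(), sizeof(tag));
    return tag == expected_tag ? blob.size() : 0;
}

// Try to reuse the cached state; on failure (return value 0) fall back
// to recomputing the depth from scratch instead of benchmarking with
// an invalid context.
static bool try_reuse(const std::vector<uint8_t> & blob, uint32_t tag, bool & recompute) {
    const size_t n = restore_state(blob, tag);
    recompute = (n == 0);
    return !recompute;
}
```

Treating a 0 return as "recompute from scratch" keeps the benchmark valid even when the incompatibility detection triggers, at the cost of losing the speedup for that test.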
Did some extra tests with SSMs and it works as expected. Let's keep an eye out just in case - using
* bench : cache llama_context state at depth
* cont : handle failures to restore the old state
* cont : print information when the state is being reused
See #16944 (comment)
Sample commands:
```
make -j && ./bin/llama-bench -m ../models/gpt-oss-20b/ggml-model-mxfp4.gguf -t 1 -fa 1 -b 16384 -ub 2048 -d 0,1024,2048,4096,8192,16384,32768 -n 32 -p 2048
make -j && ./bin/llama-bench -m ../models/qwen2.5-3b-coder/ggml-model-q4_k.gguf -fa 1 -d 1024 -p 512 -ctk f16,q8_0
```