bench : cache the llama_context state at computed depth#16944
Conversation
I should clarify: for a MoE model this is not going to work correctly. Because the expert selection depends on the numerical input values, leaving the memory of the KV cache uninitialized is going to bias the results. My understanding is that functionality was recently added to the server which swaps the on-device KV cache with RAM. If that could be repurposed for an implementation like this, it would I think already be fast enough:
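The bookkeeping for such a host-RAM cache could be sketched as follows. This is a self-contained illustration only: the `state_blob`, `state_cache`, `store`, and `lookup` names are hypothetical, and a real implementation would fill the blobs via `llama_state_seq_get_data` and restore them via `llama_state_seq_set_data` rather than the plain byte vectors used here.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Stand-in for a serialized sequence state as produced by
// llama_state_seq_get_data (hypothetical simplification).
using state_blob = std::vector<uint8_t>;

struct state_cache {
    // One cached state blob per computed depth, kept in regular RAM.
    std::map<int, state_blob> by_depth;

    void store(int depth, state_blob blob) {
        by_depth[depth] = std::move(blob);
    }

    // Returns nullptr on a cache miss, i.e. the depth has to be
    // recomputed from scratch.
    const state_blob * lookup(int depth) const {
        auto it = by_depth.find(depth);
        return it == by_depth.end() ? nullptr : &it->second;
    }
};
```

The point of keeping the blobs in host RAM is that restoring a previously computed state avoids re-running prompt processing for every repetition at the same depth, without leaving the KV cache memory uninitialized.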
@JohannesGaessler Good idea - pushed a version that I think does what you described.
JohannesGaessler left a comment:
Thank you, this version seems to be working correctly. One caveat is that the order in which you do the batch sizes matters. Preferably you would run the large batch sizes first, as they are going to process the first, uncached KV context much faster. However, the syntax -ub "1-512*2" will do the batch sizes in a suboptimal order (for my purposes this doesn't matter because I'm going to script it anyway). One solution would be to always do the depth run with a constant batch size, but previously when we tried that it was causing issues (and I'm not sure it would be worth the opportunity cost to fix).
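The ordering concern above can be sketched in a few lines. The helpers below are hypothetical (this is not the actual llama-bench range parser): they expand a `lo-hi*mult` style range such as `1-512*2` and then reorder it largest-first, so the initial uncached KV context is processed with the fastest (largest-batch) configuration.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Expand a range like "1-512*2": start at lo, multiply by mult each
// step, stop after hi. Hypothetical helper for illustration.
static std::vector<int> expand_range(int lo, int hi, int mult) {
    std::vector<int> out;
    for (int v = lo; v <= hi; v *= mult) {
        out.push_back(v);
    }
    return out;
}

// Reorder the batch sizes largest-first so the expensive, uncached
// prefill happens at the largest batch size.
static std::vector<int> order_for_bench(std::vector<int> sizes) {
    std::sort(sizes.begin(), sizes.end(), std::greater<int>());
    return sizes;
}
```

With this ordering, `1-512*2` would run as 512, 256, ..., 1 instead of ascending, which is what the comment above suggests scripting around.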
Can you point me to this previous attempt?
Hm yeah, not sure what's the best way to do that.
```cpp
if (params.progress) {
    fprintf(stderr, "llama-bench: benchmark %d/%zu: depth run %d/%d\n", params_idx, params_count,
            i + 1, params.reps);
}
// ...
bool is_cached = t.n_depth == cstate.depth;
```
Wouldn't this also need to check that the model is the same, that it is using the same KV type, or other parameters that may make the cache incompatible with the current test?
llama_state_seq_set_data should (in theory) return an error (i.e. 0) when the state is incompatible with the current llama_context. I did try a few cases (different models, different KV cache types) and it seems to work as expected.
But it is a bit risky if somehow its internal checks fail to detect an incompatibility, which can lead to invalid benches. So not sure - we could simplify the logic to just reuse the state for the repetitions of the same test?
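The fallback described above can be sketched like this. It is a self-contained simulation: `restore_state` stands in for `llama_state_seq_set_data` (which returns the number of bytes read on success and 0 on incompatibility), and the magic-tag check models the internal model/KV-type compatibility validation; all names here are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Build a fake serialized state: a compatibility tag followed by payload.
static std::vector<uint8_t> make_blob(uint32_t tag, size_t payload) {
    std::vector<uint8_t> blob(sizeof(tag) + payload, 0);
    std::memcpy(blob.data(), &tag, sizeof(tag));
    return blob;
}

// Stand-in for llama_state_seq_set_data: returns the number of bytes
// consumed on success, 0 when the state is incompatible.
static size_t restore_state(const std::vector<uint8_t> & blob, uint32_t expected_tag) {
    if (blob.size() < sizeof(uint32_t)) {
        return 0;
    }
    uint32_t tag;
    std::memcpy(&tag, blob.data(), sizeof(tag));
    return tag == expected_tag ? blob.size() : 0;
}

// Try to reuse the cached state; on failure (return value 0) fall back
// to recomputing the depth from scratch instead of benchmarking with
// an invalid context.
static bool try_reuse(const std::vector<uint8_t> & blob, uint32_t tag, bool & recompute) {
    const size_t n = restore_state(blob, tag);
    recompute = (n == 0);
    return !recompute;
}
```

Treating a 0 return as "recompute from scratch" keeps the benchmark valid even when the incompatibility detection triggers, at the cost of losing the speedup for that test.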
Did some extra tests with SSMs and it works as expected. Let's keep an eye out just in case - using
* bench : cache llama_context state at depth
* cont : handle failures to restore the old state
* cont : print information when the state is being reused
See #16944 (comment)
Sample commands:
```
make -j && ./bin/llama-bench -m ../models/gpt-oss-20b/ggml-model-mxfp4.gguf -t 1 -fa 1 -b 16384 -ub 2048 -d 0,1024,2048,4096,8192,16384,32768 -n 32 -p 2048
make -j && ./bin/llama-bench -m ../models/qwen2.5-3b-coder/ggml-model-q4_k.gguf -fa 1 -d 1024 -p 512 -ctk f16,q8_0
```