Conversation
Wouldn't it make sense to make this a global warmup option across bench and common (see this commit for when I affected all of them: 3702743)? The only other thing is that if you want the warmup MoE optimization of loading in all experts, we would need to make the way that happens more robust, as it is hacky: it checks that the batch is exactly one token and that the token is the BOS (since that would never happen normally), but a full batch is a normal occurrence.
It would. The command line option is added to
Yes, but the implementation is in sweep-bench.cpp, not in common.cpp; you only added the command line option there, not the implementation (see the warmup implementation in common.cpp: ik_llama.cpp/common/common.cpp, lines 2271 to 2305 at 1328128). Also, you may as well address it in bench, which does not use common.cpp (or I can if you want), as it should be simple and meaningful to address there.
Yes I agree.
Yes, because I'm not sure what this unified warmup is going to be. If it ends up being the same or similar enough, one can reuse it in
I was just using that as an example, it would be a separate
Yes, I often output the json because you can see all the results (and I am familiar with
tl;dr: 👍 Just tested this and also made a quick-n-dirty adaptation which works on mainline as well. main
PR375
From ikawrakow/ik_llama.cpp#375 Hardcoded to true to always run to avoid adding more arguments.
oooh jeeze... i gotta fix this, sorry for all the force-push spam, not sure why GH is like this 😅 💀
When using sweep-bench on CUDA, the PP performance for N_KV = 0 (i.e., the first PP run) is often lower than the measured PP performance for N_KV > 0. My guess is that this is due to having to find and load the required pre-compiled kernels from the cache once, which may take time that is not negligible compared to the time it takes to compute the batch. For an example, see the graph in PR #374.

To prevent this misleading result, this PR adds the ability to also use a warm-up run with n_ubatch tokens. The option is off by default, as computing a batch on the CPU for a large model can take a significant amount of time (but the measured performance is not affected by having done a batch warmup run). To turn it on, use