Explore the complete analysis inside: **Version Insights Performance Analysis Summary - PR #387**

**PR Context:** Adds Ministral3 model support (upstream PR #17644) with a new architecture definition, YaRN RoPE scaling, and an attention temperature scaling feature.

**Changes:** 11 files modified, 342 additions, 10 deletions. New model architecture (`LLM_ARCH_MISTRAL3`) with a dedicated graph builder and hyperparameter handling.

**Analysis Classification: Condition 3.** The performance changes show high percentages but minimal absolute impact. The analyzed functions are primarily STL utilities and graph input helpers, not core inference functions.

## Key Findings

### Most-Impacted Functions

Graph Input Processing:
STL Container Operations:
These are compiler-generated functions showing optimization variance, not source code changes.

KV Cache Operations (Improvements):
## Inference Impact (Tokens per Second)

No impact on tokens per second. The core inference functions (`llama_decode`, `llama_encode`, `llama_tokenize`) show no changes in this analysis. The modified functions are:
The +201 ns in `set_input` occurs once per batch setup, not per token. For a 128-token batch, this adds roughly 1.6 ns per token, which is negligible.

## Power Consumption Analysis

Binary: `build.bin.libllama.so`
Impact: Negligible. The 642 nJ increase represents cumulative throughput changes across all functions. Other binaries show a 0.0% change.

## Code Change Summary

`llama-hparams.h`:
`llama-graph.cpp`:
`llama-model.cpp`:
New file: `models/mistral3.cpp` (160 lines)
The performance metrics reflect compilation variance in STL operations rather than algorithmic degradation. The intentional code changes (assertions, architecture support) add minimal overhead and enable new model support.
Mirrored from ggml-org/llama.cpp#17644
Ref upstream PR: huggingface/transformers#42498
Disclosure: This PR was made in collaboration with Mistral. Huge thanks to @juliendenize for the coordination!
Note: The model weights are not yet released.