
UPSTREAM PR #17644: model: support Ministral3 (#387)

Open
loci-dev wants to merge 9 commits into main from upstream-PR17644-branch_ngxson-xsn/ministral3

Conversation


@loci-dev loci-dev commented Dec 1, 2025

Mirrored from ggml-org/llama.cpp#17644

Ref upstream PR: huggingface/transformers#42498

Disclosure: This PR is made with collaboration from Mistral. Huge thanks to @juliendenize for coordination!

Note: The model weight is not yet released


loci-review bot commented Dec 1, 2025

Explore the complete analysis in Version Insights

Performance Analysis Summary - PR #387

PR Context: Adds Ministral3 model support (upstream PR #17644) with new architecture definition, YARN RoPE scaling, and attention temperature scaling feature.

Changes: 11 files modified, 342 additions, 10 deletions. New model architecture (LLM_ARCH_MISTRAL3) with dedicated graph builder and hyperparameter handling.


Analysis Classification: Condition 3

The performance changes show high percentages but minimal absolute impact. The analyzed functions are primarily STL utilities and graph input helpers, not core inference functions.


Key Findings

Most-Impacted Functions

Graph Input Processing:

  • set_input (llm_graph_input_attn_temp): +201 ns response time, +170 ns throughput
    • Added two GGML_ASSERT statements validating f_attn_temp_scale and n_attn_temp_floor_scale
    • Assertions execute before attention scale calculation loop
    • Change implements safety checks for new Ministral3 attention temperature feature
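The added guards can be sketched as follows. This is a minimal, hypothetical reconstruction, not the actual llama.cpp source: `GGML_ASSERT` is modeled with `<cassert>`, and the struct and field names (`f_attn_temp_scale`, `n_attn_temp_floor_scale`) are taken from this summary.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llama.cpp's GGML_ASSERT macro.
#define GGML_ASSERT(x) assert(x)

// Sketch of the hyperparameters involved (field names from this PR summary).
struct hparams_sketch {
    float    f_attn_temp_scale       = 0.0f; // 0.0 => feature disabled
    uint32_t n_attn_temp_floor_scale = 0;    // 0   => feature disabled
};

// Sketch of the validation added to set_input: when attention temperature
// scaling is active, both parameters must be non-zero, since
// n_attn_temp_floor_scale is later used as a divisor.
void validate_attn_temp(const hparams_sketch & hp) {
    GGML_ASSERT(hp.f_attn_temp_scale != 0.0f);
    GGML_ASSERT(hp.n_attn_temp_floor_scale != 0);
}
```

Two straight-line assertions like these cost nanoseconds, consistent with the +201 ns measured on `set_input`.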

STL Container Operations:

  • empty (vector): +133 ns throughput (+204%)
  • back (vector): +29 ns throughput (+69%)
  • operator= (shared_ptr): +25 ns throughput (+45%)

These are compiler-generated functions showing optimization variance, not source code changes.

KV Cache Operations (Improvements):

  • ext_get: -29 ns response time
  • ext_set: -12 ns response time
  • get_shift: -21 ns throughput

Inference Impact (Tokens per Second)

No impact on tokens per second. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no changes in this analysis. The modified functions are:

  • Graph input helpers (set_input): Used during graph construction, not per-token execution
  • STL utilities: Compiler-generated code with negligible absolute overhead
  • KV cache accessors: Show improvements, not degradations

The +201 ns in set_input occurs once per batch setup, not per token. For a 128-token batch, this adds 1.6 ns per token, which is negligible.
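The amortization claim is simple arithmetic, sketched here for the batch size used above (128 tokens):

```cpp
// Per-token cost of a fixed per-batch overhead: overhead_ns / n_tokens.
// 201 ns spread over a 128-token batch is ~1.57 ns per token.
constexpr double per_token_ns(double overhead_ns, int n_tokens) {
    return overhead_ns / n_tokens;
}
```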

Power Consumption Analysis

Binary: build.bin.libllama.so

  • Power consumption: 193,765 nJ (target) vs 193,122 nJ (base)
  • Change: +0.33% (+642 nJ)

Impact: Negligible. The 642 nJ increase represents cumulative throughput changes across all functions. Other binaries show 0.0% change.

Code Change Summary

llama-hparams.h:

  • Changed default values: n_attn_temp_floor_scale from 8192 to 0, f_attn_temp_scale from 0.1 to 0.0
  • Purpose: Make attention temperature scaling opt-in rather than default
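The opt-in behavior of the zero defaults can be sketched like this. The struct name is hypothetical; the field names, old values, and new values are those quoted above.

```cpp
#include <cstdint>

// Sketch of the default change: with zero defaults, attention temperature
// scaling is off unless a model's metadata explicitly sets both values.
struct llama_hparams_sketch {
    uint32_t n_attn_temp_floor_scale = 0;    // was 8192
    float    f_attn_temp_scale       = 0.0f; // was 0.1
};

// The feature counts as enabled only when explicitly configured.
inline bool attn_temp_enabled(const llama_hparams_sketch & hp) {
    return hp.f_attn_temp_scale != 0.0f && hp.n_attn_temp_floor_scale != 0;
}
```

With the old defaults, every architecture would have carried non-zero values and needed an explicit arch check to skip the feature; with zero defaults, only models that load the parameters (such as MISTRAL3) activate it.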

llama-graph.cpp:

  • Added assertions in set_input to validate non-zero temperature parameters
  • Purpose: Prevent division by zero when temperature scaling is enabled

llama-model.cpp:

  • Moved LLAMA arch default initialization logic
  • Added MISTRAL3 arch case with temperature scaling parameter loading
  • Purpose: Support new Ministral3 model format with YARN RoPE and temperature scaling

New File: models/mistral3.cpp (160 lines)

  • Implements llm_build_mistral3 graph builder
  • Includes conditional attention temperature scaling: `if (inp_attn_scale) { Qcur = ggml_mul(ctx0, Qcur, inp_attn_scale); }`
  • Purpose: Build computation graph for Ministral3 architecture
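The conditional quoted above can be illustrated with plain buffers in place of ggml tensors. This is a behavioral sketch only: the real code multiplies `Qcur` by the `inp_attn_scale` tensor via `ggml_mul` inside the graph; here a flat `std::vector<float>` plays the role of Q and an optional scale vector is broadcast over it.

```cpp
#include <cstddef>
#include <vector>

// Plain-vector stand-in for the graph-builder conditional:
//   if (inp_attn_scale) { Qcur = ggml_mul(ctx0, Qcur, inp_attn_scale); }
// When no scale input exists (feature disabled), Q passes through unchanged.
void apply_attn_temp_scale(std::vector<float> & q,
                           const std::vector<float> * scale) {
    if (!scale) {
        return; // no inp_attn_scale tensor: skip the multiply entirely
    }
    for (size_t i = 0; i < q.size(); ++i) {
        q[i] *= (*scale)[i % scale->size()]; // broadcast scale over Q
    }
}
```

Because the branch is taken at graph-construction time, models without temperature scaling pay no per-token cost for the feature.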

The performance metrics reflect compilation variance in STL operations rather than algorithmic degradation. The intentional code changes (assertions, architecture support) add minimal overhead and enable new model support.

@loci-dev loci-dev force-pushed the main branch 18 times, most recently from 1c3cc79 to 0332e09 on December 2, 2025 at 21:09
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 1bd5bdc to 32aa2bc on December 8, 2025 at 11:08