Explore the complete analysis inside: **Version Insights Performance Analysis Summary - PR #387**

**PR Context:** Adds Ministral3 model support (upstream PR #17644) with a new architecture definition, YaRN RoPE scaling, and an attention temperature scaling feature.

**Changes:** 11 files modified, 342 additions, 10 deletions. New model architecture (`LLM_ARCH_MISTRAL3`) with a dedicated graph builder and hyperparameter handling.

**Analysis Classification: Condition 3.** The performance changes show high percentages but minimal absolute impact. The analyzed functions are primarily STL utilities and graph input helpers, not core inference functions.

## Key Findings

### Most-Impacted Functions

Graph Input Processing:
STL Container Operations:
These are compiler-generated functions showing optimization variance, not source code changes.

KV Cache Operations (Improvements):
## Inference Impact (Tokens per Second)

No impact on tokens per second. The core inference functions (`llama_decode`, `llama_encode`, `llama_tokenize`) show no changes in this analysis. The modified functions are:
The +201 ns in `set_input` occurs once per batch setup, not per token. For a 128-token batch, this adds roughly 1.6 ns per token, which is negligible.

## Power Consumption Analysis

Binary: `build.bin.libllama.so`
Impact: Negligible. The 642 nJ increase represents cumulative throughput changes across all functions. Other binaries show a 0.0% change.

## Code Change Summary

`llama-hparams.h`:
`llama-graph.cpp`:
`llama-model.cpp`:
New file: `models/mistral3.cpp` (160 lines)
The performance metrics reflect compilation variance in STL operations rather than algorithmic degradation. The intentional code changes (assertions, architecture support) add minimal overhead and enable new model support.
Mirrored from ggml-org/llama.cpp#17644
Ref upstream PR: huggingface/transformers#42498
Disclosure: This PR was made in collaboration with Mistral. Huge thanks to @juliendenize for the coordination!
Note: The model weights are not yet released.