UPSTREAM PR #17945: models : fix the attn_factor for mistral3 graphs #526

Open

loci-dev wants to merge 2 commits into main from upstream-PR17945-branch_ggml-org-gg/mistral-fix-attn-factor
Conversation

loci-review bot commented Dec 11, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #526

Overview

This PR implements a correctness fix for RoPE attention factor calculation in Mistral3 models. The changes remove 12 lines from llama-model.cpp and add 2 lines to mistral3.cpp, relocating model-specific attention scaling logic from generic model loading to the Mistral3 graph builder.

Key Findings

Code Changes Impact:
The modification replaces an incorrect generic YaRN attention factor calculation with a simplified, model-specific formula aligned with Hugging Face Transformers. The old implementation computed a ratio of two mscale values, while the new formula uses attn_factor = 1.0 / (1.0 + 0.1 * logf(1.0 / freq_scale)). This executes once during graph construction rather than during model loading.
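The corrected formula above can be sketched as a small standalone function. This is an illustrative reconstruction, not the actual llama.cpp code: the function name `mistral3_attn_factor` is hypothetical, and the only assumption carried over from the summary is that `freq_scale` is the RoPE frequency scale, so `1/freq_scale` is the context-extension ratio that YaRN's mscale term operates on.

```cpp
#include <cmath>

// Illustrative sketch of the corrected attention-factor formula
// described in the summary (hypothetical name, not a llama.cpp symbol).
// For freq_scale = 1 (no context extension), log(1) = 0 and the
// factor is exactly 1, i.e. attention scaling is a no-op.
static float mistral3_attn_factor(float freq_scale) {
    return 1.0f / (1.0f + 0.1f * std::log(1.0f / freq_scale));
}
```

For example, with a 4x context extension (freq_scale = 0.25) this yields 1 / (1 + 0.1 * ln 4) ≈ 0.878, slightly damping attention scores as the context grows.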

Performance-Critical Functions:
Analysis of the top 10 functions with response time changes shows improvements primarily in STL container operations unrelated to this PR. Notable changes include before_begin reducing response time by 112 ns (from 195 ns to 83 ns) and __val_comp_iter reducing by 132 ns (from 252 ns to 120 ns). The _Rb_tree constructor for RoPE scaling type maps regressed by 31 ns (from 178 ns to 209 ns), likely due to increased allocator complexity.

Inference Impact:
No core inference functions (llama_decode, llama_encode, llama_tokenize) were modified in this PR. The attention factor change affects only Mistral3 graph construction, executing once per context initialization. Since tokenization and decode paths remain unchanged, tokens per second throughput is unaffected. The RoPE attention correction improves output quality without impacting inference speed.

Power Consumption:
The build.bin.libllama.so binary shows a 0.071% reduction in power consumption (a 139 nJ decrease, from 195495 nJ to 195356 nJ). The build.bin.llama-run binary shows a 0.001% reduction. All other binaries remain unchanged. These minimal improvements reflect the removal of computation from model loading.
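The quoted 0.071% figure follows directly from the measured values. A quick cross-check, using only the numbers reported above (the helper name `pct_reduction` is illustrative):

```cpp
#include <cmath>

// Percentage reduction between two measurements, e.g. per-binary
// power consumption in nanojoules as reported in the summary.
static double pct_reduction(double before, double after) {
    return 100.0 * (before - after) / before;
}
```

Plugging in the libllama.so values, 100 * (195495 - 195356) / 195495 ≈ 0.071%, matching the reported reduction.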

Conclusion:
This is a correctness fix with negligible performance impact. The observed STL container improvements are unrelated compiler optimizations. The RoPE attention factor correction enhances model accuracy for Mistral3 without affecting inference throughput.

@loci-dev force-pushed the main branch 25 times, most recently from 45e0e28 to e9472cd on December 15, 2025 02:47
@loci-dev force-pushed the main branch 30 times, most recently from 9f1f66d to ec69147 on December 19, 2025 12:14