UPSTREAM PR #17945: models : fix the attn_factor for mistral3 graphs #526

Open

loci-dev wants to merge 2 commits into main from upstream-PR17945-branch_ggml-org-gg/mistral-fix-attn-factor
Conversation

loci-review bot commented Dec 11, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #526

Overview

This PR implements a correctness fix for RoPE attention factor calculation in Mistral3 models. The changes remove 12 lines from llama-model.cpp and add 2 lines to mistral3.cpp, relocating model-specific attention scaling logic from generic model loading to the Mistral3 graph builder.

Key Findings

Code Changes Impact:
The modification replaces an incorrect generic YaRN attention factor calculation with a simplified, model-specific formula aligned with Hugging Face Transformers. The old implementation computed a ratio of two mscale values, while the new formula uses attn_factor = 1.0 / (1.0 + 0.1 * logf(1.0 / freq_scale)). This executes once during graph construction rather than during model loading.
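The corrected formula above can be sketched as a small standalone function. This is an illustrative reconstruction, not the actual llama.cpp code: the function name `mistral3_attn_factor` is hypothetical, and the only assumption carried over from the summary is that `freq_scale` is the RoPE frequency scale, so `1/freq_scale` is the context-extension ratio that YaRN's mscale term operates on.

```cpp
#include <cmath>

// Illustrative sketch of the corrected attention-factor formula
// described in the summary (hypothetical name, not a llama.cpp symbol).
// For freq_scale = 1 (no context extension), log(1) = 0 and the
// factor is exactly 1, i.e. attention scaling is a no-op.
static float mistral3_attn_factor(float freq_scale) {
    return 1.0f / (1.0f + 0.1f * std::log(1.0f / freq_scale));
}
```

For example, with a 4x context extension (freq_scale = 0.25) this yields 1 / (1 + 0.1 * ln 4) ≈ 0.878, slightly damping attention scores as the context grows.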

Performance-Critical Functions:
Analysis of the top 10 functions with response time changes shows improvements primarily in STL container operations unrelated to this PR. Notable changes include before_begin reducing response time by 112 ns (from 195 ns to 83 ns) and __val_comp_iter reducing by 132 ns (from 252 ns to 120 ns). The _Rb_tree constructor for RoPE scaling type maps regressed by 31 ns (from 178 ns to 209 ns), likely due to increased allocator complexity.

Inference Impact:
No core inference functions (llama_decode, llama_encode, llama_tokenize) were modified in this PR. The attention factor change affects only Mistral3 graph construction, executing once per context initialization. Since tokenization and decode paths remain unchanged, tokens per second throughput is unaffected. The RoPE attention correction improves output quality without impacting inference speed.

Power Consumption:
The build.bin.libllama.so binary shows a 0.071% reduction in power consumption (a 139 nJ decrease, from 195495 nJ to 195356 nJ). The build.bin.llama-run binary shows a 0.001% reduction. All other binaries remain unchanged. These minimal improvements reflect the removal of computation from model loading.
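The quoted 0.071% figure follows directly from the measured values. A quick cross-check, using only the numbers reported above (the helper name `pct_reduction` is illustrative):

```cpp
#include <cmath>

// Percentage reduction between two measurements, e.g. per-binary
// power consumption in nanojoules as reported in the summary.
static double pct_reduction(double before, double after) {
    return 100.0 * (before - after) / before;
}
```

Plugging in the libllama.so values, 100 * (195495 - 195356) / 195495 ≈ 0.071%, matching the reported reduction.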

Conclusion:
This is a correctness fix with negligible performance impact. The observed STL container improvements are unrelated compiler optimizations. The RoPE attention factor correction enhances model accuracy for Mistral3 without affecting inference throughput.

@loci-dev force-pushed the main branch 25 times, most recently from 45e0e28 to e9472cd on December 15, 2025 02:47
@loci-dev force-pushed the main branch 30 times, most recently from 9f1f66d to ec69147 on December 19, 2025 12:14