
UPSTREAM PR #18013: convert : refactor rope scaling handling#560

Open
loci-dev wants to merge 3 commits into main from upstream-PR18013-branch_ggml-org-cisc/convert-rope-parameters

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18013

Handles rope scaling in set_gguf_parameters to deduplicate code and to support the new rope_parameters config (to which rope_theta has also moved), introduced in huggingface/transformers#39847.

Obsoletes #18008

@loci-review

loci-review bot commented Dec 14, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #560

Overview

This PR refactors RoPE scaling parameter handling in the HuggingFace-to-GGUF conversion script (convert_hf_to_gguf.py). The changes are purely architectural, consolidating 269 lines of duplicated RoPE configuration logic across 30+ model classes into a unified base class implementation. No runtime inference code or compiled binaries are modified.

Performance Impact

Zero performance impact on inference. The refactoring affects only the Python-based model conversion script, which executes before inference begins. The compiled binaries (libllama.so, libggml-cpu.so, etc.) remain unchanged, as confirmed by power consumption analysis showing 0.0% change across all 16 binaries.

Key inference functions unaffected:

  • llama_decode - No changes (tokenization/inference path intact)
  • llama_encode - No changes
  • llama_tokenize - No changes
  • ggml_mul_mat - No changes (matrix operations unchanged)
  • llama_graph_compute - No changes (computation graph unchanged)

Tokens per second: No impact. Since no inference-path functions are modified, token generation throughput remains identical to the baseline.

Power consumption: All binaries show 0.0% change, with maximum observed variance of ±0.0002% (measurement noise). Binaries analyzed: libllama.so (195,281 nJ), libggml-cpu.so (116,389 nJ), llama-run (220,352 nJ), and 13 others.

Code Changes

The refactoring introduces self.rope_parameters as a unified configuration dictionary in ModelBase.__init__(), normalizing RoPE parameter access across all model architectures. The base class set_gguf_parameters() now handles five RoPE scaling types (linear, yarn, longrope, dynamic, llama3) that were previously duplicated in individual model classes. Model-specific implementations (Llama2Model, Qwen2Model, Ministral3Model, etc.) now delegate common parameter handling to the base class via super().set_gguf_parameters(), retaining only architecture-specific logic.
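The delegation pattern described above can be sketched as follows. This is an illustrative reconstruction, not the actual convert_hf_to_gguf.py code: class names like Llama2Model match the PR description, but the attribute handling and GGUF key names are assumptions.

```python
# Hypothetical sketch of consolidating RoPE scaling handling in a base class.
# The real convert_hf_to_gguf.py differs in detail; this only shows the shape
# of the refactoring: normalize once in __init__, handle common scaling types
# in the base set_gguf_parameters(), and let subclasses delegate via super().

class ModelBase:
    def __init__(self, hparams: dict):
        # Newer HF configs nest RoPE settings under "rope_parameters";
        # older configs use a top-level "rope_scaling" dict.
        self.rope_parameters = (
            hparams.get("rope_parameters") or hparams.get("rope_scaling") or {}
        )
        self.gguf = {}  # stand-in for the GGUF metadata writer

    def set_gguf_parameters(self):
        # Common handling for the five scaling types named in the summary.
        rope_type = self.rope_parameters.get("rope_type") or self.rope_parameters.get("type")
        if rope_type in ("linear", "yarn", "longrope", "dynamic", "llama3"):
            self.gguf["rope.scaling.type"] = rope_type
            if "factor" in self.rope_parameters:
                self.gguf["rope.scaling.factor"] = self.rope_parameters["factor"]

class Llama2Model(ModelBase):
    def set_gguf_parameters(self):
        super().set_gguf_parameters()  # shared RoPE logic lives in the base class
        self.gguf["architecture"] = "llama"  # only architecture-specific keys remain
```

With this shape, adding a new scaling type means touching one method instead of 30+ model classes.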

The changes improve code maintainability and support the new rope_parameters format from HuggingFace Transformers while maintaining full backward compatibility with existing rope_scaling configurations. GGUF metadata output remains functionally equivalent.

@loci-review

loci-review bot commented Dec 14, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #560

Overview

This PR refactors RoPE (Rotary Position Embedding) parameter handling in the Python conversion script convert_hf_to_gguf.py. The changes affect model conversion workflow only, with no modifications to C++ inference binaries or performance-critical execution paths.

Analysis Results

Performance Metrics: No function-level performance data available. The summary report returned "no_data" status, indicating no measurable changes in Response Time or Throughput across all analyzed functions.

Power Consumption: All 16 binaries show 0.0% change in estimated power consumption between versions. The three binaries with negligible deltas (< 0.001%) are:

  • libllama.so: -0.56 nJ (-0.0003%)
  • llama-run: -0.19 nJ (-0.0001%)
  • llama-tts: -1.05 nJ (-0.0004%)

These sub-nanojoule variations represent measurement noise rather than functional changes.

Code Changes: The PR centralizes RoPE configuration management by introducing self.rope_parameters in ModelBase.__init__ and consolidating scaling logic in the base set_gguf_parameters() method. This eliminates 269 lines of duplicated code across 30+ model classes while adding 88 lines of unified handling logic.
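The backward-compatibility aspect mentioned above (new rope_parameters format alongside legacy rope_scaling, with rope_theta relocated) can be illustrated with a small accessor. The helper name and default value are assumptions for illustration, not the PR's actual API:

```python
# Hypothetical helper showing how a converter can read rope_theta from either
# config layout: newer transformers configs move it inside "rope_parameters",
# while older configs keep it at the top level of the model config.

def get_rope_theta(hparams: dict, default: float = 10000.0) -> float:
    rope_params = hparams.get("rope_parameters") or {}
    # Prefer the nested (new) location, fall back to the legacy top-level key.
    return rope_params.get("rope_theta", hparams.get("rope_theta", default))
```

Because the lookup falls back in order, existing configs keep converting unchanged while new-format configs are picked up automatically.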

Inference Impact: None. The refactoring affects only the Python-based model conversion process. No changes were made to:

  • llama_decode - primary inference function
  • llama_encode - prompt processing
  • llama_tokenize - tokenization logic
  • ggml_mul_mat - matrix operations
  • ggml_rope - RoPE computation kernels

The C++ inference engine remains unchanged, meaning tokens per second performance is unaffected. RoPE parameters are read from GGUF metadata at model load time, not during token generation.

Conclusion: This is a code quality improvement with zero runtime performance impact. The conversion script refactoring improves maintainability and adds support for new HuggingFace Transformers config formats without affecting inference speed or power consumption.

@loci-dev loci-dev force-pushed the main branch 23 times, most recently from 765e416 to 3c6cece Compare December 16, 2025 18:12
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 7ac0e44 to 5b544dd Compare December 22, 2025 12:15
