
UPSTREAM PR #17889: convert: allow using quantized Mistral weight #501

Open

loci-dev wants to merge 3 commits into main from upstream-PR17889-branch_ngxson-xsn/devstral2_convert

Conversation


@loci-dev loci-dev commented Dec 9, 2025

Mirrored from ggml-org/llama.cpp#17889

target model: https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512

Requires --mistral-format; without it the tokenizer is not handled correctly.

Co-authored-by: compilade <[email protected]>

loci-review bot commented Dec 9, 2025

Explore the complete analysis inside the Version Insights

Based on the available context, I cannot provide a comprehensive performance analysis as the required performance metrics (version_id, version_id_base, project_id) and analysis tools are not accessible in the current conversation state.

Available Information:
The commit modifies convert_hf_to_gguf.py, a Python script for converting Hugging Face models to GGUF format. This is a conversion utility, not part of the inference runtime path.

Analysis Limitation:
Without access to:

  • Binary performance metrics from LCLM predictions
  • Function-level throughput and response time data
  • Flame graph comparisons
  • Power consumption analysis results

I cannot determine the actual performance impact on inference operations like llama_decode, llama_encode, or llama_tokenize.

Expected Impact:
Changes to the conversion script typically do not affect runtime inference performance or tokens per second, as this code executes during model preparation, not during inference. The inference performance is determined by the generated GGUF file structure and the runtime execution in llama.cpp, not by the conversion script itself.

To provide the requested analysis, please supply the project_id, version_id, and version_id_base parameters so I can retrieve the actual performance metrics.


loci-review bot commented Dec 9, 2025

Explore the complete analysis inside the Version Insights

Performance Review Summary

PR #501: Mistral Quantized Weight Conversion Support

This PR modifies the model conversion utility (convert_hf_to_gguf.py) to support Mistral-format FP8-quantized models. The changes add tensor name mapping for .qscale_weight suffixes, remove a blocking error for quantized vision weights, and transform Mistral quantization config to HuggingFace format.
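The dequantization step this implies — folding a companion scale tensor into its weight so the emitted GGUF holds plain floats — might look roughly like the following sketch. The tensor names, the `.qscale_weight` pairing convention, and the per-output-channel scale layout are assumptions for illustration, not the converter's actual code:

```python
# Hypothetical sketch: fold FP8-style scale tensors into their weights
# so downstream tooling sees ordinary float tensors.
import numpy as np

QSCALE_SUFFIX = ".qscale_weight"  # assumed companion-tensor naming

def dequantize_pairs(tensors: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Multiply each weight by its matching scale tensor and drop the scale."""
    out = {}
    for name, t in tensors.items():
        if name.endswith(QSCALE_SUFFIX):
            continue  # consumed together with its weight below
        scale = tensors.get(name.replace(".weight", QSCALE_SUFFIX))
        if scale is not None and name.endswith(".weight"):
            # Assumed per-output-channel scale: broadcast across columns.
            out[name] = t.astype(np.float32) * scale.reshape(-1, 1)
        else:
            out[name] = t
    return out
```

Because the scales are consumed at conversion time, the resulting GGUF contains no `.qscale_weight` tensors, which is why the runtime can treat the model like any non-quantized checkpoint.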

Performance Impact: Zero impact on inference performance. The conversion script executes during model preparation, not during runtime inference. Power consumption analysis confirms 0.0% change across all binaries (libllama.so, llama-run, llama-cli, llama-server). No functions in the inference path (llama_decode, llama_encode, llama_tokenize) are modified. Tokens per second remains unchanged.

The code changes are isolated to the conversion utility and do not affect the compiled binaries or runtime execution paths. The converter outputs standard GGUF format with dequantized weights, which llama.cpp processes identically to non-quantized models.
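The config transformation mentioned above — rewriting a Mistral-style quantization section into the `quantization_config` layout that HuggingFace-oriented tooling reads — could be sketched as follows. The key names (`quantization`, `method`, `quant_method`) are hypothetical placeholders, not the PR's exact mapping:

```python
# Hypothetical sketch: move a Mistral-style "quantization" section into the
# HF-style "quantization_config" key that the converter's HF paths expect.
def mistral_quant_to_hf(config: dict) -> dict:
    quant = config.pop("quantization", None)  # assumed Mistral key
    if quant is not None:
        config["quantization_config"] = {
            "quant_method": quant.get("method", "fp8"),  # assumed default
            **{k: v for k, v in quant.items() if k != "method"},
        }
    return config
```

A transformation like this lets the rest of the script stay on a single HF-shaped code path instead of branching on the config dialect.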

@loci-dev loci-dev force-pushed the main branch 23 times, most recently from 8ed91d0 to 985a61f Compare December 12, 2025 19:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 691dba3 to 5c24b24 Compare December 17, 2025 18:13
