UPSTREAM PR #17889: convert: allow using quantized Mistral weight#501
Co-authored-by: compilade <[email protected]>
Performance Review Summary — PR #501: Mistral Quantized Weight Conversion Support

This PR modifies the model conversion utility only, so its impact on inference performance is zero: the conversion script runs during model preparation, not at runtime. Power consumption analysis confirms a 0.0% change across all binaries (libllama.so, llama-run, llama-cli, llama-server), no functions in the inference path (llama_decode, llama_encode, llama_tokenize) are modified, and tokens per second is unchanged. The changes are isolated to the conversion utility and do not affect the compiled binaries or runtime execution paths. The converter outputs standard GGUF with dequantized weights, which llama.cpp processes identically to non-quantized models.
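As a rough illustration of what "dequantized weights" means here, the sketch below recovers float32 weights from a per-row int8 quantization before they would be written to GGUF. The quantization scheme, function name, and shapes are assumptions for illustration only; the actual checkpoint format handled by the PR may differ.

```python
import numpy as np

def dequantize_rows(qweight: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover float32 weights from per-row int8 quantization.

    qweight: (rows, cols) int8 tensor
    scales:  (rows,) float32 per-row scale factors
    Illustrative scheme only -- not the PR's actual code.
    """
    return qweight.astype(np.float32) * scales[:, None]

# Round-trip sketch: quantize a random matrix, then dequantize it.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
scales = np.abs(w).max(axis=1) / 127.0
q = np.clip(np.round(w / scales[:, None]), -127, 127).astype(np.int8)
w_hat = dequantize_rows(q, scales)
# Per-element round-trip error is bounded by half the row's scale.
```

Because the dequantized tensors are plain float32, the GGUF writer and the llama.cpp runtime see them exactly as they would see a non-quantized checkpoint, which is why the PR has no inference-side effect.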
Force-pushed: 8ed91d0 → 985a61f
Force-pushed: 691dba3 → 5c24b24
Mirrored from ggml-org/llama.cpp#17889
target model: https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512
Requires --mistral-format; without it, the tokenizer fails to convert correctly.
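A usage sketch of the conversion, assuming llama.cpp's convert_hf_to_gguf.py script; the local directory and output filename are hypothetical placeholders:

```
# Hypothetical invocation -- paths are placeholders.
python convert_hf_to_gguf.py ./Devstral-Small-2-24B-Instruct-2512 \
    --outfile devstral-small-2.gguf \
    --mistral-format
```

The --mistral-format flag is the one noted above as required for this model's tokenizer.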