[Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation #3356
Merged: danielhanchen merged 23 commits into unslothai:main from rolandtannous:fix/llamacpp-compatibility-gguf-conversion on Oct 14, 2025 (+2,908 −513)
Conversation
danielhanchen requested changes on Sep 24, 2025:
Nice work
@mmathew23 @Datta0 Can you guys also review this - appreciate it :)
mmathew23 reviewed on Oct 8, 2025:
Few comments, thanks!
PROBLEM
Depends on unslothai/unsloth-zoo#302
The existing GGUF conversion system was non-functional due to upstream changes in llama.cpp and a broken llama.cpp integration. Users encountered critical issues when trying to convert fine-tuned models to GGUF format for deployment:
- Users had to manually call `get_chat_template()` as a prerequisite step
- Generated Ollama Modelfiles were missing required directives (`FROM`, `TEMPLATE`) and failed when using `ollama create`

SOLUTION
Two-Stage Conversion Architecture
A two-stage conversion approach separates the high-precision base conversion from multi-target quantization:
- Stage 1: convert the fine-tuned model once into a single high-precision intermediate GGUF
- Stage 2: generate every requested quantization variant from that intermediate file

Critical fix: updated the first-conversion precision logic, since new llama.cpp versions no longer support requantizing from the q8_0 format; this prevents conversion failures with recent llama.cpp builds.
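For orientation, here is a minimal sketch of the raw llama.cpp workflow the two stages map onto; the file names, the f16 intermediate choice, and the quantization targets are illustrative, not the PR's actual code:

```python
import subprocess

# Stage 1: one expensive high-precision conversion. An f16 intermediate is
# used here rather than q8_0, since recent llama.cpp builds no longer
# support requantizing from q8_0.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", "finetuned_model",
     "--outfile", "model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# Stage 2: cheap fan-out from the single intermediate to every target.
for quant in ["Q4_K_M", "Q5_K_M", "Q8_0"]:
    subprocess.run(
        ["llama.cpp/llama-quantize", "model-f16.gguf",
         f"model-{quant.lower()}.gguf", quant],
        check=True,
    )
```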
Full llama.cpp Quantization Support with Multi-Format Processing
Extended quantization method support to all quantization formats available in llama.cpp. Users can also specify multiple quantization formats in a single operation.
The system performs the expensive initial conversion once, then generates all quantization variants from the intermediate representation, eliminating redundant processing and significantly reducing storage overhead and conversion time.
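A sketch of the intended call, assuming `quantization_method` accepts either a single string or a list of formats as described above:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("finetuned_model")

# One expensive base conversion, three quantized outputs.
model.save_pretrained_gguf(
    "gguf_model",
    tokenizer,
    quantization_method=["q4_k_m", "q5_k_m", "q8_0"],
)
```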
Modular llama.cpp Integration with Orchestrated Pipeline
The code now uses a clean, modular integration. The new `save_to_gguf()` function serves as the main orchestrator, delegating specialized operations to `unsloth_zoo.llama_cpp` modules:
- `check_llama_cpp()`
- `_download_convert_hf_to_gguf()`
- `convert_to_gguf()`
- `quantize_gguf()`
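A simplified, hypothetical view of the delegation order; the real functions in `unsloth_zoo.llama_cpp` take more parameters than shown here:

```python
from unsloth_zoo.llama_cpp import (
    check_llama_cpp,
    _download_convert_hf_to_gguf,
    convert_to_gguf,
    quantize_gguf,
)

def save_to_gguf(model_dir, quantization_methods):
    # Ensure llama.cpp binaries are present and usable.
    check_llama_cpp()
    # Fetch the upstream conversion script if it is missing.
    _download_convert_hf_to_gguf()
    # Stage 1: high-precision base conversion.
    base_gguf = convert_to_gguf(model_dir)
    # Stage 2: one quantized file per requested method.
    return [quantize_gguf(base_gguf, method) for method in quantization_methods]
```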
Enhanced Save Functions with Comprehensive Metadata
Redesigned `save_pretrained_gguf()`.
Restructured `push_to_hub_gguf()`: it now calls `save_pretrained_gguf()` first, then systematically uploads the results.
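A sketch of the restructured flow from the caller's side; the repo id and token are placeholders:

```python
# Saves locally via save_pretrained_gguf() internally, then uploads
# every generated GGUF variant to the Hub.
model.push_to_hub_gguf(
    "your-username/finetuned-model-gguf",
    tokenizer,
    quantization_method=["q4_k_m", "q8_0"],
    token="hf_...",
)
```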
Automated Ollama Modelfile Creation
Template-to-Model Mapping System:
Introduced a systematic model-to-template association via `TEMPLATE_TO_MODEL_MAPPER` and `MODEL_TO_TEMPLATE_MAPPER`. This eliminates the need for users to manually call `get_chat_template()` as a precondition, enabling automatic selection of the appropriate chat template for Ollama Modelfile generation based on the model architecture.
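A hypothetical excerpt showing the shape of the mapping and the lookup it enables; the real tables in `template_mappers.py` cover many more architectures:

```python
# Illustrative entries only; not the actual mapper contents.
MODEL_TO_TEMPLATE_MAPPER = {
    "gpt-oss": "gpt-oss",
    "qwen3": "qwen3",
    "gemma-3n": "gemma3n",
}

def select_template(model_name: str) -> str | None:
    """Pick a chat template from the model name, no get_chat_template() needed."""
    name = model_name.lower()
    for key, template in MODEL_TO_TEMPLATE_MAPPER.items():
        if key in name:
            return template
    return None
```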
Template Fixes and Additions:
- Restored missing `FROM` and `TEMPLATE` directives in broken Ollama templates for the gpt-oss, qwen3, and Gemma3n architectures
- Generated Modelfiles now work with `ollama create` without manual intervention
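For context, a minimal illustrative Modelfile carrying the two directives that were previously missing; the template body here is a placeholder rather than any real architecture's template:

```
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
```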
Dependency Resolution and Architectural Improvements
Eliminated Circular Imports:
Relocated `CHAT_TEMPLATES` from `chat_templates.py` to a dedicated `template_mappers.py` module, allowing imports from both `save.py` and `chat_templates.py` while avoiding circular-import errors.
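The resulting import structure, in sketch form (module paths assumed from the file names above):

```python
# Before: save.py imported chat_templates.py, which imported save.py back.
# After: both depend only on the new leaf module, breaking the cycle.

# in save.py
from .template_mappers import CHAT_TEMPLATES

# in chat_templates.py
from .template_mappers import CHAT_TEMPLATES
```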
Multiple testing rounds were run during development, after the initial branch commit to the fork, and on the final commit before this PR.
Testing branches: https://github.com/unslothai/rolandtannous/unsloth-zoo@fix/llamacpp-compatibility-gguf-conversion and https://github.com/unslothai/rolandtannous/unsloth@fix/llamacpp-compatibility-gguf-conversion
End-to-End Testing:
- Verified converted GGUF files with `llama-cli` for text models and `llama-mtmd-cli` for multimodal models
- Verified generated Modelfiles with `ollama run model-name`
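The corresponding verification commands, with model and file names as placeholders:

```sh
# Smoke-test a quantized text model directly with llama.cpp.
llama.cpp/llama-cli -m model-q4_k_m.gguf -p "Hello, world"

# Register the generated Modelfile with Ollama, then chat with it.
ollama create model-name -f ./Modelfile
ollama run model-name
```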
Models Tested:
gpt-oss, llama3.1, llama3.2, Pixtral, Gemma3n, Gemma3, Gemma2, Qwen2, Qwen2.5, Qwen3, Mistral, and Phi models
Also tested gpt-oss-20b on a Colab T4. Link to notebook
Solves
#3348
#3297
#3090
#3229
#3215
#3202
#3194
#3133
#3124
#3040
#2984
#2950
#2860
#2667
#2580
#2526
#2478
#2399
#2370
#2365
#2360
#2326
#2321
#2290
#2209
#2193
#2115
#2058
#2007
#1917
#1905
#1903
#1846
#1781
#1729
#1721
#1645
#1610
#1546
#1504
#965
#835
#748
#785
#2098
#3050