
@rolandtannous rolandtannous commented Sep 23, 2025

PROBLEM

Depends on unslothai/unsloth-zoo#302
The existing GGUF conversion system was non-functional due to upstream changes in llama.cpp that broke the integration. Users hit critical issues when converting fine-tuned models to GGUF format for deployment:

  • The llama.cpp installation process frequently failed
  • Users were sometimes limited to only a few basic quantization types
  • Ollama Modelfile creation required users to manually call get_chat_template() as a prerequisite step
  • Several Ollama chat templates were missing required Modelfile directives (FROM, TEMPLATE) and failed when using ollama create

SOLUTION

Two-Stage Conversion Architecture

Introduced a two-stage conversion approach that separates the high-precision base conversion from multi-target quantization:

  • Stage one: converts the model once to an optimal intermediate precision format (f32/f16/bf16) using convert_hf_to_gguf.py.
  • Stage two: applies llama-quantize to produce every requested quantization format from that intermediate file.

Critical fix: updated the first-conversion precision logic, since new llama.cpp versions no longer support requantizing from the q8_0 format; this prevents conversion failures with recent llama.cpp builds.
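A minimal sketch of the two stages, assuming a local llama.cpp checkout; the flag names match current llama.cpp, but every path and filename below is a placeholder:

```python
import subprocess

# Stage one: a single high-precision base conversion. bf16 is used instead of
# q8_0 because recent llama.cpp builds no longer requantize from q8_0.
subprocess.run([
    "python", "llama.cpp/convert_hf_to_gguf.py", "finetuned_model/",
    "--outfile", "model-BF16.gguf", "--outtype", "bf16",
], check = True)

# Stage two: every requested quantization is derived from the same
# intermediate file, so the expensive conversion above runs only once.
for quant in ["Q8_0", "Q4_K_M", "Q5_K_M", "Q2_K"]:
    subprocess.run([
        "llama.cpp/build/bin/llama-quantize",
        "model-BF16.gguf", f"model-{quant}.gguf", quant,
    ], check = True)
```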

Full llama.cpp Quantization Support with Multi-Format Processing

Extended quantization method support to every quantization format available in llama.cpp. Users can also specify multiple quantization formats in a single operation:

quantization_method=["q8_0", "q4_k_m", "q5_k_m", "q2_k"]

The system performs the expensive initial conversion once, then generates all quantization variants from the intermediate representation, eliminating redundant processing and significantly reducing storage overhead and conversion time.
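For example, assuming the familiar unsloth save API (the model name and output directory are placeholders), a single call can request all four variants:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
# ... fine-tune ...

# One base conversion runs once; each quant below is then produced
# with llama-quantize from that intermediate file.
model.save_pretrained_gguf(
    "finetuned_model_gguf",
    tokenizer,
    quantization_method = ["q8_0", "q4_k_m", "q5_k_m", "q2_k"],
)
```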

Modular llama.cpp Integration with Orchestrated Pipeline

The code now uses a clean, modular integration. The new save_to_gguf() function serves as the main orchestrator, delegating specialized operations to unsloth_zoo.llama_cpp modules, as sketched after this list:

  1. Installation verification via check_llama_cpp()
  2. Converter preparation via _download_convert_hf_to_gguf()
  3. Initial conversion via convert_to_gguf()
  4. Multi-quantization via quantize_gguf()
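A hypothetical sketch of that orchestration; the four helpers are the module entry points named above, but their signatures here are assumptions:

```python
from unsloth_zoo.llama_cpp import (
    check_llama_cpp,                # 1. verify or install llama.cpp
    _download_convert_hf_to_gguf,   # 2. fetch the converter script
    convert_to_gguf,                # 3. HF checkpoint -> intermediate GGUF
    quantize_gguf,                  # 4. intermediate -> each requested quant
)

def save_to_gguf(model_dir, quant_methods, first_conversion = "bf16"):
    check_llama_cpp()  # fail early if the toolchain is unusable
    converter = _download_convert_hf_to_gguf()
    base_gguf = convert_to_gguf(model_dir, converter, dtype = first_conversion)
    return [quantize_gguf(base_gguf, method) for method in quant_methods]
```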

Enhanced Save Functions with Comprehensive Metadata

Redesigned save_pretrained_gguf():

  • Returns a comprehensive metadata dictionary containing all conversion results, file locations, and model characteristics (see the sketch after this list)
  • VLM Detection: automatic detection of Vision-Language Models with proper dual-file handling (model.gguf + mmproj.gguf)
  • GPT-OSS Support: special handling for GPT-OSS architecture models, which require a different conversion path
  • Smart First-Conversion Selection: automatically chooses the optimal intermediate format based on the target quantizations and hardware capabilities
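Illustrative consumption of that return value; the exact keys are assumptions, not a guaranteed shape:

```python
results = model.save_pretrained_gguf(
    "out", tokenizer, quantization_method = ["q4_k_m"],
)
for path in results.get("gguf_files", []):   # "gguf_files" is an assumed key
    print("saved:", path)
# For a VLM, expect two files per the dual-file handling above, e.g.
#   out/model-Q4_K_M.gguf and out/mmproj-F16.gguf
```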

Restructured push_to_hub_gguf():

  • Leverages Local Conversion: calls save_pretrained_gguf() first, then systematically uploads the results (usage sketched after this list)
  • Proper File Naming: Handles temporary directories and ensures correct model naming for Hub upload
  • Comprehensive Upload: Automatically uploads GGUF files, config.json, README.md, and Ollama Modelfile
  • Enhanced Error Handling: Improved error messages and cleanup procedures for failed upload operations
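An assumed call shape, mirroring save_pretrained_gguf; the repo name and token are placeholders:

```python
# Converts locally first, then uploads the GGUF files, config.json,
# README.md, and the Ollama Modelfile to the Hub.
model.push_to_hub_gguf(
    "your-username/finetuned-model-GGUF",   # Hub repo to create or update
    tokenizer,
    quantization_method = ["q8_0", "q4_k_m"],
    token = "hf_...",                        # Hugging Face write token
)
```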

Automated Ollama Modelfile Creation

Template-to-Model Mapping System:
Introduced a systematic model-to-template association via TEMPLATE_TO_MODEL_MAPPER and MODEL_TO_TEMPLATE_MAPPER. This eliminates the need for users to manually call get_chat_template() beforehand, enabling automatic selection of the appropriate chat template for Ollama Modelfile generation based on the model architecture.
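A minimal sketch of how such a lookup could work; the dictionary entries and fallback below are invented, and the real tables live in template_mappers.py:

```python
MODEL_TO_TEMPLATE_MAPPER = {
    "qwen3":    "qwen3",      # illustrative entries only
    "gemma-3n": "gemma3n",
    "gpt-oss":  "gpt-oss",
}

def resolve_ollama_template(model_name: str, default: str = "chatml") -> str:
    """Pick a chat template for Modelfile generation from the model name."""
    lowered = model_name.lower()
    for key, template in MODEL_TO_TEMPLATE_MAPPER.items():
        if key in lowered:
            return template
    return default  # assumed fallback behaviour
```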

Template Fixes and Additions:

  • Fixed missing FROM and TEMPLATE directives in broken Ollama templates for the gpt-oss, Qwen3, and Gemma 3n architectures
  • Added new chat templates (Starling, Yi-chat) with proper Ollama formatting
  • Ensures all generated Modelfiles work immediately with ollama create, without manual intervention (a minimal Modelfile sketch follows this list)
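A hedged sketch of the Modelfile shape the fix guarantees, written from Python for illustration; the quantization filename, template body, and stop token are placeholders:

```python
# Every generated Modelfile now carries FROM and TEMPLATE, so `ollama create`
# accepts it directly. The template body below is a placeholder, not a real
# chat template from the library.
modelfile = '''FROM ./model-Q4_K_M.gguf
TEMPLATE """{{ if .System }}{{ .System }}{{ end }}{{ .Prompt }}"""
PARAMETER stop "<|im_end|>"
'''
with open("Modelfile", "w") as f:
    f.write(modelfile)
# Then: ollama create my-model -f Modelfile && ollama run my-model
```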

Dependency Resolution and Architectural Improvements

Eliminated Circular Imports:
Relocated CHAT_TEMPLATES from chat_templates.py to a dedicated template_mappers.py module, so that both save.py and chat_templates.py can import it without triggering circular-import errors.
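A structural sketch of the fix, with module paths assumed from the description; each file is shown as a commented section rather than one runnable script:

```python
# template_mappers.py — new leaf module, imports neither consumer below
CHAT_TEMPLATES = {}            # template definitions moved here
MODEL_TO_TEMPLATE_MAPPER = {}  # architecture -> template table

# chat_templates.py — now depends only on the leaf module
from unsloth.template_mappers import CHAT_TEMPLATES

# save.py — uses the same data without importing chat_templates.py
from unsloth.template_mappers import CHAT_TEMPLATES, MODEL_TO_TEMPLATE_MAPPER
```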

Testing

Multiple rounds of testing were performed during development, after the initial branch commit to the fork, and before the final commit for this PR.
Testing branches: https://github.com/unslothai/rolandtannous/unsloth-zoo@fix/llamacpp-compatibility-gguf-conversion and https://github.com/unslothai/rolandtannous/unsloth@fix/llamacpp-compatibility-gguf-conversion

End to End Testing:

  • Local and Colab environments
  • Tested both saving locally and pushing to the Hub
  • Tested and verified proper post-conversion inference using llama.cpp's llama-cli for text models and llama-mtmd-cli for multimodal models
  • Tested creation of Ollama models using the generated Modelfile
  • Tested Ollama model inference using ollama run model-name

Models Tested:

gpt-oss, Llama 3.1, Llama 3.2, Pixtral, Gemma 3n, Gemma 3, Gemma 2, Qwen2, Qwen2.5, Qwen3, Mistral, and Phi models

Also tested gpt-oss-20b on a Colab T4. Link to notebook

Solves

#3348
#3297
#3090
#3229
#3215
#3202
#3194
#3133
#3124
#3040
#2984
#2950
#2860
#2667
#2580
#2526
#2478
#2399
#2370
#2365
#2360
#2326
#2321
#2290
#2209
#2193
#2115
#2058
#2007
#1917
#1905
#1903
#1846
#1781
#1729
#1721
#1645
#1610
#1546
#1504
#965
#835
#748
#785
#2098
#3050

@rolandtannous rolandtannous changed the title [Part1] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation [Part2] Reinstate llama.cpp Compatibility and GGUF Conversion with Multiple Quantizations and Automated Ollama Modelfile Creation Sep 23, 2025
@danielhanchen danielhanchen left a comment

Nice work

@danielhanchen

@mmathew23 @Datta0 Can you guys also review this - appreciate it :)


@mmathew23 mmathew23 left a comment


Few comments, thanks!

@rolandtannous rolandtannous force-pushed the fix/llamacpp-compatibility-gguf-conversion branch from 07ea7f8 to 48adee8 on October 13, 2025
@danielhanchen danielhanchen merged commit 05e91e7 into unslothai:main Oct 14, 2025