@rolandtannous
Solves

Problem Description

Language models were experiencing multiple critical issues with the save and merge functionality:

  1. Merging and saving operations failing for language models
  2. Missing index files for sharded language models causing loading failures (see the check sketched after this list)
  3. Push to hub merged functionality not working properly for language models
  4. Performance degradation in merged language models
  5. Inconsistent behavior between vision models and language models in merge operations
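
Issue 2 above is easy to spot on disk: a sharded safetensors checkpoint needs a `model.safetensors.index.json` mapping each weight to its shard file, and loading fails without it. A minimal, hypothetical check (the directory name is a placeholder):

```python
import json
import os

save_dir = "merged-model"  # placeholder output directory
index_path = os.path.join(save_dir, "model.safetensors.index.json")

if os.path.exists(index_path):
    with open(index_path) as f:
        index = json.load(f)
    # The index maps each weight name to the shard file that stores it.
    shards = set(index["weight_map"].values())
    missing = [s for s in shards
               if not os.path.exists(os.path.join(save_dir, s))]
    print(f"{len(shards)} shards referenced, {len(missing)} missing")
else:
    print("No index file: loading this sharded checkpoint will fail")
```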

Solution

Extended the save and merge logic recently implemented for vision models (unsloth-zoo PR #158) to language models, ensuring consistent behavior across all model types.
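
For reference, a minimal sketch of the merged-save workflow this PR fixes, using unsloth's `save_pretrained_merged` and `push_to_hub_merged` helpers; the model, repo, and directory names are placeholders:

```python
from unsloth import FastLanguageModel

# Load a base model and attach LoRA adapters (names are placeholders).
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/mistral-7b-v0.3", load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(
    model, r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# ... fine-tune ...

# Merge the LoRA weights into the base weights and save locally.
model.save_pretrained_merged("merged-model", tokenizer,
                             save_method="merged_16bit")

# Or merge and upload directly to the Hugging Face Hub.
model.push_to_hub_merged("your-username/merged-model", tokenizer,
                         save_method="merged_16bit")
```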

Performance Benchmarking

Perplexity:

We used perplexity (lower is better) to compare model quality across the save/merge lifecycle. With the old logic, the merged 4-bit perplexities regress well above the PEFT values (e.g. 4.07 vs 2.76 for Mistral); the new logic keeps merged perplexities close to the PEFT baseline:

Old Merge Logic:

| Model | Base model | PEFT | Merged (4-bit load) | Merged (8-bit load) | Merged (16-bit load) |
|---|---|---|---|---|---|
| mistral-v0.3-7b | 5.228433 | 2.759816 | 4.065556 | 2.761464 | 2.759856 |
| Phi-4 | 4.660052 | 3.523139 | 5.448963 | 3.521446 | 4.872157 |
| Qwen 2.5 7B Instruct | 8.489007 | 3.380023 | 4.562289 | 3.701273 | 3.380541 |
| Llama-3.2-1B-Instruct | 15.281656 | 11.005575 | 14.249228 | 11.0179 | 11.007604 |
| Llama-3.1-8B-Instruct | 10.789843 | 7.301686 | 9.339204 | 7.299874 | 7.300654 |

New Merge Logic:

| Model | Base model | PEFT | Merged (4-bit load) | Merged (8-bit load) | Merged (16-bit load) |
|---|---|---|---|---|---|
| mistral-v0.3-7b | 5.228433 | 2.761002 | 2.763541 | 2.737924 | 2.736141 |
| Phi-4 | 4.660052 | 3.519993 | 3.645571 | 3.435464 | 4.754354 |
| Qwen 2.5 7B Instruct | 8.489007 | 3.380995 | 3.394698 | 3.705793 | 3.345016 |
| Llama-3.2-1B-Instruct | 15.281656 | 10.965628 | 11.067917 | 10.539675 | 10.550692 |
| Llama-3.1-8B-Instruct | 10.789843 | 7.256952 | 7.435827 | 7.234497 | 7.240123 |
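
For context, perplexity scores like those above can be computed with a loop like the following; this is a hypothetical sketch, not necessarily the evaluation module added in this PR:

```python
import math

import torch

def perplexity(model, tokenizer, texts, max_length=2048):
    """Corpus-level perplexity: exp(total NLL / total tokens)."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_length).to(model.device)
        with torch.no_grad():
            # labels=input_ids makes the model return the mean
            # next-token cross-entropy over the sequence.
            out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel()
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```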

AIME eval for GRPO models

AIME 2024+2025 Evaluation Results:

  • Base model: 8.3% accuracy
  • PEFT model: 11.7% accuracy
  • Merged model (16-bit): 11.7% accuracy
  • Merged model (4-bit): 10.8% accuracy

These results confirm that merged models maintain performance equivalent to the PEFT model.
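
AIME answers are integers from 0 to 999, so scoring reduces to extracting the model's final number and comparing it to the reference. A hypothetical scoring helper (the actual evaluation module may differ):

```python
import re

def aime_accuracy(completions, answers):
    """Fraction of completions whose last 0-999 integer matches the answer."""
    correct = 0
    for text, answer in zip(completions, answers):
        nums = re.findall(r"\b\d{1,3}\b", text)
        if nums and int(nums[-1]) == int(answer):
            correct += 1
    return correct / len(answers)
```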

Testing

  • Perplexity Tests: Llama-3.1-8B, Llama-3.2, Phi-4, Mistral-7B, Qwen2.5-7B ✅
  • Push to Hub Test: Verified successful upload and retrieval of merged models ✅
  • Model Index Test: Confirmed proper index file generation for sharded models ✅
  • GRPO Performance Test: Validated using AIME evaluation benchmark ✅
  • OCR Evaluation: Extended vision model testing for comprehensive coverage ✅

Evaluation Modules Added

Created reusable evaluation modules for:

  • Perplexity testing across model architectures
  • OCR evaluation for vision capabilities
  • AIME mathematical reasoning evaluation

Final notes

When not performing merge operations, users should use the standard calls (the paths shown are placeholders):

```python
model.save_pretrained("output-dir")      # non-merged model saving
model.push_to_hub("your-username/repo")  # non-merged model hub upload
```

@danielhanchen merged commit c6b6208 into unslothai:main on Jun 3, 2025.
Successfully merging this pull request may close these issues:

Applying LoRA Doesn't Change Model Output
