This project investigates joint optimization of LoRA rank allocation and mixed-precision quantization for efficient fine-tuning on laptop-class hardware.
```bash
# Install dependencies
pip install -r requirements.txt

# Run single experiment
python run_experiment.py --config B-FP --task sst2 --model bert-base-uncased

# Run full experiment matrix
python run_all_experiments.py --epochs 3 --batch-size 8
```

Apple Silicon Macs have MPS (Metal Performance Shaders) limitations that can cause tensor size errors. The system automatically detects MPS and applies MPS-safe settings:
```bash
# Automatic MPS detection (recommended)
python run_experiment.py --config B-FP --task sst2 --model bert-base-uncased

# Force MPS-safe settings (smaller batch sizes)
python run_experiment.py --config B-FP --task sst2 --model bert-base-uncased --mps-safe

# Manual batch size reduction
python run_experiment.py --config B-FP --task sst2 --model bert-base-uncased --batch-size 4
```
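For reference, here is a minimal sketch of how MPS detection and a batch-size fallback might look in your own scripts; the `pick_device_settings` helper and the specific cap of 4 are illustrative assumptions, not part of this project's CLI:

```python
import torch

def pick_device_settings(requested_batch_size: int = 8):
    """Hypothetical helper: choose a device and an MPS-safe batch size.

    On Apple Silicon, large tensors can trip the 2**32-byte MPS limit,
    so we fall back to a smaller batch size when MPS is the backend.
    """
    if torch.cuda.is_available():
        return torch.device("cuda"), requested_batch_size
    if torch.backends.mps.is_available():
        # MPS-safe: cap the batch size to reduce per-tensor memory.
        return torch.device("mps"), min(requested_batch_size, 4)
    return torch.device("cpu"), requested_batch_size

device, batch_size = pick_device_settings(8)
print(f"Using {device} with batch size {batch_size}")
```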
If you encounter MPS tensor size errors, try:

- Reduce batch size: `--batch-size 4` or `--batch-size 2`
- Use Docker with a GPU backend (recommended for quantization)
- Disable quantization: `--quant-backend none`
```bash
# Build and run on GPU
./scripts/docker-run.sh --build --gpu --wandb-key $WANDB_API_KEY run-all

# Run single experiment on CPU
./scripts/docker-run.sh --cpu run B-FP sst2

# Interactive development
./scripts/docker-run.sh --gpu shell
```

- B-FP: Baseline fixed-rank LoRA (FP16)
- B-Q4: Baseline 4-bit QLoRA
- B-Ada: Baseline AdaLoRA (adaptive rank)
- Joint-1/2/3: Combined adaptive rank + quantization
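For orientation, the following is a minimal sketch of what a B-Q4-style setup (the 4-bit QLoRA baseline above) could look like using Hugging Face `transformers` and `peft`; the rank, target modules, and other hyperparameters here are illustrative assumptions, not the project's exact configuration:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for the frozen base weights (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    quantization_config=bnb_config,  # requires a CUDA-capable bitsandbytes build
)

# Fixed-rank LoRA adapters on the attention projections (illustrative choices).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],
    task_type="SEQ_CLS",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```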
Use `--quant-backend` to control the quantization strategy:
- `auto`: Auto-detect best backend (default)
- `cuda-4bit`: 4-bit quantization on GPU
- `cuda-8bit`: 8-bit quantization on GPU
- `cpu-int8`: 8-bit CPU quantization
- `none`: Disable quantization
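A minimal sketch of how the `auto` choice could be resolved into a concrete backend; the `resolve_backend` helper name and the exact fallback order are assumptions for illustration:

```python
import torch

def resolve_backend(requested: str = "auto") -> str:
    """Hypothetical resolution of --quant-backend=auto to a concrete backend."""
    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda-4bit"   # bitsandbytes 4-bit needs a CUDA device
    if torch.backends.mps.is_available():
        return "none"        # bitsandbytes quantization is not supported on MPS
    return "cpu-int8"        # CPU int8 fallback

print(resolve_backend("auto"))
```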
Examples:
```bash
# Force CPU quantization
python run_experiment.py --config B-Q4 --task sst2 --quant-backend cpu-int8

# GPU-only 4-bit (fails on non-CUDA hardware)
python run_experiment.py --config B-Q4 --task sst2 --quant-backend cuda-4bit

# Disable quantization
python run_experiment.py --config B-Q4 --task sst2 --quant-backend none
```

If you see:
```
MPSNDArray.mm:788: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
Solutions:

- Use smaller batch sizes: `--batch-size 4` or `--batch-size 2`
- Use Docker with GPU: recommended for quantization experiments
- Disable quantization: `--quant-backend none` for local testing
For systems with limited RAM:

- Reduce batch size: `--batch-size 4`
- Use gradient accumulation (automatically adjusted; see the sketch after this list)
- Close other applications
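As a sketch, the number of accumulation steps can be derived from a target effective batch size; the variable names and values below are illustrative, not the project's actual settings:

```python
import math

# Illustrative: keep the effective batch size constant while shrinking
# the per-step batch size to fit in limited RAM.
target_effective_batch = 16   # assumed training target
per_step_batch = 4            # what actually fits in memory
grad_accum_steps = math.ceil(target_effective_batch / per_step_batch)
print(grad_accum_steps)  # 4
```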
On Apple Silicon:

- Use `--quant-backend none` for local development
- Use Docker with GPU for full quantization experiments
- CPU quantization: `--quant-backend cpu-int8` (slower but stable)
```
results/
├── results_B-FP_sst2.json        # Individual experiment results
├── results_B-Q4_wikitext2.json
├── all_results.json              # Aggregated results
└── experiment_summary.csv        # Summary table
```
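To work with the output programmatically, something like the following could collect the per-experiment result files; the field names inside the JSON (`config`, `task`, `accuracy`) are assumptions about the schema, not documented keys:

```python
import json
from pathlib import Path

# Collect every per-experiment result file into one list of records.
records = []
for path in sorted(Path("results").glob("results_*.json")):
    with path.open() as f:
        data = json.load(f)
    # "config", "task", and "accuracy" are assumed keys for illustration only.
    records.append({
        "file": path.name,
        "config": data.get("config"),
        "task": data.get("task"),
        "accuracy": data.get("accuracy"),
    })

for row in records:
    print(row)
```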
- Local: 16GB+ RAM, Apple M-series or Intel/AMD CPU
- Docker: NVIDIA GPU with 8GB+ VRAM (recommended)
- Quantization: CUDA-compatible GPU or CPU fallback