Methods for steering, fine-tuning, and aligning models after pretraining. These are the techniques that turn a base model into something useful.
Runtimes are measured on Apple M-series hardware with Python 3.12; times are wall-clock.
microppo.py and micromoe.py use a hybrid autograd approach to meet runtime constraints:
- microppo: Policy model uses scalar autograd (the `Value` class). Reward model and value function use plain float arrays with manual gradients; they're trained separately, before the PPO loop.
- micromoe: Router uses scalar autograd. Expert MLPs use plain float arrays; the routing decision is the novel mechanism, not the expert forward pass.
See docs/autograd-interface.md for the canonical interface and docs/implementation.md for per-script details.
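
As a rough illustration of the "plain float arrays with manual gradients" half of that split (a hedged sketch, not the code in microppo.py or micromoe.py): a single linear unit trained on a squared-error loss, with gradients written out by hand and no autograd graph involved.

```python
# Hedged sketch of the "plain float arrays with manual gradients" idea:
# one linear unit y = w.x + b trained on a squared-error loss, with the
# gradients derived by hand instead of coming from an autograd graph.

def train_step(w, b, x, target, lr=0.1):
    y = sum(wi * xi for wi, xi in zip(w, x)) + b       # forward pass
    loss = (y - target) ** 2
    dy = 2.0 * (y - target)                            # dloss/dy, derived by hand
    w = [wi - lr * dy * xi for wi, xi in zip(w, x)]    # dloss/dw_i = dy * x_i
    b = b - lr * dy                                    # dloss/db   = dy
    return w, b, loss

w, b = [0.0, 0.0], 0.0
for _ in range(100):
    w, b, loss = train_step(w, b, x=[1.0, 2.0], target=3.0)
print(f"loss={loss:.6f}  w={[round(wi, 3) for wi in w]}  b={b:.3f}")
```

The scalar-autograd halves (the policy model and the router) build their forward passes out of `Value` objects instead, so the corresponding gradients come from backpropagation rather than hand derivation.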
| Algorithm | What It Would Teach | Notes |
|---|---|---|
| Learning Rate Scheduling | Warmup, cosine decay, step decay | How schedule choice affects convergence |
| Knowledge Distillation | Training small models to mimic large ones | Compression via soft targets |
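
As a taste of what those would look like, here is a hedged sketch of a warmup + cosine decay learning-rate schedule; the base rate, warmup length, and step counts below are hypothetical.

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps                 # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 500, 999):
    print(f"step {step:4d}  lr {lr_at(step):.2e}")
```

Warmup avoids large, noisy updates at the start of training; cosine decay then anneals the rate smoothly toward min_lr. And a similarly hedged sketch of the soft-target loss at the heart of knowledge distillation, where a temperature softens the teacher's distribution (the logits below are invented):

```python
import math

def softmax(logits, temperature=1.0):
    m = max(l / temperature for l in logits)                       # for stability
    exps = [math.exp(l / temperature - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the student's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(teacher_probs, student_probs))

# Invented logits for a 3-class toy example.
print(distillation_loss(student_logits=[1.5, 0.2, -0.8], teacher_logits=[3.0, 0.5, -2.0]))
```

A higher temperature exposes more of the teacher's relative preferences among non-top classes, which is the extra signal the student learns from.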
These scripts build on the foundations tier. Recommended order:
microbatchnorm.py → How normalizing activations stabilizes training
microdropout.py → How regularization prevents overfitting
microlora.py → How fine-tuning works efficiently (1% of parameters; see the parameter-count sketch after this list)
microqlora.py → How quantization combines with LoRA for memory efficiency
microreinforce.py → How policy gradients turn rewards into learning signals
microdpo.py → How preference alignment works (without reward model)
microppo.py → How RLHF works (the full reward → policy loop)
microgrpo.py → How DeepSeek simplified RLHF with group-relative rewards (see the sketch after this list)
micromoe.py → How sparse routing scales model capacity
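
On the "1% of parameters" figure for microlora.py: LoRA freezes a pretrained weight matrix W and trains a low-rank pair A and B, so the effective weight becomes W + (alpha / r) * B @ A, and only A and B are trainable. A hedged back-of-the-envelope check (the layer size and rank below are hypothetical, chosen only to make the arithmetic concrete):

```python
# Hedged sketch: trainable parameters of a LoRA update versus full fine-tuning
# for a single weight matrix.  The 768x768 layer and rank 4 are hypothetical.
d_out, d_in, rank = 768, 768, 4

full_params = d_out * d_in                 # every entry of W is trainable
lora_params = rank * d_in + d_out * rank   # only A (rank x d_in) and B (d_out x rank)

print(f"full fine-tune : {full_params:,} trainable params")
print(f"LoRA (rank={rank}): {lora_params:,} trainable params "
      f"({100 * lora_params / full_params:.1f}% of the matrix)")
```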
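
The group-relative trick behind microgrpo.py is also small enough to sketch (hedged: this is the core advantage computation only, not microgrpo.py itself): sample several completions for one prompt, score each with the reward model, and normalize against the group's own mean and standard deviation, so no learned value function is needed.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward-model scores for four completions of a single prompt.
print(group_relative_advantages([0.1, 0.7, 0.4, 0.9]))
# Completions above the group mean get positive advantages (reinforced);
# those below get negative advantages (discouraged).
```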