Beating GPT-2 for <$100: the nanochat journey #481
Replies: 22 comments · 17 replies
-
This is really impressive engineering work — thanks for writing it up so clearly. One thing I keep wondering about in these “time-to-GPT-2” style experiments: my hunch is that some of the remaining fragility or spread we attribute to scaling efficiency might actually sit upstream, in what enters the training pipeline and under what human-side constraints, rather than in the training dynamics themselves. Curious if you’ve seen anything along these lines, or if this is something you’ve intentionally held fixed.
-
I hope you will explain to us, or build a full course on, multi-GPU / multi-node distributed training.
-
Kudos to sensei 🥹
-
🫡 $43K to $73 in 7 years. At ~2.5x per year cost decline, the barrier to learning ML is lowering fast.
-
Thank you so much! Is anyone willing to share the base model checkpoint? I would like to do some finetuning.
-
@karpathy Why are there no learnable parameters in rms_norm?
-
@karpathy it’d be great to explicitly document the data mixture, effective token count, scaling regime, and loss vs compute curves to make the GPT-2 comparison more rigorous.
-
Does anybody have a rough idea of how this compares in terms of average power consumption over the total training time? How does a cluster of 32 TPU v3s compare with an 8xH100 node?
-
@karpathy - any reason you didn't try µP (https://github.com/microsoft/mup) to parametrize the model and transfer optimal hyperparameters from smaller variants?
-
Great job!
-
Really impressive work, thanks for sharing the repo. Would love to see a tutorial or a YouTube walkthrough of this at some point.
-
If Value Embeddings act as high-capacity, zero-FLOP anchors at alternating layers, are we effectively moving away from a 'pure' Transformer toward a hybrid memory-compute architecture? I'm curious if this indicates that at low token counts, the model needs to 'memorize' the manifold structure via VE because it doesn't have enough gradient steps to 'learn' the logic via the Muon-constrained weights. Another doubt I had: Muon's Polar Express orthogonalization keeps weights on the Stiefel manifold, but does this create a 'representation bottleneck' for non-linear feature emergence? Awesome work @karpathy!
-
Even if we can't get cheaper than this for nanochat, the concept of such a resource-efficient ML model can be applied in so many other contexts. Great job, man!
-
Wow
-
@karpathy Could you share which methods didn't scale, i.e. worked at a smaller scale but not on bigger models? It would be quite interesting to understand which areas can be experimented with on a smaller budget. I only saw the general section on what worked and what didn't.
-
@karpathy Hi Karpathy, I'm one of the authors of NorMuon. I noticed that in the Optimizer section you omitted NorMuon and instead described it directly as "Factored variance reduction (Adafactor-style)." Although NorMuon and Adafactor share some similarities in form, their motivations and effects can be quite different. Adafactor reduces memory cost by using a low-rank approximation to fit the second-order moment matrix used by Adam for preconditioning. In contrast, Muon itself no longer relies on second-order moments for preconditioning; instead, preconditioning is performed implicitly via orthogonalization. We appreciate the summary of "Factored variance reduction" and the analogy to Adafactor, but we would like to ask whether you could add a reference to NorMuon in the corresponding place. This would allow readers to refer to our work and understand our original motivation, rather than mistakenly assuming this is a low-rank preconditioning approximation similar to Adafactor. Thank you!
-
Awesome! GOOD!
-
I've been training some GPT-ish models myself following your inspiration @karpathy. The fact that you have been able to train a GPT-2 model so cheaply is absolutely outstanding work. Congratulations on the achievement!
-
Just pulled the updates yesterday - very cool stuff, but on the other hand the further the architecture gets from vanilla, the more awkward it is to use as a base for other architectural experiments (which is what I found most attractive about the project - I like to learn by breaking stuff lol). Most of the changes didn't affect my private ramblings (on the contrary, efficiency boosts are welcome), but the value embeddings blindsided me and threw a rather large spanner in my works. Easy enough to disable by having has_ve() always return False, but I get the feeling it will also throw the training run parameters out for equivalent performance/accuracy. As a minimal change I've reverted to the old Chinchilla 20x ratio for the baseline I'm comparing my modifications against; I guess I'll see soonish if that works out equivalent to the lower token/params ratio that was being used with VE... but should I be adjusting anything else to compensate for turning VE off?
-
How about GPT-3 and GPT-4? Any ideas/plans?
-
Can someone please start toying with the following?


-
When OpenAI released GPT-2 in February 2019, training the largest model (1.5B parameters) required serious compute:
32 (TPU v3 chips) × 168 (hours) × $8 (per chip-hour) = $43,008 ≈ $43,000. Sources: a Reddit thread from 2019 and the HuggingFace model card.
Beating GPT-2 for <$100 from scratch has been a bit of an odd obsession for me, but finally here we are. Seven years later, we can beat GPT-2's performance with nanochat (~1000 lines of code) running on a single 8XH100 GPU node for ~3 hours. At ~$24/hour for an 8×H100 node, that's $73, i.e. a ~600× cost reduction. That is, each year the cost to train GPT-2 falls to approximately 40% of the previous year's. (I think this is an underestimate and that further improvements are still quite possible.) The gains come from everywhere: better hardware (H100 vs TPU v3), better software (Flash Attention 3, torch.compile), better algorithms (Muon optimizer, architectural improvements), and better data (FineWeb-edu).
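The cost arithmetic above is easy to check directly (numbers as quoted in the post):

```python
# ~$43,000 (2019) down to ~$73 (seven years later) to match GPT-2.
old_cost, new_cost, years = 43_000, 73, 7

reduction = old_cost / new_cost        # overall reduction, ~589x (i.e. ~600x)
annual = reduction ** (1 / years)      # ~2.5x cheaper every year
retained = 1 / annual                  # cost falls to ~40% of the prior year

print(f"~{reduction:.0f}x total, {annual:.2f}x/year, retains {retained:.0%}/year")
```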
Above: a nicely uneventful run of training a GPT-2 capability model, this one even a little bit better after tuning the warmdown ratio slightly from 0.4 to 0.5. The training time on the x-axis appears a bit longer on wandb because it includes inline evaluation.
The Goal
Our target is the CORE metric from the DCLM paper—a comprehensive evaluation across 22 high-quality benchmarks. GPT-2's CORE score is 0.256525. I introduced a new leaderboard to track how long it takes to reach this performance:
The leaderboard tracks wall-clock training time (excluding eval/logging) to beat GPT-2's CORE score on 8×H100. The leaderboard is very much inspired by the one in the modded-nanogpt repo, except our target is CORE score instead of validation loss, and our goal is GPT-2 specifically. Contributions to improve on this are welcome! Most of your work will probably be in only one of 3 files: base_train.py (main driver), gpt.py (arch), and optim.py (optimizer), though it's possible that gains can be made by tuning the dataset or the tokenizer as well.

The Jan 29 Model: Architecture Deep Dive
A few words on the current record-holding model.
Model Architecture (nanochat/gpt.py)

The basics:
(wte are token embeddings, transformer_matrices are projections inside the transformer (MLP and attention)).
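To give a feel for where those parameters sit, here is a back-of-envelope count. The vocab size (32768) and the 4x MLP expansion are illustrative assumptions, not nanochat's actual config; the width follows the depth × 64 rule described later in the post.

```python
depth = 24                   # the "d24" size mentioned below
dim = depth * 64             # 1536, per the depth scaling rule
vocab = 32_768               # assumed placeholder vocab size

wte_params = vocab * dim                 # token embedding table
attn_params = 4 * dim * dim              # Q, K, V, O projections per layer
mlp_params = 8 * dim * dim               # up + down proj, assuming 4x expansion
matrix_params = depth * (attn_params + mlp_params)

print(f"wte: {wte_params/1e6:.1f}M params, matrices: {matrix_params/1e6:.1f}M params")
```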
Departures from vanilla Transformer:
- RoPE instead of learned positional embeddings. Standard now, but worth noting. Base theta 10,000, computed once and cached.
- RMSNorm everywhere, no learnable params. Just F.rms_norm(x, (x.size(-1),)). No gamma/beta. Applied after embedding, before each attention/MLP, and before lm_head.
- QK normalization. After applying RoPE to Q and K, we normalize them: q, k = norm(q), norm(k). Stabilizes attention without softcapping the attention weights.
- Untied embedding/unembedding. wte and lm_head are separate parameters with different initializations and learning rates.
- ReLU² activation. F.relu(x).square() instead of GELU. Sparse and cheap.
- Logit softcapping. 15 * tanh(logits / 15) bounds logits to [-15, 15]. Computed in float32.
- Sliding window attention. Pattern SSSL = 3 short-window layers (1024 tokens), 1 long-window layer (2048 tokens), tiled across depth. Final layer always full context. I first saw this in the GPT-3 paper. Flash Attention 3 makes this very efficient with its support for the window_size kwarg.
- Value Embeddings (VE). At alternating layers, we add a gated value embedding to the V tensor. These add massive parameter count (~150M for d24) at near-zero FLOPs.
- Per-layer residual scalars. Two learnable scalars per layer: x = λ_resid * x + λ_x0 * x0, where x0 is the initial normalized embedding. resid_lambdas init to 1.0, x0_lambdas init to 0.1.
- Flash Attention 3. Native (B, T, H, D) layout. Falls back to PyTorch SDPA on non-Hopper GPUs.

Optimizer (nanochat/optim.py)

Split optimizer design: AdamW for embeddings/scalars, Muon for weight matrices.

AdamW groups:
- lm_head: lr=0.004, scaled by 1/√(dim/768)
- wte + value_embeds: lr=0.3, same scaling
- resid_lambdas: lr=0.005 (scalar_lr × 0.01)
- x0_lambdas: lr=0.5, beta1=0.96 (higher than default 0.8)

Muon for matrix params.
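A minimal sketch of that AdamW/Muon split, selecting parameter groups by shape. This is hypothetical code, not nanochat's optim.py, which also special-cases lm_head and assigns the per-group learning rates listed above:

```python
import torch.nn as nn

def split_param_groups(model: nn.Module):
    # Embedding tables are 2-D but belong to AdamW, so collect them first.
    embed_ids = {id(m.weight) for m in model.modules()
                 if isinstance(m, nn.Embedding)}
    muon, adamw = [], []
    for p in model.parameters():
        if p.ndim == 2 and id(p) not in embed_ids:
            muon.append(p)       # weight matrices -> Muon
        else:
            adamw.append(p)      # embeddings, biases, scalars -> AdamW
    return muon, adamw

# Tiny demo: one embedding table, one weight matrix, one bias vector.
demo = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 16))
muon, adamw = split_param_groups(demo)
```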
Muon internals (see muon_step_fused): weight decay is applied only where grad * param >= 0, on a linear schedule to zero.

Distributed optimizer (DistMuonAdamW).

Training Script (scripts/base_train.py)

Key hyperparameters:
Data pipeline: documents are delimited by the <|bos|> token.

Scaling via depth: the --depth flag is the single knob. Everything else derives from it: model_dim = depth × 64, num_heads = model_dim / 128.

The majority of these optimizations have been cherry-picked and adapted from the modded-nanogpt repo. Not all of the things in modded-nanogpt worked for nanochat, and based on some recent chatter, vice versa :)
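The depth rule above in code, with a couple of spot checks:

```python
def dims_from_depth(depth: int) -> tuple[int, int]:
    # --depth is the single knob; width and head count derive from it.
    model_dim = depth * 64
    num_heads = model_dim // 128   # i.e. head_dim is fixed at 128
    return model_dim, num_heads

print(dims_from_depth(12))   # (768, 6), a small model for fast iteration
print(dims_from_depth(24))   # (1536, 12), a d24-sized model
```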
The Optimization Journey
We started with a vanilla Transformer (learned positional embeddings, LayerNorm, GELU, AdamW, Flash Attention 2). Here's what changed.
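One of those swaps in code: GELU out, ReLU² in, as described in the architecture section above.

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
relu2 = F.relu(x).square()   # ReLU^2: zero for x <= 0, x^2 for x > 0
gelu = F.gelu(x)             # the vanilla activation it replaced

# ReLU^2 is exactly zero on the negative half, hence sparse activations.
assert torch.all(relu2[x <= 0] == 0)
```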
What Worked
- Sliding window attention with the SSSL pattern. Compute savings without quality loss.
- Per-layer residual scalars x = λ_resid * x + λ_x0 * x0. Consistent improvement across all model sizes (0.003-0.01 bpb).
- x0_beta1=0.96 is optimal at d20. Key lesson: small-scale tuning doesn't transfer. Validate at target scale.

What Didn't Work
See dev/LOG.md for detailed experiment notes on each. Note that it is very difficult (/impossible) to rule out an idea; sometimes you have to try multiple times. I'm only chronicling some of the things that worked and didn't work out of the box, trying with at most a medium amount of effort.

Reproduce
Here is how I trained the Jan29 model on commit 348fbb3. Boot up your 8XH100 node (e.g. from Lambda), run the setup (see runs/speedrun.sh; you can just run the commands individually one by one to set up the environment, download the data shards, and train the tokenizer), then run pretraining like this:

Wait 3 hours to see:
See the runs/speedrun.sh script for a more detailed reference.

If you don't have hundreds of hours to spend on training GPT-2, you can experiment and find improvements at much smaller scales, e.g. just use --depth=12 to train a d12 (it trains in only ~5 minutes), or try a d16. A lot of my iteration is at a smaller scale, and many (but not all!) ideas that work there transfer to the bigger models.

Discord
Come talk about further improvements in #nanochat on our Discord, or try the alternative link.

Acknowledgements
This work builds heavily on modded-nanogpt. Many winning ideas originated there: Muon improvements, Value Embeddings, per-layer scalars, etc. Thanks to HuggingFace for FineWeb-edu, to Tri Dao and friends for FA3 kernels, Lambda for compute, and everyone who contributed.
Samples fun
For fun, here are some of the samples from the model, printed by base_eval.py.

First, conditional samples. Prompts are:

The samples become:
So the model has pretty decent knowledge! Unconditional samples: