Beating GPT-2 for <$100: the nanochat journey #481
Replies: 22 comments · 17 replies
-
This is really impressive engineering work — thanks for writing it up so clearly. One thing I keep wondering about in these “time-to-GPT-2” style experiments: my hunch is that some of the remaining fragility or spread we attribute to scaling efficiency might actually sit upstream, in what enters the training pipeline and under what human-side constraints, rather than in the training dynamics themselves. Curious if you’ve seen anything along these lines, or if this is something you’ve intentionally held fixed.
-
I hope you will explain to us, or build a full course on, multi-GPU / multi-node distributed training.
-
Kudos to sensei 🥹
-
🫡 $43K to $73 in 7 years. At ~2.5x per year cost decline, the barrier to learning ML is lowering fast.
-
Thank you so much! Is anyone willing to share the base model checkpoint? I would like to do some finetuning.
-
@karpathy Why are there no learnable parameters in rms_norm?
-
@karpathy it’d be great to explicitly document the data mixture, effective token count, scaling regime, and loss vs compute curves to make the GPT-2 comparison more rigorous.
-
Does anybody have a rough idea of how this compares in terms of average power consumption over the total training time? How does a cluster of 32 TPU v3s compare with an 8xH100 node?
-
@karpathy - any reason you didn't try µP (https://github.com/microsoft/mup) to parametrize the model and transfer optimal hyperparameters from smaller variants?
-
Great job!
-
Really impressive work, thanks for sharing the repo. Would love to see a tutorial or a YouTube walkthrough of this at some point.
-
If Value Embeddings act as high-capacity, zero-FLOP anchors at alternating layers, are we effectively moving away from a 'pure' Transformer toward a hybrid memory-compute architecture? I'm curious if this indicates that at low token counts, the model needs to 'memorize' the manifold structure via VE because it doesn't have enough gradient steps to 'learn' the logic via the Muon-constrained weights. Another doubt I had: Muon's Polar Express orthogonalization keeps weights on the Stiefel manifold, but does this create a 'representation bottleneck' for non-linear feature emergence? Awesome work @karpathy!
-
Even if we can't get cheaper than this for nanochat, the concept of such a resource-efficient ML model can be applied in so many other contexts. Great job, man!
-
Wow
-
@karpathy Could you share which methods didn't scale, i.e. worked at a smaller scale but not on bigger models? It would be quite interesting to understand which areas can be experimented with on a smaller budget. I only saw the general section on what worked and what didn't.
-
@karpathy Hi Karpathy, I'm one of the authors of NorMuon. I noticed that in the Optimizer section you omitted NorMuon and instead described it directly as "Factored variance reduction (Adafactor-style)." Although NorMuon and Adafactor share some similarities in form, their motivations and effects can be quite different. Adafactor reduces memory cost by using a low-rank approximation to fit the second-order moment matrix used by Adam for preconditioning. In contrast, Muon itself no longer relies on second-order moments for preconditioning; instead, preconditioning is performed implicitly via orthogonalization. We appreciate the summary of "Factored variance reduction" and the analogy to Adafactor, but we would like to ask whether you could add a reference to NorMuon in the corresponding place. This would allow readers to refer to our work and understand our original motivation, rather than mistakenly assuming this is a low-rank preconditioning approximation similar to Adafactor. Thank you!
-
Awesome! GOOD!
-
I've been training some GPT-ish models myself following your inspiration @karpathy. The fact that you have been able to train a GPT-2 model so cheaply is absolutely outstanding work. Congratulations on the achievement!
-
Just pulled the updates yesterday - very cool stuff, but on the other hand the further the architecture gets from vanilla, the more awkward it is to use as a base for other architectural experiments (which is what I found most attractive about the project - I like to learn by breaking stuff lol). Most of the changes didn't affect my private ramblings (on the contrary, efficiency boosts are welcome), but the value embeddings blindsided me and threw a rather large spanner in my works. Easy enough to disable by having has_ve() always return False, but I get the feeling it will also throw the training run parameters out for equivalent performance/accuracy. As a minimal change I've reverted to the old Chinchilla 20x ratio for the baseline I'm comparing my modifications against; I guess I'll see soonish if that works out equivalent to the lower token/params ratio that was being used with VE... but should I be adjusting anything else to compensate for turning VE off?
-
How about GPT-3 and GPT-4? Any ideas/plans?
-
Can someone please start toying with the following?


-
When OpenAI released GPT-2 in February 2019, training the largest model (1.5B parameters) required serious compute:
32 (TPU v3 chips) × 168 (hours) × $8 (per chip-hour) = $43,008 ≈ $43,000. Sources: a Reddit thread from 2019 and the HuggingFace model card.
Beating GPT-2 for <$100 from scratch has been a bit of an odd obsession for me, but finally here we are. Seven years later, we can beat GPT-2's performance with nanochat (~1000 lines of code) running on a single 8XH100 GPU node for ~3 hours. At ~$24/hour for an 8×H100 node, that's $73, i.e. a ~600× cost reduction. That is, each year the cost to train GPT-2 falls to approximately 40% of the previous year's. (I think this is an underestimate and that further improvements are still quite possible.) The gains come from everywhere: better hardware (H100 vs TPU v3), better software (Flash Attention 3, torch.compile), better algorithms (Muon optimizer, architectural improvements), and better data (FineWeb-edu).
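The cost arithmetic above is easy to check directly (numbers as quoted in the post):

```python
# ~$43,000 (2019) down to ~$73 (seven years later) to match GPT-2.
old_cost, new_cost, years = 43_000, 73, 7

reduction = old_cost / new_cost        # overall reduction, ~589x (i.e. ~600x)
annual = reduction ** (1 / years)      # ~2.5x cheaper every year
retained = 1 / annual                  # cost falls to ~40% of the prior year

print(f"~{reduction:.0f}x total, {annual:.2f}x/year, retains {retained:.0%}/year")
```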
Above: a nicely uneventful run of training a GPT-2 capability model, this one even a little bit better after tuning the warmdown ratio slightly from 0.4 to 0.5. The training time on the x-axis appears a bit longer on wandb because it includes inline evaluation.
The Goal
Our target is the CORE metric from the DCLM paper—a comprehensive evaluation across 22 high-quality benchmarks. GPT-2's CORE score is 0.256525. I introduced a new leaderboard to track how long it takes to reach this performance:
The leaderboard tracks wall-clock training time (excluding eval/logging) to beat GPT-2's CORE score on 8×H100. The leaderboard is very much inspired by the one in the modded-nanogpt repo, except our target is CORE score instead of validation loss, and our goal is GPT-2 specifically. Contributions to improve on this are welcome! Most of your work will probably be in only one of 3 files: base_train.py (main driver), gpt.py (arch), and optim.py (optimizer), though it's possible that gains can be made by tuning the dataset or the tokenizer as well.

The Jan 29 Model: Architecture Deep Dive
A few words on the current record-holding model.
Model Architecture (nanochat/gpt.py)

The basics:
(wte are token embeddings, transformer_matrices are projections inside the transformer (MLP and attention)).
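To give a feel for where those parameters sit, here is a back-of-envelope count. The vocab size (32768) and the 4x MLP expansion are illustrative assumptions, not nanochat's actual config; the width follows the depth × 64 rule described later in the post.

```python
depth = 24                   # the "d24" size mentioned below
dim = depth * 64             # 1536, per the depth scaling rule
vocab = 32_768               # assumed placeholder vocab size

wte_params = vocab * dim                 # token embedding table
attn_params = 4 * dim * dim              # Q, K, V, O projections per layer
mlp_params = 8 * dim * dim               # up + down proj, assuming 4x expansion
matrix_params = depth * (attn_params + mlp_params)

print(f"wte: {wte_params/1e6:.1f}M params, matrices: {matrix_params/1e6:.1f}M params")
```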
Departures from vanilla Transformer:
- RoPE instead of learned positional embeddings. Standard now, but worth noting. Base theta 10,000, computed once and cached.
- RMSNorm everywhere, no learnable params. Just F.rms_norm(x, (x.size(-1),)). No gamma/beta. Applied after embedding, before each attention/MLP, and before lm_head.
- QK normalization. After applying RoPE to Q and K, we normalize them: q, k = norm(q), norm(k). Stabilizes attention without softcapping the attention weights.
- Untied embedding/unembedding. wte and lm_head are separate parameters with different initializations and learning rates.
- ReLU² activation. F.relu(x).square() instead of GELU. Sparse and cheap.
- Logit softcapping. 15 * tanh(logits / 15) bounds logits to [-15, 15]. Computed in float32.
- Sliding window attention. Pattern SSSL = 3 short-window layers (1024 tokens), 1 long-window layer (2048 tokens), tiled across depth. Final layer always full context. I first saw this in the GPT-3 paper. Flash Attention 3 makes this very efficient with its support for the window_size kwarg.
- Value Embeddings (VE). At alternating layers, we add a gated value embedding to the V tensor. These add massive parameter count (~150M for d24) at near-zero FLOPs.
- Per-layer residual scalars. Two learnable scalars per layer: x = λ_resid * x + λ_x0 * x0, where x0 is the initial normalized embedding. resid_lambdas init to 1.0, x0_lambdas init to 0.1.
- Flash Attention 3. Native (B, T, H, D) layout. Falls back to PyTorch SDPA on non-Hopper GPUs.

Optimizer (nanochat/optim.py)

Split optimizer design: AdamW for embeddings/scalars, Muon for weight matrices.

AdamW groups:
- lm_head: lr=0.004, scaled by 1/√(dim/768)
- wte + value_embeds: lr=0.3, same scaling
- resid_lambdas: lr=0.005 (scalar_lr × 0.01)
- x0_lambdas: lr=0.5, beta1=0.96 (higher than default 0.8)

Muon for matrix params.
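A minimal sketch of that AdamW/Muon split, selecting parameter groups by shape. This is hypothetical code, not nanochat's optim.py, which also special-cases lm_head and assigns the per-group learning rates listed above:

```python
import torch.nn as nn

def split_param_groups(model: nn.Module):
    # Embedding tables are 2-D but belong to AdamW, so collect them first.
    embed_ids = {id(m.weight) for m in model.modules()
                 if isinstance(m, nn.Embedding)}
    muon, adamw = [], []
    for p in model.parameters():
        if p.ndim == 2 and id(p) not in embed_ids:
            muon.append(p)       # weight matrices -> Muon
        else:
            adamw.append(p)      # embeddings, biases, scalars -> AdamW
    return muon, adamw

# Tiny demo: one embedding table, one weight matrix, one bias vector.
demo = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 16))
muon, adamw = split_param_groups(demo)
```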
Muon internals (see muon_step_fused): weight decay is applied only where grad * param >= 0, on a linear schedule to zero.

Distributed optimizer (DistMuonAdamW).

Training Script (scripts/base_train.py)

Key hyperparameters:
Data pipeline: documents are delimited by the <|bos|> token.

Scaling via depth: the --depth flag is the single knob. Everything else derives from it: model_dim = depth × 64, num_heads = model_dim / 128.

The majority of these optimizations have been cherry-picked and adapted from the modded-nanogpt repo. Not all of the things in modded-nanogpt worked for nanochat, and based on some recent chatter, vice versa :)
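The depth rule above in code, with a couple of spot checks:

```python
def dims_from_depth(depth: int) -> tuple[int, int]:
    # --depth is the single knob; width and head count derive from it.
    model_dim = depth * 64
    num_heads = model_dim // 128   # i.e. head_dim is fixed at 128
    return model_dim, num_heads

print(dims_from_depth(12))   # (768, 6), a small model for fast iteration
print(dims_from_depth(24))   # (1536, 12), a d24-sized model
```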
The Optimization Journey
We started with a vanilla Transformer (learned positional embeddings, LayerNorm, GELU, AdamW, Flash Attention 2). Here's what changed.
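One of those swaps in code: GELU out, ReLU² in, as described in the architecture section above.

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
relu2 = F.relu(x).square()   # ReLU^2: zero for x <= 0, x^2 for x > 0
gelu = F.gelu(x)             # the vanilla activation it replaced

# ReLU^2 is exactly zero on the negative half, hence sparse activations.
assert torch.all(relu2[x <= 0] == 0)
```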
What Worked
- Sliding window attention with the SSSL pattern. Compute savings without quality loss.
- Per-layer residual scalars x = λ_resid * x + λ_x0 * x0. Consistent improvement across all model sizes (0.003-0.01 bpb).
- x0_beta1=0.96 is optimal at d20. Key lesson: small-scale tuning doesn't transfer. Validate at target scale.

What Didn't Work
See dev/LOG.md for detailed experiment notes on each. Note that it is very difficult (/impossible) to rule out an idea; sometimes you have to try multiple times. I'm only chronicling some of the things that worked and didn't work out of the box, trying with at most a medium amount of effort.

Reproduce
Here is how I trained the Jan29 model on commit 348fbb3. Boot up your 8XH100 node (e.g. from Lambda), run the setup (see runs/speedrun.sh; you can just run the commands individually one by one to set up the environment, download the data shards, and train the tokenizer), then run pretraining like this:

Wait 3 hours to see:
See the runs/speedrun.sh script for a more detailed reference.

If you don't have hundreds of hours to spend on training GPT-2, you can experiment and find improvements at much smaller scales, e.g. just use --depth=12 to train a d12 (it trains in only ~5 minutes), or try a d16. A lot of my iteration is at a smaller scale, and many (but not all!) ideas that work there transfer to the bigger models.

Discord
Come talk about further improvements in #nanochat on our Discord, or try the alternative link.

Acknowledgements
This work builds heavily on modded-nanogpt. Many winning ideas originated there: Muon improvements, Value Embeddings, per-layer scalars, etc. Thanks to HuggingFace for FineWeb-edu, to Tri Dao and friends for FA3 kernels, Lambda for compute, and everyone who contributed.
Samples fun
For fun, here are some of the samples from the model, printed by base_eval.py.

First, conditional samples. Prompts are:

The samples become:
So the model has pretty decent knowledge! Unconditional samples: