microgpt (optimized + CUDA)

A minimal GPT project with two aligned implementations:

microgpt.py: pure Python, dependency-free reference implementation.
microgpt_cuda.cu: CUDA/C++ implementation for Windows (MSVC + CUDA), optimized for speed while keeping the same model/training logic.

What this repo focuses on

Keep the project small and readable.
Preserve algorithmic parity between Python and CUDA paths.
Push performance through GPU residency and kernel fusion where it matters.

Core model/training recipe (both paths):

Character tokenizer with <BOS>.
GPT-style block with RMSNorm, causal multi-head attention, and ReLU^2 MLP.
Weight tying (wte reused as LM head).
AdamW + cosine LR + global grad clipping.
Train/val split, periodic validation, top-k sampling inference.

Repository layout

microgpt.py: full Python algorithm (train + val + inference).
microgpt_cuda.cu: full CUDA/C++ algorithm (train + val + inference).
microgpt_optimized.html: side-by-side Python/CUDA code converter view.
CMakeLists.txt: CUDA build entry.
input.txt: corpus (auto-downloaded if missing on first run).

Quick start (Python)

python microgpt.py

If input.txt is missing, the script downloads the default names dataset automatically.

Quick start (CUDA / Windows)

Prerequisites:

NVIDIA GPU + compatible driver
CUDA Toolkit (your setup: CUDA 13.1)
Visual Studio 2022 (MSVC, x64 toolchain)
CMake 3.24+

Build:

cmake -S . -B build -G "Visual Studio 17 2022" -A x64 -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release

Run:

.\build\Release\microgpt_cuda.exe --help
.\build\Release\microgpt_cuda.exe

Smoke test:

.\build\Release\microgpt_cuda.exe --steps 5 --samples 3

CUDA CLI options

--steps <int>: training steps (default 500)
--val-every <int>: validation interval (default 100)
--val-docs <int>: max validation docs per eval (default 20)
--samples <int>: generated samples after training (default 20)
--top-k <int>: top-k for sampling (default 5)
--temperature <float>: sampling temperature (default 0.6)
--seed <int>: RNG seed (default 42)

Important implementation notes

CUDA path keeps parameters, gradients, and optimizer states on GPU.
Training step is fused into one kernel launch (forward + backward + grad clip + AdamW update).
Current fused implementation is specialized to n_layer = 1 (same as current Python config).
kMaxVocab = 256 in microgpt_cuda.cu; if your dataset exceeds this, increase it and rebuild.
Default CMAKE_CUDA_ARCHITECTURES is 86; set it to your GPU architecture when needed.

Code converter page

Open microgpt_optimized.html in a browser to switch between:

Python view
CUDA view
Bilingual side-by-side comparison

This is useful for checking one-to-one conceptual mapping between the two codebases.

Credits

Original microgpt idea and baseline by @karpathy:

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
assets		assets
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
input.txt		input.txt
microgpt.py		microgpt.py
microgpt_cuda.cu		microgpt_cuda.cu
microgpt_optimized.html		microgpt_optimized.html
microgpt_zh.py		microgpt_zh.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

microgpt (optimized + CUDA)

What this repo focuses on

Repository layout

Quick start (Python)

Quick start (CUDA / Windows)

CUDA CLI options

Important implementation notes

Code converter page

Credits

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

microgpt (optimized + CUDA)

What this repo focuses on

Repository layout

Quick start (Python)

Quick start (CUDA / Windows)

CUDA CLI options

Important implementation notes

Code converter page

Credits

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages