A clean, educational implementation of GPT-2 built from scratch using PyTorch. This project demonstrates the architecture and training of transformer-based language models.
- Complete GPT-2 Architecture: Full implementation including multi-head attention, feed-forward layers, and layer normalization
- Multiple Attention Mechanisms: Self-attention, causal attention, and multi-head attention
- Training Pipeline: Complete training loop with validation and loss tracking
- Text Generation: Sampling with temperature and top-k filtering
- Pretrained Weights: Download and use official OpenAI GPT-2 weights
- Educational Examples: Clear demonstrations of attention mechanisms and model usage
```
llm-from-scratch-gpt2/
├── src/                          # Main source code
│   ├── models/                   # Model architectures
│   │   ├── gpt2.py               # Main GPT-2 model
│   │   └── block.py              # Transformer block
│   ├── layers/                   # Neural network layers
│   │   ├── multihead_attention.py
│   │   ├── causal_attention.py
│   │   ├── self_attention.py
│   │   ├── feed_forward.py
│   │   ├── layer_norm.py
│   │   └── rms_norm.py
│   ├── data/                     # Data handling
│   │   ├── data_loader.py
│   │   └── tokenization.py
│   ├── training/                 # Training utilities
│   │   ├── train.py
│   │   └── loss.py
│   └── utils/                    # Utility functions
│       ├── text_generation.py
│       └── model_utils.py
├── scripts/                      # Executable scripts
│   ├── train_gpt2.py             # Training script
│   └── download_gpt2_weights.py  # Download pretrained weights
├── examples/                     # Usage examples
│   ├── basic_usage.py
│   └── attention_demo.py
├── data/                         # Training data
│   └── verdict.txt
└── gpt2/                         # Downloaded GPT-2 weights (gitignored)
```
- Clone the repository:

```bash
git clone git@github.com:vivek12345/gpt2-from-scratch.git
cd gpt2-from-scratch
```

- Install dependencies (using uv or pip):

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt
```

Example usage:

```python
import tiktoken
import torch
from src.models.gpt2 import GPTModel
from src.utils.text_generation import generate_text_simple
from src.data.tokenization import token_ids_to_text
# Initialize model
GPT2_CONFIG = {
    "vocab_size": 50257,
    "emb_dim": 768,
    "context_length": 1024,
    "num_layers": 12,
    "num_heads": 12,
    "qkv_bias": True,
    "dropout": 0.1
}
model = GPTModel(GPT2_CONFIG)
tokenizer = tiktoken.get_encoding("gpt2")
# Generate text
output = generate_text_simple(
    model=model,
    tokenizer=tokenizer,
    start_context="Once upon a time",
    context_length=GPT2_CONFIG["context_length"],
    max_new_tokens=50,
    temperature=1.0
)
print(token_ids_to_text(output, tokenizer))
```

Train your own GPT-2 model:

```bash
uv run scripts/train_gpt2.py
```

Download official OpenAI GPT-2 weights:

```bash
uv run scripts/download_gpt2_weights.py
```

Run the basic usage example:

```bash
uv run examples/basic_usage.py
```

It shows:
- Forward pass through the model
- Text generation with sampling
- Model statistics
Run the attention demo:

```bash
uv run examples/attention_demo.py
```

It demonstrates:
- Self-attention (no masking)
- Causal attention (masked)
- Multi-head attention
- Token & Positional Embeddings: Convert tokens to dense vectors with position information
- Transformer Blocks (x12):
- Multi-head self-attention with causal masking
- Layer normalization
- Feed-forward network (expand to 4x, then project back)
- Residual connections
- Final Layer Norm: Stabilize outputs
- Output Projection: Map to vocabulary logits (a simplified sketch of the full pass follows this list)
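The forward pass described above can be sketched in a few dozen lines. The sketch below is a simplified stand-in, not the code in `src/models/gpt2.py`: the class names are hypothetical, and it leans on PyTorch's built-in `nn.LayerNorm` and `nn.MultiheadAttention` instead of the from-scratch layers in `src/layers/`.

```python
import torch
import torch.nn as nn

class MiniBlock(nn.Module):
    """Pre-norm transformer block: LN -> causal multi-head attention -> residual, LN -> FFN -> residual."""
    def __init__(self, emb_dim, num_heads, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.ffn = nn.Sequential(               # expand to 4x the embedding dim, then project back
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=causal)
        x = x + attn_out                        # residual connection around attention
        x = x + self.ffn(self.ln2(x))           # residual connection around the feed-forward network
        return x

class MiniGPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.blocks = nn.ModuleList(
            [MiniBlock(cfg["emb_dim"], cfg["num_heads"], cfg["dropout"]) for _ in range(cfg["num_layers"])]
        )
        self.final_ln = nn.LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, token_ids):               # (batch, seq_len) -> (batch, seq_len, vocab_size)
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)   # token + positional embeddings
        for block in self.blocks:
            x = block(x)
        return self.out_head(self.final_ln(x))  # final layer norm, then project to vocabulary logits

# Tiny demo config just for this sketch (the real 124M config is listed below)
cfg = {"vocab_size": 50257, "emb_dim": 96, "context_length": 64,
       "num_layers": 2, "num_heads": 4, "dropout": 0.1}
logits = MiniGPT(cfg)(torch.randint(0, cfg["vocab_size"], (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 50257])
```

The pre-norm layout (normalize before attention and the FFN) matches GPT-2; dropout is only applied inside attention here to keep the sketch short.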
The model uses the standard GPT-2 (124M) configuration:
- Vocabulary Size: 50,257 tokens
- Embedding Dimension: 768
- Context Length: 1024 tokens
- Transformer Layers: 12
- Attention Heads: 12 (64 dimensions each)
- Feed-Forward Hidden Size: 3072 (4 × 768; see the short check below)
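The last two values are derived from the embedding dimension and head count, as this short check illustrates:

```python
emb_dim, num_heads = 768, 12

head_dim = emb_dim // num_heads      # 768 / 12 = 64 dimensions per attention head
ffn_hidden = 4 * emb_dim             # 4 * 768 = 3072 feed-forward hidden units
assert (head_dim, ffn_hidden) == (64, 3072)
```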
The project includes three attention implementations:
- Self-Attention: Basic attention without masking (educational)
- Causal Attention: Masked attention for autoregressive generation
- Multi-Head Attention: Parallel attention heads for richer representations (a minimal sketch follows this list)
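The first two variants differ only in a triangular mask applied to the attention scores before the softmax; multi-head attention then runs several such heads in parallel on different learned projections and concatenates the results. A minimal single-head sketch (illustrative only, not the code in `src/layers/`):

```python
import torch

def single_head_attention(q, k, v, causal=False):
    """Scaled dot-product attention; with causal=True, token i may only attend to tokens 0..i."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5            # (seq_len, seq_len) similarity scores
    if causal:
        seq_len = scores.size(-1)
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))   # hide future positions
    weights = torch.softmax(scores, dim=-1)                  # attention weights sum to 1 per row
    return weights @ v

x = torch.randn(6, 64)                                    # 6 tokens, one 64-dim head
out_self = single_head_attention(x, x, x)                 # self-attention: every token sees every token
out_causal = single_head_attention(x, x, x, causal=True)  # causal: no peeking at future tokens
```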
- Cross-Entropy Loss: Standard language modeling objective
- AdamW Optimizer: Weight decay for regularization
- Learning Rate: 3e-4 (adjustable)
- Gradient Accumulation: Configurable effective batch sizes
- Validation: Track train/val loss during training (a sketch of one training step follows this list)
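Putting these pieces together, a training step might look like the sketch below. It is an illustration, not `src/training/train.py`: the tiny stand-in model and random batches are only there so the snippet runs on its own, and the weight-decay and accumulation values are placeholders (the 3e-4 learning rate matches the list above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins so the sketch is self-contained; in the project these would be the GPT model
# and the DataLoaders built by src/data/data_loader.py.
vocab_size, context_len = 50257, 256
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
train_batches = [(torch.randint(0, vocab_size, (2, context_len)),
                  torch.randint(0, vocab_size, (2, context_len))) for _ in range(8)]
val_batches = train_batches[:2]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
accum_steps = 4                                  # gradient accumulation: 4x larger effective batch

model.train()
for step, (inputs, targets) in enumerate(train_batches):
    logits = model(inputs)                       # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    (loss / accum_steps).backward()              # scale so accumulated gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

model.eval()
with torch.no_grad():                            # track validation loss alongside training loss
    val_loss = sum(F.cross_entropy(model(x).flatten(0, 1), y.flatten()).item()
                   for x, y in val_batches) / len(val_batches)
print(f"val loss: {val_loss:.3f}")
```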
- Temperature Sampling: Control randomness (0 = greedy, >1 = more random)
- Top-K Sampling: Sample from top K most likely tokens
- Early Stopping: Generation can halt at an optional end-of-sequence token (see the sampling sketch below)
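A minimal sketch of this sampling logic (not the project's `generate_text_simple`; the function name and arguments are illustrative):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick the next token ID from a 1-D logits vector over the vocabulary."""
    if temperature == 0:                                   # temperature 0 -> greedy decoding
        return torch.argmax(logits).item()
    if top_k is not None:                                  # keep only the top-k most likely tokens
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits / temperature, dim=-1)    # higher temperature -> flatter distribution
    return torch.multinomial(probs, num_samples=1).item()

next_id = sample_next_token(torch.randn(50257), temperature=1.0, top_k=50)
# Inside a generation loop, early stopping just means breaking once the optional
# end-of-sequence token is sampled, e.g.: if next_id == eos_id: break
```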
- Total Parameters: ~124M for the GPT-2 small configuration (the size OpenAI originally reported as 117M); see the check after this list
- Model Size: ~475 MB (float32)
- Context Window: 1024 tokens
- Training Speed: ~X tokens/second (depends on hardware)
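These figures can be reproduced from an instantiated model. A minimal check, reusing the quick-start import and config; the exact count depends on implementation details such as whether the output head shares weights with the token embedding:

```python
from src.models.gpt2 import GPTModel   # same import as in the quick-start example

GPT2_CONFIG = {"vocab_size": 50257, "emb_dim": 768, "context_length": 1024,
               "num_layers": 12, "num_heads": 12, "qkv_bias": True, "dropout": 0.1}

model = GPTModel(GPT2_CONFIG)
total_params = sum(p.numel() for p in model.parameters())
size_mb = total_params * 4 / (1024 ** 2)        # float32 uses 4 bytes per parameter
print(f"{total_params / 1e6:.1f}M parameters, ~{size_mb:.0f} MB in float32")
```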
To try everything end to end:

```bash
# Test basic usage
uv run examples/basic_usage.py

# Test attention mechanisms
uv run examples/attention_demo.py

# Test training loop
uv run scripts/train_gpt2.py
```

The project follows:
- Clear, educational code with comments
- Type hints where helpful
- Modular architecture
- Docstrings for all functions
This project is open source and available under the MIT License.
- Inspired by Sebastian Raschka's "Build a Large Language Model from Scratch"
- OpenAI for the original GPT-2 architecture and weights
- The PyTorch team for the excellent framework
For questions or suggestions, please open an issue on GitHub.
- Attention Is All You Need - Original Transformer paper
- Language Models are Unsupervised Multitask Learners - GPT-2 paper
- The Illustrated GPT-2 - Visual guide to GPT-2