This repository, the HLD Architecture Handbook, is a self-paced learning guide to High-Level Design (HLD) and large-scale system architecture. Each topic pairs an intuitive definition with an in-depth explanation, and the later sections apply those concepts in structured design challenges. The goal is to help you understand the 'Why' behind every architectural choice: the trade-offs, constraints, and future-proofing considerations involved in building systems at scale.
Audience: Engineers with basic programming knowledge looking to transition from small-scale development to designing highly scalable, reliable, and performant distributed systems.
The content is organized into three progressive categories:
| Folder | Category Name | Focus |
|---|---|---|
| 01-principles | Core Principles | Core theoretical concepts: Scale, Availability, CAP Theorem, and foundational architecture styles. |
| 02-components | Components Deep Dive | In-depth analysis of specialized databases, caching, sharding, messaging, and concurrency control. |
| 03-challenges | Design Challenges | Real-world design problems (e.g., URL Shortener, Twitter, E-commerce Flash Sale) applying the concepts learned in the first two categories. |
| system-design-reference.md | Quick Reference Guide | Latency numbers, comparison tables, formulas, and decision matrices. |
| resources-and-further-reading.md | Learning Resources | Books, papers, courses, blogs, and tools for continued learning. |
| README.md | (This File) | The main project index and roadmap. |
Each completed design challenge includes 6 files:
03-challenges/
└── 3.x.y-problem-name/
├── README.md # FULL comprehensive guide (primary document, replaces main design file)
├── quick-overview.md # Quick revision guide with core concepts, architecture flows, key takeaways
├── hld-diagram.md # System architecture diagrams (10-15 Mermaid diagrams)
├── sequence-diagrams.md # Detailed interaction flows (10-15 Mermaid diagrams)
├── this-over-that.md # In-depth design decisions & trade-offs analysis
└── pseudocode.md # Detailed algorithm implementations (10-20 functions)
Benefits:
- 📊 Visual Learning: 20-30 interactive Mermaid diagrams per challenge for system architecture and sequence flows
- 📁 Better Organization: Separate theory, visuals, design decisions, and implementations
- 🔗 Easy Navigation: README links directly to all supplementary files
- 🎨 Maintainable: Text-based diagrams and pseudocode that are version-controlled
- 🚀 GitHub Native: Renders beautifully in GitHub without external tools
- 🧠 Deep Understanding: Detailed "This Over That" analysis explains WHY each architectural choice was made
- 💻 Implementation Ready: Comprehensive pseudocode with time complexity analysis
- 📖 Quick Revision: quick-overview.md provides concise summaries for fast review
We will cover the following topics in sequence before moving to the Design Challenges.
Category 1: Core Principles (Folder: 01-principles)
| Topic ID | Concept |
|---|---|
| 1.1.1 | CAP Theorem |
| 1.1.2 | Latency, Throughput, and Scaling |
| 1.1.3 | Availability and Reliability |
| 1.1.4 | Data Consistency Models |
| 1.1.5 | Back-of-the-Envelope Calculations |
| 1.1.6 | Failure Modes and Fault Tolerance |
| 1.1.7 | Idempotency |
| 1.1.8 | Data Partitioning and Sharding |
| 1.1.9 | Replication Strategies |
| 1.1.10 | Message Delivery Guarantees |
| 1.2.1 | System Architecture Styles |
| 1.2.2 | Networking Components |
| 1.2.3 | API Gateway and Service Mesh |
| 1.2.4 | Domain-Driven Design (DDD) Basics |
| 1.2.5 | Service Discovery |
Category 2: Components Deep Dive (Folder: 02-components)
📁 Organized into 7 logical categories for easier navigation:
- 🌐 Communication (Protocols, APIs, Real-time, Load Balancers, API Gateway, Service Mesh, DNS)
- 🗄️ Databases (20 database deep dives including Object Storage, Time Series, Vector DBs, Distributed SQL & CQRS!)
- ⚡ Caching (Redis, Memcached, Consistent Hashing, CDN)
- 📨 Messaging & Streaming (Kafka, Spark, Flink, Message Queues, Event Sourcing)
- 🔒 Security & Observability (Auth, OAuth/JWT, Monitoring, Prometheus/Grafana, Logging, ELK Stack, Distributed Tracing)
- 🧮 Algorithms (Rate Limiting, Consensus, Locking, Bloom Filters)
- 🏗️ Infrastructure (Kubernetes, Docker, Configuration Management, Infrastructure as Code)
2.0 Communication (Folder: 2.0-communication)
| Topic ID | Concept | Focus |
|---|---|---|
| 2.0.1 | Foundational Communication Protocols | TCP vs. UDP, HTTP/S, WebSockets, WebRTC, DASH. |
| 2.0.2 | API Communication Styles | REST, gRPC, SOAP, GraphQL (Pros, Cons, and Use Cases). |
| 2.0.3 | Real-Time Communication | Comparison of techniques for maintaining persistent or near-persistent connections for real-time updates. |
| 2.0.4 | Load Balancers Deep Dive | Layer 4 vs Layer 7, algorithms, health checks, SSL termination, sticky sessions. |
| 2.0.5 | API Gateway Deep Dive | Request routing, authentication, rate limiting, protocol translation, BFF pattern, service aggregation. |
| 2.0.6 | Service Mesh Deep Dive | Sidecar pattern, mTLS, retries, circuit breakers, traffic management, distributed tracing. |
| 2.0.7 | DNS Deep Dive | DNS resolution, record types, caching, load balancing, geographic routing, failover, DNSSEC. |
2.1 Databases (Folder: 2.1-databases) — 20 Deep Dives
| Topic ID | Concept | Focus |
|---|---|---|
| 2.1.1 | RDBMS Deep Dive: SQL & ACID | Transactions, Isolation Levels, ACID vs. BASE. |
| 2.1.2 | NoSQL Deep Dive: The BASE Principle | Document Stores, Key-Value Stores, Column-Family. |
| 2.1.3 | Specialized Databases | Time-Series, Graph, Geospatial DBs (e.g., Redis Streams, Neo4j). |
| 2.1.4 | Database Scaling | Replication (Master-Slave), Federation, Sharding Strategies. |
| 2.1.5 | Indexing and Query Optimization | B-Trees, LSM-Trees, Denormalization Trade-offs. |
| 2.1.6 | Data Modeling for Scale (CQRS) | Denormalization, Data Decomposition, Command-Query Responsibility Segregation (CQRS). |
| 2.1.7 | PostgreSQL Deep Dive | MVCC, JSONB, Full-Text Search, PostGIS, Advanced Indexing (GIN, BRIN), Replication, Extensions. |
| 2.1.8 | MySQL Deep Dive | InnoDB Storage Engine, MVCC, Replication (Async, Semi-Sync, Group), Indexing (B+Tree), ProxySQL. |
| 2.1.9 | Cassandra Deep Dive | Masterless Architecture, Wide-Column Store, Tunable Consistency, Write Path, Compaction, Multi-DC. |
| 2.1.10 | MongoDB Deep Dive | Document Model (BSON), Embedded vs. Referenced, Aggregation Framework, Sharding, Change Streams. |
| 2.1.11 | Redis Deep Dive | In-Memory Data Structures (Strings, Lists, Sets, Sorted Sets), Persistence (RDB/AOF), Cluster. |
| 2.1.12 | DynamoDB Deep Dive | Serverless NoSQL, Partition/Sort Keys, GSI/LSI, On-Demand vs. Provisioned, Global Tables, Streams. |
| 2.1.13 | Elasticsearch Deep Dive | Inverted Indexes, Full-Text Search, Aggregations, Integration with RDBMS (CDC), Sharding, ILM. |
| 2.1.14 | Neo4j Deep Dive (Graph Databases) | Property Graph Model, Cypher Query Language, Index-Free Adjacency, Graph Algorithms. |
| 2.1.15 | ClickHouse Deep Dive (Columnar) | Columnar Storage, MergeTree Engine, Vectorized Query Execution, OLAP Workloads. |
| 2.1.16 | Object Storage Deep Dive | S3, GCS, Azure Blob, multipart uploads, lifecycle policies, storage classes, CDN integration. |
| 2.1.17 | Time Series Databases Deep Dive | InfluxDB, TimescaleDB, Prometheus, compression, retention policies, downsampling, IoT data. |
| 2.1.18 | Vector Databases Deep Dive | Pinecone, Weaviate, Milvus, FAISS, semantic search, embeddings, k-NN algorithms, AI/ML applications. |
| 2.1.19 | Distributed SQL Databases Deep Dive | CockroachDB, TiDB, Google Spanner, YugabyteDB, Raft consensus, multi-region, ACID at scale. |
| 2.1.20 | CQRS Deep Dive | Command-Query Responsibility Segregation, read/write separation, eventual consistency, multiple read models, synchronization strategies. |
2.2 Caching (Folder: 2.2-caching)
| Topic ID | Concept | Focus |
|---|---|---|
| 2.2.1 | Caching Deep Dive | Cache-Aside, Write-Through, CDN vs. App-Level Cache. |
| 2.2.2 | Consistent Hashing | Algorithm mechanics, Ring implementation, how it minimizes data movement. |
| 2.2.3 | Memcached Deep Dive | In-Memory Key-Value Cache, Slab Allocation, LRU Eviction, Multi-Threading. |
| 2.2.4 | CDN Deep Dive | Content Delivery Networks, edge caching, cache invalidation, push vs pull, global distribution. |
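As a taste of topic 2.2.2, here is a minimal sketch of a consistent-hash ring with virtual nodes. It is illustrative only: the `HashRing` class, the `vnodes` default, the node names, and the use of SHA-256 are assumptions made for this example, not the handbook's reference implementation.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical example)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per physical node (assumed default)
        self._ring = []       # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as a 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring is empty")
        # First ring entry at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]
```

Adding a node only reassigns the keys that now fall just before one of its virtual nodes; every other key stays where it was, which is the "minimizes data movement" property the topic covers.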
2.3 Messaging & Streaming (Folder: 2.3-messaging-streaming)
| Topic ID | Concept | Focus |
|---|---|---|
| 2.3.1 | Asynchronous Communication | Queues vs. Streams, Pub/Sub Models, Backpressure. |
| 2.3.2 | Kafka Deep Dive | Broker, Producer, Consumer Group, Partitions, Offset Management, Log Compaction. |
| 2.3.3 | Advanced Message Queues (RabbitMQ, SQS, SNS) | Comparison of broker-based vs. managed queues, Dead-Letter Queues (DLQs). |
| 2.3.4 | Distributed Transactions & Idempotency | Two-Phase Commit (2PC), Sagas, ensuring atomic operations. |
| 2.3.5 | Batch vs Stream Processing | Detailed look at the Lambda and Kappa Architectures, latency vs. completeness trade-offs. |
| 2.3.6 | Push vs Pull Data Flow | Architectural choices in messaging systems (e.g., Kafka (Pull) vs. RabbitMQ (Push)). |
| 2.3.7 | Apache Spark Deep Dive | Unified Analytics Engine, RDD/DataFrame API, In-Memory Computing, MLlib, Batch & Stream. |
| 2.3.8 | Apache Flink Deep Dive | True Stream Processing, Event-by-Event, Stateful Operators, Exactly-Once, CEP, Ultra-Low Latency. |
| 2.3.9 | Event Sourcing Deep Dive | Immutable event logs, state reconstruction, snapshots, event store design, time travel, audit trail. |
2.4 Security & Observability (Folder: 2.4-security-observability)
| Topic ID | Concept | Focus |
|---|---|---|
| 2.4.1 | Security Fundamentals | Authn/Authz (JWT), TLS/Encryption, Cross-Site Scripting (XSS) & CSRF. |
| 2.4.2 | Observability | Logging, Metrics (Prometheus), Distributed Tracing (Jaeger/Zipkin), Alerting. |
| 2.4.3 | Prometheus & Grafana Deep Dive | Metrics collection, time-series storage, PromQL, dashboards, alerting, service discovery. |
| 2.4.4 | OAuth 2.0 & JWT Deep Dive | OAuth 2.0 flows, JWT structure, token management, refresh tokens, OIDC, security best practices. |
| 2.4.5 | ELK Stack & Logging Deep Dive | Elasticsearch, Logstash, Kibana, Beats, log parsing, retention, correlation, full-text search. |
| 2.4.6 | Distributed Tracing Deep Dive | Jaeger, Zipkin, OpenTelemetry, span propagation, sampling strategies, trace correlation, performance optimization. |
2.5 Distributed Algorithms (Folder: 2.5-algorithms)
| Topic ID | Concept | Focus |
|---|---|---|
| 2.5.1 | Rate Limiting Algorithms | Token Bucket, Leaky Bucket, Fixed Window counter mechanisms. |
| 2.5.2 | Consensus Algorithms | Paxos / Raft, Distributed Locks (ZooKeeper/etcd), solving the concurrency problem. |
| 2.5.3 | Distributed Locking | |
| 2.5.4 | Bloom Filters | Intuition, Hash Functions, False Positives, use cases (e.g., CDN cache lookups). |
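To give a flavor of topic 2.5.1, the sketch below shows one way a token bucket could be written. The `TokenBucket` class, its parameter names, and the injectable `clock` argument are assumptions made for this example; the topic file itself may structure things differently.

```python
import time


class TokenBucket:
    """Toy token-bucket rate limiter (hypothetical example)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock         # injectable so the logic can be tested
        self._tokens = capacity     # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `rate=2` and `capacity=3`, a client can burst three requests immediately and is then held to two requests per second.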
2.6 Infrastructure (Folder: 2.6-infrastructure)
| Topic ID | Concept | Focus |
|---|---|---|
| 2.6.1 | Kubernetes and Docker Deep Dive | Container orchestration, pods, services, deployments, auto-scaling, service discovery. |
| 2.6.2 | Configuration Management Deep Dive | etcd, Consul, Vault, service discovery, leader election, secrets management, watch API. |
| 2.6.3 | Infrastructure as Code Deep Dive | Terraform, CloudFormation, Pulumi, state management, modules, multi-environment, CI/CD. |
📊 Each challenge folder contains 6 comprehensive files:
- [README.md] - Complete comprehensive guide with all content (primary document, replaces main design file)
- [quick-overview.md] - Quick revision guide with core concepts, architecture flows, and key takeaways
- [hld-diagram.md] - 10-15 system architecture diagrams with detailed flow explanations
- [sequence-diagrams.md] - 10-15 interaction flows with step-by-step explanations
- [this-over-that.md] - In-depth analysis of 5-10 major design decisions and trade-offs
- [pseudocode.md] - 10-20 detailed algorithm implementations with complexity analysis
📊 Each challenge includes comprehensive visual diagrams (Mermaid) for system architecture and sequence flows!
These problems require solid application of scaling fundamentals, hashing, and database choices.
| Problem ID | System Name | Key Concepts Applied |
|---|---|---|
| 3.1.1 | Design a URL Shortener | Hashing, Base62 Encoding, Read-Heavy Scaling, Sharding Key |
| 3.1.2 | Design a Distributed Cache | Consistent Hashing, Eviction Policies |
| 3.1.3 | Design a Distributed ID Generator | 64-bit ID Structure, Worker ID Assignment, Clock Drift Handling, Sequence Management |
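The concepts listed for 3.1.3 can be sketched in a few lines. The layout below (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) is a Snowflake-style assumption, and `EPOCH_MS`, `IdGenerator`, and `decode` are hypothetical names chosen for this example; the challenge folder may use a different bit split.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z, an arbitrary custom epoch (assumed)
WORKER_BITS = 10              # assumed layout: up to 1024 workers
SEQUENCE_BITS = 12            # assumed layout: up to 4096 IDs per worker per millisecond
MAX_WORKER_ID = (1 << WORKER_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class IdGenerator:
    def __init__(self, worker_id, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                # Clock moved backwards: refuse rather than risk duplicate IDs.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: spin until the next one.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


def decode(id_value):
    """Split an ID back into (timestamp_ms, worker_id, sequence)."""
    timestamp_ms = (id_value >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS
    worker_id = (id_value >> SEQUENCE_BITS) & MAX_WORKER_ID
    return timestamp_ms, worker_id, id_value & MAX_SEQUENCE
```

Because the timestamp occupies the high bits, IDs from a single worker sort by creation time. Raising an error on a backwards clock is only one possible policy; waiting for the clock to catch up is another.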
These problems involve decoupling services, handling fan-out, and managing complex data models.
| Problem ID | System Name | Key Concepts Applied |
|---|---|---|
| 3.2.1 | Design a Twitter/X Timeline | |
| 3.2.2 | Design a Notification Service | |
| 3.2.3 | Design a Distributed Web Crawler | |
| 3.2.4 | Design a Global Rate Limiter | |
These problems require advanced pattern usage, strong consistency guarantees, and managing complex real-time state.
| Problem ID | System Name | Key Concepts Applied |
|---|---|---|
| 3.3.1 | Design a Live Chat System | |
| 3.3.2 | Design Uber/Lyft Ride Matching | |
| 3.3.3 | Design an E-commerce Flash Sale | |
| 3.3.4 | Design a Distributed Database | |
| 3.4.1 | Design a Stock Exchange Matching Engine | |
| 3.4.2 | Design a Global News Feed | |
| 3.4.3 | Design a Distributed Monitoring System | |
| 3.4.4 | Design a Recommendation System | |
| 3.4.5 | Design a Stock Brokerage Platform | |
| 3.4.6 | Design a Collaborative Editor | |
| 3.4.7 | Design an Online Code Editor / Judge | |
| 3.4.8 | Design a Video Streaming System | |
| 3.5.1 | Design a Payment Gateway | |
| 3.5.2 | Design Ad Click Aggregator | |
| 3.5.3 | Design YouTube Top K (Trending Algorithm) | |
| 3.5.4 | Design Instagram/Pinterest Feed | |
| 3.5.5 | Design Live Commenting | |
| 3.5.6 | Design Yelp/Google Maps | |
| 3.5.7 | Design Authenticator App | |
| 3.5.8 | Design Single Sign-On (SSO) System | |
- System Design Reference Guide: Quick-lookup tables for latency numbers, database comparisons, caching strategies, and more.
- Resources and Further Reading: Curated books, papers, courses, blogs, and tools to deepen your knowledge.
We highly encourage community contributions to expand this resource! Before submitting a Pull Request, please read and follow these guidelines:
- Clarity and Depth: Content must maintain the project's goal: providing intuitive, easy-to-understand definitions while retaining technical depth.
- Naming Convention: All new topic files must be placed in the correct category folder (e.g., 01-principles/, 02-components/) and follow the format [ID]-[short-name].md (e.g., 1.2.1-architecture-styles.md).
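If you want to check a filename against the naming convention before opening a Pull Request, a small script along these lines may help. It is a sketch, not project tooling: the `is_valid_topic_filename` helper is hypothetical, and the lowercase, hyphen-separated short name is an assumption inferred from the example filename rather than a stated rule.

```python
import re

# Dotted numeric ID, a hyphen, a lowercase hyphen-separated short name, then ".md".
# The lowercase/hyphen restriction is an assumption based on the example filename.
TOPIC_FILE = re.compile(r"\d+(?:\.\d+)+-[a-z0-9]+(?:-[a-z0-9]+)*\.md")


def is_valid_topic_filename(name: str) -> bool:
    """Return True if `name` looks like [ID]-[short-name].md."""
    return TOPIC_FILE.fullmatch(name) is not None
```

For example, `is_valid_topic_filename("1.2.1-architecture-styles.md")` returns `True`, while a name with no ID prefix is rejected.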
Use this structure for any new concept file. The file should provide a clear progression from basic intuition to technical details.
# [ID] Topic Title: Subtitle/Focus
## Intuitive Explanation
[Start with a simple, high-level analogy or definition that a beginner can grasp.]
## In-Depth Analysis
[Dive into the technical specifics, internal workings, and algorithms.]
### Key Concepts / Tradeoffs
* **Concept 1:** ...
* **Tradeoff:** [Discuss the pros/cons of a choice, e.g., speed vs. consistency.]
## 💡 Real-World Use Cases
* [List 2-3 specific examples of companies or scenarios where this concept is applied.]
---
## ✏️ Design Challenge
[Create a concise, open-ended question that forces the reader to apply the concepts from the file.]
When adding a new design challenge to 03-challenges/, create a folder 3.x.y-problem-name/ with 6 required files:
03-challenges/3.x.y-problem-name/
├── README.md # Main comprehensive guide (primary document, replaces old main design file)
├── quick-overview.md # Quick revision guide with core concepts, architecture flows, key takeaways
├── hld-diagram.md # 10-15 architecture diagrams (Mermaid)
├── sequence-diagrams.md # 10-15 sequence diagrams (Mermaid)
├── this-over-that.md # In-depth design decision analysis
└── pseudocode.md # Algorithm implementations
The old main design file (3.x.y-design-problem-name.md) should NOT exist in the final structure. Its content should be moved to README.md, and a quick-overview.md file should be created for quick revision.
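The folder layout above can be verified mechanically. The following is a minimal sketch under stated assumptions: `check_challenge_folder` is a hypothetical helper, and the glob used to spot a leftover design file is inferred from the `3.x.y-design-problem-name.md` pattern, so adjust it if older files were named differently.

```python
from pathlib import Path

REQUIRED_FILES = (
    "README.md",
    "quick-overview.md",
    "hld-diagram.md",
    "sequence-diagrams.md",
    "this-over-that.md",
    "pseudocode.md",
)


def check_challenge_folder(folder):
    """Return a list of problems found in one 03-challenges/3.x.y-problem-name/ folder."""
    folder = Path(folder)
    problems = [
        f"missing {name}"
        for name in REQUIRED_FILES
        if not (folder / name).is_file()
    ]
    # Assumed pattern for the old main design file that should have been folded into README.md.
    for legacy in sorted(folder.glob("3.*-design-*.md")):
        problems.append(f"legacy design file present: {legacy.name}")
    return problems
```

An empty list means the folder has all six required files and no leftover design file.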
REQUIRED STRUCTURE (must follow this exact order):
# [ID] Design a [System Name]
> 📚 **Note on Implementation Details:**
> This document focuses on high-level design concepts and architectural decisions.
> For detailed algorithm implementations, see **[pseudocode.md](./pseudocode.md)**.
## 📊 Visual Diagrams & Resources
- **[High-Level Design Diagrams](./hld-diagram.md)** - System architecture, component design, data flow
- **[Sequence Diagrams](./sequence-diagrams.md)** - Detailed interaction flows and failure scenarios
- **[Design Decisions (This Over That)](./this-over-that.md)** - In-depth analysis of architectural choices
- **[Pseudocode Implementations](./pseudocode.md)** - Detailed algorithm implementations
---
## 1. Problem Statement
[Clear problem description]
---
## 2. Requirements and Scale Estimation
### Functional Requirements
* [What the system MUST do]
### Non-Functional Requirements
* **Scale:** [e.g., 500M DAU]
* **QPS:** [Read: 100k, Write: 5k]
* **Latency:** [e.g., <100ms]
### Capacity Estimation
[Back-of-envelope calculations for storage, bandwidth, QPS]
## 3. High-Level Architecture
[ASCII diagram with main components]
## 4. Data Model
[Database schemas - use ```sql for SQL only]
## 5. Component Design
[Detailed component descriptions]
## 6. Why This Over That?
[Inline explanations for major choices: DB, cache, sync/async]
* **Why PostgreSQL over MongoDB?** [Rationale with bullets]
* **Why Kafka over RabbitMQ?** [Rationale with bullets]
## 7. Bottlenecks and Scaling
[Identify bottlenecks and future scaling strategies]
## 8. Common Anti-Patterns
❌ **Anti-Pattern:** [Bad approach]
✅ **Best Practice:** [Good approach]
## 9. Alternative Approaches
[Discuss 2-3 alternative architectures not chosen]
## 10. Monitoring and Observability
[Key metrics, alerts, dashboards]
## 11. Trade-offs Summary
[Final comparison table of all major decisions]
## 12. Real-World Examples
[How Twitter, Uber, etc. solve this problem]
# Design Decisions: [System Name]
## Decision 1: [e.g., Fanout Strategy]
### The Problem
[What are we trying to solve?]
### Options Considered
| Option | Pros | Cons | Performance | Cost |
|--------|------|------|-------------|------|
| Option A | ... | ... | ... | ... |
| Option B | ... | ... | ... | ... |
### Decision Made
[What we chose and why - 3-5 bullets]
### Rationale
1. [Detailed point 1]
2. [Detailed point 2]
### Trade-offs Accepted
[What we're sacrificing]
### When to Reconsider
[Conditions that would change this decision]
[Repeat for 5-10 major decisions]
## Summary Comparison
[Final table comparing all decisions]
# Pseudocode Implementations: [System Name]
## Table of Contents
- [Section 1: Feature Name](#section-1)
- [Section 2: Feature Name](#section-2)
## Section 1: Feature Name
### function_name()
**Purpose:** One-line description
**Parameters:**
- param1: type - description
- param2: type - description
**Returns:** return_type - description
**Algorithm:**
\`\`\`
function function_name(param1, param2):
    // Detailed implementation
    return result
\`\`\`
**Time Complexity:** O(n)
**Space Complexity:** O(1)
**Example Usage:**
\`\`\`
result = function_name(arg1, arg2)
\`\`\`
[Include 10-20 functions organized by feature]
Key Requirements:
- STANDARDIZED FORMAT: All README files MUST follow this exact structure:
- Title
- "Note on Implementation Details" block (referencing pseudocode.md)
- "📊 Visual Diagrams & Resources" section (with links to all supplementary files)
- Section numbering starts at "## 1. Problem Statement"
- Continue with "## 2. Requirements...", "## 3. High-Level Architecture", etc.
- README.md: NO programming language code, NO detailed pseudocode (describe in words, reference pseudocode.md)
- quick-overview.md: Concise revision guide (300-600 lines) with core concepts, architecture flows, key design decisions, bottlenecks, anti-patterns, trade-offs, real-world examples, and key takeaways
- All diagrams MUST have flow explanations (steps, benefits, trade-offs)
- this-over-that.md: 5-10 major decisions with detailed analysis
- pseudocode.md: 10-20 functions with complexity analysis
- See 03-challenges/3.1.1-url-shortener/ as a reference implementation