Author: Rahul Surya
Last edit: 20th March 2026
Production-grade REST API for transformer model inference. Serves Phi-3-mini via Ollama with a prompt-level response cache (LRU + TTL), Server-Sent Events streaming, Prometheus metrics, and a full Docker Compose stack.
┌─────────────────────────────────────────────────────┐
│ FastAPI Application │
│ │
│ POST /v1/generate POST /v1/generate/stream │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────────┐ │
│ │ Prompt Cache│ │ SSE StreamingR. │ │
│ │ (LRU+TTL) │ │ token-by-token │ │
│ └──────┬──────┘ └────────┬─────────┘ │
│ HIT │ MISS │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ Ollama HTTP Client │ │
│ │ (async, connection-pooled) │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
│ GET /health │ GET /metrics │
└─────────────────────┼───────────────────────────────┘
│
┌───────▼────────┐
│ Ollama sidecar│
│ phi3:mini │
│ (GPU offload) │
└────────────────┘
Prompt cache — SHA-256 keyed LRU cache with TTL. Identical requests (same prompt, temperature, max_tokens, system prompt, model) are served from memory without touching the model. Delivers sub-millisecond latency on cache hits vs multi-second cold inference.
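The cache itself can be small. A minimal sketch of the idea (illustrative only, not the exact code in app/core/cache.py; class and method names are assumptions):

```python
import hashlib
import json
import time
from collections import OrderedDict


class PromptCache:
    """LRU + TTL cache keyed by a SHA-256 hash of the request parameters."""

    def __init__(self, max_size: int = 512, ttl_seconds: float = 3600.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store: OrderedDict[str, tuple[float, str]] = OrderedDict()

    @staticmethod
    def make_key(prompt: str, temperature: float, max_tokens: int,
                 system: str | None, model: str) -> str:
        # Hash every field that affects the output, so "identical request" really means identical.
        payload = json.dumps(
            {"prompt": prompt, "temperature": temperature, "max_tokens": max_tokens,
             "system": system, "model": model},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, key: str) -> str | None:
        entry = self._store.get(key)
        if entry is None:
            return None
        created_at, text = entry
        if time.monotonic() - created_at > self.ttl:
            del self._store[key]          # expired: drop it and report a miss
            return None
        self._store.move_to_end(key)      # refresh LRU position on a hit
        return text

    def put(self, key: str, text: str) -> None:
        self._store[key] = (time.monotonic(), text)
        self._store.move_to_end(key)
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)   # evict the least recently used entry
```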
Ollama backend — decouples model management from the API server. Ollama handles GPU memory, model loading, and GGUF quantisation. The inference server stays stateless and restartable without losing the loaded model.
Async throughout — httpx.AsyncClient with connection pooling for Ollama calls. FastAPI's async request handling means the server doesn't block threads during generation.
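In practice that means one shared httpx.AsyncClient for the process lifetime, created and closed in the FastAPI lifespan hook. A sketch of the non-streaming call (the wrapper class is illustrative; the payload follows Ollama's /api/generate API):

```python
import httpx


class OllamaClient:
    """Thin async wrapper around Ollama's /api/generate endpoint (illustrative)."""

    def __init__(self, base_url: str = "http://localhost:11434"):
        # One pooled client for the whole process.
        self._client = httpx.AsyncClient(
            base_url=base_url,
            timeout=httpx.Timeout(120.0, connect=5.0),
            limits=httpx.Limits(max_connections=20, max_keepalive_connections=10),
        )

    async def generate(self, prompt: str, model: str, *, system: str | None = None,
                       temperature: float = 0.7, max_tokens: int = 512) -> str:
        payload = {
            "model": model,
            "prompt": prompt,
            "system": system or "",
            "stream": False,
            "options": {"temperature": temperature, "num_predict": max_tokens},
        }
        resp = await self._client.post("/api/generate", json=payload)
        resp.raise_for_status()
        return resp.json()["response"]

    async def aclose(self) -> None:
        await self._client.aclose()
```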
Prometheus metrics — request counters, latency histograms (cached vs cold), tokens/second, active request gauge, and cache size. All scraped by the bundled Prometheus instance and viewable in Grafana.
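With prometheus_client the definitions amount to a few lines; the metric names below are illustrative rather than the exact ones exported by app/core/metrics.py:

```python
from prometheus_client import Counter, Gauge, Histogram

REQUESTS_TOTAL = Counter(
    "inference_requests_total",
    "Total generation requests",
    ["endpoint", "cached"],               # e.g. cached="true" / "false"
)

LATENCY_SECONDS = Histogram(
    "inference_latency_seconds",
    "End-to-end request latency",
    ["cached"],
    buckets=(0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10, 30),
)

TOKENS_PER_SECOND = Histogram(
    "inference_tokens_per_second",
    "Approximate generation throughput for cold requests",
)

ACTIVE_REQUESTS = Gauge("inference_active_requests", "Requests currently in flight")
CACHE_SIZE = Gauge("prompt_cache_size", "Entries currently held in the prompt cache")
```

FastAPI can expose these by mounting prometheus_client.make_asgi_app() at /metrics.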
llm-inference-server/
├── app/
│ ├── main.py # FastAPI app, lifespan, middleware
│ ├── core/
│ │ ├── config.py # Pydantic settings (env-driven)
│ │ ├── ollama.py # Async Ollama client
│ │ ├── cache.py # LRU + TTL prompt cache
│ │ ├── metrics.py # Prometheus metric definitions
│ │ └── schemas.py # Pydantic request/response models
│ └── routes/
│ ├── generate.py # /v1/generate + /v1/generate/stream
│ └── health.py # /health
├── tests/
│ ├── test_cache.py # 11 unit tests for the cache
│ └── test_generate.py # Integration tests (mocked Ollama)
├── benchmarks/
│ └── latency_benchmark.py # Cached vs cold latency comparison
├── Dockerfile
├── docker-compose.yml # inference-server + ollama + prometheus + grafana
├── prometheus.yml
├── requirements.txt
├── requirements-dev.txt
└── .github/workflows/ci.yml # pytest + docker build on every push
- Docker + Docker Compose
- NVIDIA GPU with drivers installed (for GPU offload; CPU fallback works without)
git clone https://github.com/CosmicAlgo/llm-inference-server
cd llm-inference-server
docker compose up --build

On first run Ollama pulls phi3:mini (~2.3 GB). Subsequent starts load from the ollama_models Docker volume.
| Service | URL |
|---|---|
| Inference API | http://localhost:8000 |
| API docs (Swagger) | http://localhost:8000/docs |
| Prometheus metrics | http://localhost:8000/metrics |
| Prometheus UI | http://localhost:9090 |
| Grafana | http://localhost:3000 |
# Requires Ollama running locally: https://ollama.com
ollama pull phi3:mini
pip install -r requirements.txt
uvicorn app.main:app --reload

POST /v1/generate accepts a JSON body:

{
"prompt": "Explain the attention mechanism in transformers.",
"max_tokens": 256,
"temperature": 0.7,
"system": "You are a concise technical assistant.",
"use_cache": true
}

Response:
{
"text": "The attention mechanism allows each token...",
"model": "phi3:mini",
"cached": false,
"latency_ms": 1847.3,
"prompt_tokens_approx": 11,
"completion_tokens_approx": 64
}

POST /v1/generate/stream takes the same request body and returns Server-Sent Events:
data: {"token": "The", "done": false, "cached": false}
data: {"token": " attention", "done": false, "cached": false}
...
data: {"token": "", "done": true, "latency_ms": 1923.1}
GET /health returns:

{
"status": "ok",
"model": "phi3:mini",
"ollama_reachable": true,
"cache_stats": {
"size": 12,
"max_size": 512,
"hits": 47,
"misses": 31,
"hit_rate": 0.6026,
"ttl_seconds": 3600
}
}

Benchmarks were run against phi3:mini on an RTX 4060 (8 GB VRAM), with 10 cached requests per prompt:
| Metric | Cold (no cache) | Cached | Speedup |
|---|---|---|---|
| Mean latency | 2,341 ms | 1.2 ms | 1,951x |
| Median latency | 2,218 ms | 0.9 ms | 2,464x |
| P95 latency | 3,104 ms | 2.1 ms | — |
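The measurement itself is straightforward: time a request with the cache bypassed, warm the cache once, then time repeated cached requests. A stripped-down version of that loop (a sketch, not the bundled script) might look like:

```python
import statistics
import time

import httpx

URL = "http://localhost:8000/v1/generate"
BODY = {"prompt": "Explain the attention mechanism in transformers.", "max_tokens": 64}


def timed_request(client: httpx.Client, use_cache: bool) -> float:
    """Return wall-clock latency in milliseconds for one /v1/generate call."""
    start = time.perf_counter()
    resp = client.post(URL, json={**BODY, "use_cache": use_cache}, timeout=120.0)
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000.0


with httpx.Client() as client:
    cold_ms = timed_request(client, use_cache=False)      # bypasses the cache entirely
    timed_request(client, use_cache=True)                  # warm the cache once
    cached = [timed_request(client, use_cache=True) for _ in range(10)]
    print(f"cold:   {cold_ms:.1f} ms")
    print(f"cached: mean {statistics.mean(cached):.1f} ms, "
          f"median {statistics.median(cached):.1f} ms")
```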
Run the benchmark yourself:
python benchmarks/latency_benchmark.py --url http://localhost:8000 --n 10

All settings are environment variables (or a .env file):
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `phi3:mini` | Ollama model tag |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL |
| `CACHE_ENABLED` | `true` | Enable prompt cache |
| `CACHE_MAX_SIZE` | `512` | Max cached entries (LRU eviction) |
| `CACHE_TTL_SECONDS` | `3600` | Cache entry TTL |
| `MAX_TOKENS_DEFAULT` | `512` | Default max tokens |
| `MAX_TOKENS_LIMIT` | `2048` | Hard cap on max tokens |
| `TEMPERATURE_DEFAULT` | `0.7` | Default sampling temperature |
pip install -r requirements-dev.txt
pytest tests/ -v --cov=app --cov-report=term-missing

21 tests cover cache correctness (LRU eviction, TTL expiry, key sensitivity), endpoint behaviour (cache hit/miss, streaming SSE format, error handling), and schema validation.
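As an illustration of the cache tests' shape (hypothetical; this assumes a PromptCache class like the sketch earlier, not the project's actual fixtures):

```python
from app.core.cache import PromptCache  # hypothetical import path


def test_lru_eviction_drops_oldest_entry():
    cache = PromptCache(max_size=2, ttl_seconds=3600)
    for i in range(3):
        key = PromptCache.make_key(f"prompt {i}", 0.7, 64, None, "phi3:mini")
        cache.put(key, f"answer {i}")
    oldest = PromptCache.make_key("prompt 0", 0.7, 64, None, "phi3:mini")
    assert cache.get(oldest) is None       # evicted: only the two newest entries remain


def test_key_is_sensitive_to_temperature():
    k1 = PromptCache.make_key("hello", 0.7, 64, None, "phi3:mini")
    k2 = PromptCache.make_key("hello", 0.2, 64, None, "phi3:mini")
    assert k1 != k2                        # any sampling change must produce a new cache key
```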