Author: Rahul Surya
Last edit: 20th March 2026
Production-grade REST API for transformer model inference. Serves Phi-3-mini via Ollama with a prompt-level response cache (LRU + TTL), Server-Sent Events streaming, Prometheus metrics, and a full Docker Compose stack.
┌─────────────────────────────────────────────────────┐
│ FastAPI Application │
│ │
│ POST /v1/generate POST /v1/generate/stream │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────────┐ │
│ │ Prompt Cache│ │ SSE StreamingR. │ │
│ │ (LRU+TTL) │ │ token-by-token │ │
│ └──────┬──────┘ └────────┬─────────┘ │
│ HIT │ MISS │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ Ollama HTTP Client │ │
│ │ (async, connection-pooled) │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
│ GET /health │ GET /metrics │
└─────────────────────┼───────────────────────────────┘
│
┌───────▼────────┐
│ Ollama sidecar│
│ phi3:mini │
│ (GPU offload) │
└────────────────┘
Prompt cache — SHA-256 keyed LRU cache with TTL. Identical requests (same prompt, temperature, max_tokens, system prompt, model) are served from memory without touching the model. Delivers sub-millisecond latency on cache hits vs multi-second cold inference.
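The cache itself can be small. A minimal sketch of the idea (illustrative only, not the exact code in app/core/cache.py; class and method names are assumptions):

```python
import hashlib
import json
import time
from collections import OrderedDict


class PromptCache:
    """LRU + TTL cache keyed by a SHA-256 hash of the request parameters."""

    def __init__(self, max_size: int = 512, ttl_seconds: float = 3600.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store: OrderedDict[str, tuple[float, str]] = OrderedDict()

    @staticmethod
    def make_key(prompt: str, temperature: float, max_tokens: int,
                 system: str | None, model: str) -> str:
        # Hash every field that affects the output, so "identical request" really means identical.
        payload = json.dumps(
            {"prompt": prompt, "temperature": temperature, "max_tokens": max_tokens,
             "system": system, "model": model},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, key: str) -> str | None:
        entry = self._store.get(key)
        if entry is None:
            return None
        created_at, text = entry
        if time.monotonic() - created_at > self.ttl:
            del self._store[key]          # expired: drop it and report a miss
            return None
        self._store.move_to_end(key)      # refresh LRU position on a hit
        return text

    def put(self, key: str, text: str) -> None:
        self._store[key] = (time.monotonic(), text)
        self._store.move_to_end(key)
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)   # evict the least recently used entry
```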
Ollama backend — decouples model management from the API server. Ollama handles GPU memory, model loading, and GGUF quantisation. The inference server stays stateless and restartable without losing the loaded model.
Async throughout — httpx.AsyncClient with connection pooling for Ollama calls. FastAPI's async request handling means the server doesn't block threads during generation.
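In practice that means one shared httpx.AsyncClient for the process lifetime, created and closed in the FastAPI lifespan hook. A sketch of the non-streaming call (the wrapper class is illustrative; the payload follows Ollama's /api/generate API):

```python
import httpx


class OllamaClient:
    """Thin async wrapper around Ollama's /api/generate endpoint (illustrative)."""

    def __init__(self, base_url: str = "http://localhost:11434"):
        # One pooled client for the whole process.
        self._client = httpx.AsyncClient(
            base_url=base_url,
            timeout=httpx.Timeout(120.0, connect=5.0),
            limits=httpx.Limits(max_connections=20, max_keepalive_connections=10),
        )

    async def generate(self, prompt: str, model: str, *, system: str | None = None,
                       temperature: float = 0.7, max_tokens: int = 512) -> str:
        payload = {
            "model": model,
            "prompt": prompt,
            "system": system or "",
            "stream": False,
            "options": {"temperature": temperature, "num_predict": max_tokens},
        }
        resp = await self._client.post("/api/generate", json=payload)
        resp.raise_for_status()
        return resp.json()["response"]

    async def aclose(self) -> None:
        await self._client.aclose()
```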
Prometheus metrics — request counters, latency histograms (cached vs cold), tokens/second, active request gauge, and cache size. All scraped by the bundled Prometheus instance and viewable in Grafana.
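With prometheus_client the definitions amount to a few lines; the metric names below are illustrative rather than the exact ones exported by app/core/metrics.py:

```python
from prometheus_client import Counter, Gauge, Histogram

REQUESTS_TOTAL = Counter(
    "inference_requests_total",
    "Total generation requests",
    ["endpoint", "cached"],               # e.g. cached="true" / "false"
)

LATENCY_SECONDS = Histogram(
    "inference_latency_seconds",
    "End-to-end request latency",
    ["cached"],
    buckets=(0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10, 30),
)

TOKENS_PER_SECOND = Histogram(
    "inference_tokens_per_second",
    "Approximate generation throughput for cold requests",
)

ACTIVE_REQUESTS = Gauge("inference_active_requests", "Requests currently in flight")
CACHE_SIZE = Gauge("prompt_cache_size", "Entries currently held in the prompt cache")
```

FastAPI can expose these by mounting prometheus_client.make_asgi_app() at /metrics.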
llm-inference-server/
├── app/
│ ├── main.py # FastAPI app, lifespan, middleware
│ ├── core/
│ │ ├── config.py # Pydantic settings (env-driven)
│ │ ├── ollama.py # Async Ollama client
│ │ ├── cache.py # LRU + TTL prompt cache
│ │ ├── metrics.py # Prometheus metric definitions
│ │ └── schemas.py # Pydantic request/response models
│ └── routes/
│ ├── generate.py # /v1/generate + /v1/generate/stream
│ └── health.py # /health
├── tests/
│ ├── test_cache.py # 11 unit tests for the cache
│ └── test_generate.py # Integration tests (mocked Ollama)
├── benchmarks/
│ └── latency_benchmark.py # Cached vs cold latency comparison
├── Dockerfile
├── docker-compose.yml # inference-server + ollama + prometheus + grafana
├── prometheus.yml
├── requirements.txt
├── requirements-dev.txt
└── .github/workflows/ci.yml # pytest + docker build on every push
- Docker + Docker Compose
- NVIDIA GPU with drivers installed (for GPU offload; CPU fallback works without)
git clone https://github.com/CosmicAlgo/llm-inference-server
cd llm-inference-server
docker compose up --build

On first run Ollama pulls phi3:mini (~2.3 GB). Subsequent starts load from the ollama_models Docker volume.
| Service | URL |
|---|---|
| Inference API | http://localhost:8000 |
| API docs (Swagger) | http://localhost:8000/docs |
| Prometheus metrics | http://localhost:8000/metrics |
| Prometheus UI | http://localhost:9090 |
| Grafana | http://localhost:3000 |
# Requires Ollama running locally: https://ollama.com
ollama pull phi3:mini
pip install -r requirements.txt
uvicorn app.main:app --reload

POST /v1/generate accepts a JSON body:

{
"prompt": "Explain the attention mechanism in transformers.",
"max_tokens": 256,
"temperature": 0.7,
"system": "You are a concise technical assistant.",
"use_cache": true
}

Response:
{
"text": "The attention mechanism allows each token...",
"model": "phi3:mini",
"cached": false,
"latency_ms": 1847.3,
"prompt_tokens_approx": 11,
"completion_tokens_approx": 64
}

POST /v1/generate/stream takes the same request body and returns Server-Sent Events:
data: {"token": "The", "done": false, "cached": false}
data: {"token": " attention", "done": false, "cached": false}
...
data: {"token": "", "done": true, "latency_ms": 1923.1}
GET /health returns:

{
"status": "ok",
"model": "phi3:mini",
"ollama_reachable": true,
"cache_stats": {
"size": 12,
"max_size": 512,
"hits": 47,
"misses": 31,
"hit_rate": 0.6026,
"ttl_seconds": 3600
}
}

Benchmarks were run against phi3:mini on an RTX 4060 (8 GB VRAM), with 10 cached requests per prompt:
| Metric | Cold (no cache) | Cached | Speedup |
|---|---|---|---|
| Mean latency | 2,341 ms | 1.2 ms | 1,951x |
| Median latency | 2,218 ms | 0.9 ms | 2,464x |
| P95 latency | 3,104 ms | 2.1 ms | — |
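The measurement itself is straightforward: time a request with the cache bypassed, warm the cache once, then time repeated cached requests. A stripped-down version of that loop (a sketch, not the bundled script) might look like:

```python
import statistics
import time

import httpx

URL = "http://localhost:8000/v1/generate"
BODY = {"prompt": "Explain the attention mechanism in transformers.", "max_tokens": 64}


def timed_request(client: httpx.Client, use_cache: bool) -> float:
    """Return wall-clock latency in milliseconds for one /v1/generate call."""
    start = time.perf_counter()
    resp = client.post(URL, json={**BODY, "use_cache": use_cache}, timeout=120.0)
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000.0


with httpx.Client() as client:
    cold_ms = timed_request(client, use_cache=False)      # bypasses the cache entirely
    timed_request(client, use_cache=True)                  # warm the cache once
    cached = [timed_request(client, use_cache=True) for _ in range(10)]
    print(f"cold:   {cold_ms:.1f} ms")
    print(f"cached: mean {statistics.mean(cached):.1f} ms, "
          f"median {statistics.median(cached):.1f} ms")
```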
Run the benchmark yourself:
python benchmarks/latency_benchmark.py --url http://localhost:8000 --n 10

All settings are environment variables (or a .env file):
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `phi3:mini` | Ollama model tag |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL |
| `CACHE_ENABLED` | `true` | Enable prompt cache |
| `CACHE_MAX_SIZE` | `512` | Max cached entries (LRU eviction) |
| `CACHE_TTL_SECONDS` | `3600` | Cache entry TTL |
| `MAX_TOKENS_DEFAULT` | `512` | Default max tokens |
| `MAX_TOKENS_LIMIT` | `2048` | Hard cap on max tokens |
| `TEMPERATURE_DEFAULT` | `0.7` | Default sampling temperature |
pip install -r requirements-dev.txt
pytest tests/ -v --cov=app --cov-report=term-missing

21 tests cover cache correctness (LRU eviction, TTL expiry, key sensitivity), endpoint behaviour (cache hit/miss, streaming SSE format, error handling), and schema validation.
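As an illustration of the cache tests' shape (hypothetical; this assumes a PromptCache class like the sketch earlier, not the project's actual fixtures):

```python
from app.core.cache import PromptCache  # hypothetical import path


def test_lru_eviction_drops_oldest_entry():
    cache = PromptCache(max_size=2, ttl_seconds=3600)
    for i in range(3):
        key = PromptCache.make_key(f"prompt {i}", 0.7, 64, None, "phi3:mini")
        cache.put(key, f"answer {i}")
    oldest = PromptCache.make_key("prompt 0", 0.7, 64, None, "phi3:mini")
    assert cache.get(oldest) is None       # evicted: only the two newest entries remain


def test_key_is_sensitive_to_temperature():
    k1 = PromptCache.make_key("hello", 0.7, 64, None, "phi3:mini")
    k2 = PromptCache.make_key("hello", 0.2, 64, None, "phi3:mini")
    assert k1 != k2                        # any sampling change must produce a new cache key
```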