
LLM Inference Server

Author: Rahul Surya · Last Edit Date: 20th March 2026


Production-grade REST API for transformer model inference. Serves Phi-3-mini via Ollama with a prompt-level response cache (LRU + TTL), Server-Sent Events streaming, Prometheus metrics, and a full Docker Compose stack.


Architecture

┌─────────────────────────────────────────────────────┐
│                  FastAPI Application                │
│                                                     │
│  POST /v1/generate         POST /v1/generate/stream │
│         │                          │                │
│         ▼                          ▼                │
│  ┌─────────────┐          ┌──────────────────┐      │
│  │ Prompt Cache│          │  SSE StreamingR. │      │
│  │  (LRU+TTL)  │          │  token-by-token  │      │
│  └──────┬──────┘          └────────┬─────────┘      │
│    HIT  │  MISS                   │                 │
│         ▼                         ▼                 │
│  ┌──────────────────────────────────────┐           │
│  │          Ollama HTTP Client          │           │
│  │     (async, connection-pooled)       │           │
│  └──────────────────┬───────────────────┘           │
│                     │                               │
│  GET /health        │  GET /metrics                 │
└─────────────────────┼───────────────────────────────┘
                      │
              ┌───────▼────────┐
              │  Ollama sidecar│
              │  phi3:mini     │
              │  (GPU offload) │
              └────────────────┘

Key design decisions

Prompt cache — SHA-256 keyed LRU cache with TTL. Identical requests (same prompt, temperature, max_tokens, system prompt, model) are served from memory without touching the model. Delivers sub-millisecond latency on cache hits vs multi-second cold inference.
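A minimal sketch of the idea (not the project's cache.py; class and method names are illustrative):

```python
import hashlib
import json
import time
from collections import OrderedDict

def cache_key(prompt, temperature, max_tokens, system, model):
    # Key is the SHA-256 of every field that affects the output.
    payload = json.dumps(
        {"prompt": prompt, "temperature": temperature, "max_tokens": max_tokens,
         "system": system, "model": model},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

class LRUTTLCache:
    def __init__(self, max_size=512, ttl_seconds=3600):
        self.max_size, self.ttl = max_size, ttl_seconds
        self._store = OrderedDict()                     # key -> (inserted_at, value)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        inserted_at, value = item
        if time.monotonic() - inserted_at > self.ttl:   # TTL expiry
            del self._store[key]
            return None
        self._store.move_to_end(key)                    # mark as recently used
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic(), value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:            # LRU eviction of the oldest entry
            self._store.popitem(last=False)
```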

Ollama backend — decouples model management from the API server. Ollama handles GPU memory, model loading, and GGUF quantisation. The inference server stays stateless and restartable without losing the loaded model.

Async throughout — httpx.AsyncClient with connection pooling for Ollama calls. FastAPI's async request handling means the server doesn't block threads during generation.
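A sketch of such a pooled client against Ollama's standard /api/generate endpoint (settings and helper names are illustrative, not copied from ollama.py):

```python
import httpx

# Shared client created once (e.g. in the FastAPI lifespan) so connections are pooled.
client = httpx.AsyncClient(
    base_url="http://localhost:11434",                  # OLLAMA_BASE_URL
    timeout=httpx.Timeout(120.0, connect=5.0),
    limits=httpx.Limits(max_connections=20, max_keepalive_connections=10),
)

async def generate(prompt: str, model: str = "phi3:mini") -> str:
    # Non-streaming generate: Ollama returns the full completion as a single JSON object.
    resp = await client.post(
        "/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]
```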

Prometheus metrics — request counters, latency histograms (cached vs cold), tokens/second, active request gauge, and cache size. All scraped by the bundled Prometheus instance and viewable in Grafana.
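The definitions in metrics.py presumably follow the standard prometheus_client pattern; a sketch (metric names and label sets are assumptions):

```python
from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter("llm_requests_total", "Generate requests", ["endpoint", "cached"])
LATENCY = Histogram("llm_request_latency_seconds", "Request latency", ["cached"])
TOKENS_PER_SECOND = Histogram("llm_tokens_per_second", "Decode throughput")
ACTIVE_REQUESTS = Gauge("llm_active_requests", "Requests currently in flight")
CACHE_SIZE = Gauge("llm_cache_size", "Entries currently in the prompt cache")
```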


Project structure

llm-inference-server/
├── app/
│   ├── main.py                  # FastAPI app, lifespan, middleware
│   ├── core/
│   │   ├── config.py            # Pydantic settings (env-driven)
│   │   ├── ollama.py            # Async Ollama client
│   │   ├── cache.py             # LRU + TTL prompt cache
│   │   ├── metrics.py           # Prometheus metric definitions
│   │   └── schemas.py           # Pydantic request/response models
│   └── routes/
│       ├── generate.py          # /v1/generate + /v1/generate/stream
│       └── health.py            # /health
├── tests/
│   ├── test_cache.py            # 11 unit tests for the cache
│   └── test_generate.py         # Integration tests (mocked Ollama)
├── benchmarks/
│   └── latency_benchmark.py     # Cached vs cold latency comparison
├── Dockerfile
├── docker-compose.yml           # inference-server + ollama + prometheus + grafana
├── prometheus.yml
├── requirements.txt
├── requirements-dev.txt
└── .github/workflows/ci.yml     # pytest + docker build on every push

Quickstart

Prerequisites

  • Docker + Docker Compose
  • NVIDIA GPU with drivers installed (for GPU offload; CPU fallback works without)

Run the full stack

git clone https://github.com/CosmicAlgo/llm-inference-server
cd llm-inference-server
docker compose up --build

On first run Ollama pulls phi3:mini (~2.3 GB). Subsequent starts load from the ollama_models Docker volume.
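GPU offload works by giving the ollama container access to the host GPU; in Compose that is typically a device reservation along these lines (service name and image assumed here, see docker-compose.yml for the actual definition):

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama        # persists pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```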

| Service | URL |
| --- | --- |
| Inference API | http://localhost:8000 |
| API docs (Swagger) | http://localhost:8000/docs |
| Prometheus metrics | http://localhost:8000/metrics |
| Prometheus UI | http://localhost:9090 |
| Grafana | http://localhost:3000 |

Run locally (no Docker)

# Requires Ollama running locally: https://ollama.com
ollama pull phi3:mini

pip install -r requirements.txt
uvicorn app.main:app --reload

API

POST /v1/generate

{
  "prompt": "Explain the attention mechanism in transformers.",
  "max_tokens": 256,
  "temperature": 0.7,
  "system": "You are a concise technical assistant.",
  "use_cache": true
}

Response:

{
  "text": "The attention mechanism allows each token...",
  "model": "phi3:mini",
  "cached": false,
  "latency_ms": 1847.3,
  "prompt_tokens_approx": 11,
  "completion_tokens_approx": 64
}
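For example, with the stack from the quickstart running on localhost:8000:

```bash
curl -s http://localhost:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain the attention mechanism in transformers.", "max_tokens": 256}'
```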

POST /v1/generate/stream

Same request body. Returns Server-Sent Events:

data: {"token": "The", "done": false, "cached": false}
data: {"token": " attention", "done": false, "cached": false}
...
data: {"token": "", "done": true, "latency_ms": 1923.1}
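curl's -N flag disables output buffering, so tokens print as they arrive:

```bash
curl -N http://localhost:8000/v1/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain the attention mechanism in transformers."}'
```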

GET /health

{
  "status": "ok",
  "model": "phi3:mini",
  "ollama_reachable": true,
  "cache_stats": {
    "size": 12,
    "max_size": 512,
    "hits": 47,
    "misses": 31,
    "hit_rate": 0.6026,
    "ttl_seconds": 3600
  }
}

Benchmark results

Run against phi3:mini on RTX 4060 (8 GB VRAM), 10 cached requests per prompt:

| Metric | Cold (no cache) | Cached | Speedup |
| --- | --- | --- | --- |
| Mean latency | 2,341 ms | 1.2 ms | 1,951x |
| Median latency | 2,218 ms | 0.9 ms | 2,464x |
| P95 latency | 3,104 ms | 2.1 ms | |

Run the benchmark yourself:

python benchmarks/latency_benchmark.py --url http://localhost:8000 --n 10

Configuration

All settings are environment variables (or .env file):

| Variable | Default | Description |
| --- | --- | --- |
| MODEL_NAME | phi3:mini | Ollama model tag |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama API URL |
| CACHE_ENABLED | true | Enable prompt cache |
| CACHE_MAX_SIZE | 512 | Max cached entries (LRU eviction) |
| CACHE_TTL_SECONDS | 3600 | Cache entry TTL |
| MAX_TOKENS_DEFAULT | 512 | Default max tokens |
| MAX_TOKENS_LIMIT | 2048 | Hard cap on max tokens |
| TEMPERATURE_DEFAULT | 0.7 | Default sampling temperature |
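A .env that restates the documented defaults looks like this:

```bash
# .env (values shown are the documented defaults)
MODEL_NAME=phi3:mini
OLLAMA_BASE_URL=http://localhost:11434
CACHE_ENABLED=true
CACHE_MAX_SIZE=512
CACHE_TTL_SECONDS=3600
MAX_TOKENS_DEFAULT=512
MAX_TOKENS_LIMIT=2048
TEMPERATURE_DEFAULT=0.7
```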

Tests

pip install -r requirements-dev.txt
pytest tests/ -v --cov=app --cov-report=term-missing

21 tests covering cache correctness (LRU eviction, TTL expiry, key sensitivity), endpoint behaviour (cache hit/miss, streaming SSE format, error handling), and schema validation.
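As an illustration of the cache tests, a TTL-expiry check might look like this (the import path and cache API are assumptions, not the actual test_cache.py):

```python
import time

from app.core.cache import PromptCache  # assumed class name; see app/core/cache.py

def test_ttl_expiry():
    cache = PromptCache(max_size=4, ttl_seconds=1)   # assumed constructor signature
    cache.set("key", "cached response")
    assert cache.get("key") == "cached response"     # fresh entry is served
    time.sleep(1.1)                                  # wait past the TTL
    assert cache.get("key") is None                  # expired entry is gone
```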
