An LLM infrastructure toolkit designed to decouple inference workloads from AWS Bedrock by providing automated baseline benchmarking, multi-region failover, and graceful connection draining for self-hosted, disaggregated inference clusters.
Relying solely on managed services like AWS Bedrock for LLM inference is great for getting off the ground, but at scale, rate limits and vendor lock-in quickly become bottlenecks. This repository provides a complete infrastructure toolkit for running resilient, disaggregated LLM workloads independently of AWS Bedrock, while keeping it as a reliable fallback.
By breaking the AI generation process into specialized phases handled via message queues (AWS SQS), you can run compute-heavy "prefill" tasks on one pool of GPUs and memory-heavy "decode" tasks on another. To solve the problem of dropped connections during auto-scaling, we've included a dedicated "smart drain" sidecar that guarantees a seamless user experience.
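The routing split can be sketched in a few lines: continuations go straight to the decode queue (their KV cache already lives on a decode node), while fresh prompts go to the prefill queue. The `session_id` field and the message payload below are illustrative assumptions, not the repo's exact schema; `router/router.py` holds the real classifier.

```python
import json

PREFILL_QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue"
DECODE_QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue"

def classify(request: dict) -> str:
    # A request carrying a known session id is a continuation whose KV cache
    # already sits on a decode node; anything else needs a prefill pass first.
    return DECODE_QUEUE_URL if request.get("session_id") else PREFILL_QUEUE_URL

def enqueue(request: dict) -> None:
    import boto3  # imported lazily so classify() stays usable without AWS deps
    boto3.client("sqs", region_name="us-west-2").send_message(
        QueueUrl=classify(request),
        MessageBody=json.dumps(request),
    )
```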
- Bedrock Decoupling & Fallback: Tools to benchmark self-hosted clusters against Bedrock baselines, treating managed APIs as a failover rather than a hard dependency. Multi-region search scripts ensure you never hit a hard quota wall.
- Disaggregated SQS Routing: Splits the monolithic LLM request lifecycle. Uses SQS queues to independently scale compute-bound prefill nodes (e.g., EC2 g4dn T4 instances) and memory-bound decode nodes (e.g., local RTX 4080s).
- Graceful Drain Sidecar (drain.py): A lightweight utility that solves the "lost token" problem during ECS/Kubernetes scale-in events. It intercepts cluster termination signals (SIGTERM) and polls the inference engine's active slots, ensuring all streamed responses complete successfully before the container exits.
- Datadog Integration: Built-in scripts to push custom LLM metrics (TTFT, TPS, in-flight requests) directly to Datadog for automated scaling control.
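As a sketch of the Datadog integration, a custom gauge such as TTFT can be pushed through Datadog's v1 series endpoint with only the standard library. The metric name and tags here are illustrative, not necessarily what the repo's scripts emit.

```python
import json
import time
import urllib.request

def build_payload(metric: str, value: float, tags: list) -> dict:
    # One gauge point stamped "now", in the shape Datadog's v1 series API expects
    return {"series": [{
        "metric": metric,
        "points": [[int(time.time()), value]],
        "type": "gauge",
        "tags": tags,
    }]}

def push(api_key: str, metric: str, value: float, tags: list) -> None:
    req = urllib.request.Request(
        "https://api.datadoghq.com/api/v1/series",
        data=json.dumps(build_payload(metric, value, tags)).encode(),
        headers={"DD-API-KEY": api_key, "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)  # fire-and-forget; add retries in production

# e.g. push(os.environ["DD_API_KEY"], "llm.ttft_seconds", 0.42, ["pool:decode"])
```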
- Python 3.11
- AWS CLI configured with appropriate profiles (aws-datadog-hack, kernelpro)
- Access to AWS Bedrock (GLM-4.7 models enabled)
- (Optional) Datadog API keys for metric ingestion
Deploy drain.py alongside your inference container (e.g., vLLM or llama.cpp). It requires the inference server to expose a /slots health endpoint.
```bash
export INFERENCE_URL="http://localhost:8080"
export POLL_INTERVAL_SEC="1"
export DRAIN_TIMEOUT_SEC="300"
python3 project/sidecar/drain.py
```

To establish your baseline and send metrics to Datadog:

```bash
python3 project/load-gen/add-bedrock-baseline.py
```

To simulate a burst load across your disaggregated prefill and decode architecture:

```bash
python3 project/load-gen/add-demo-burst.py --prefill <PREFILL_URL> --decode <DECODE_URL>
```

| File | Notes |
|---|---|
| load-gen/burst.py | Scenario A load generator (unchanged) |
| load-gen/add-demo-burst.py | Scenario B disaggregated burst (unchanged) |
| load-gen/add-bedrock-baseline.py | Datadog metric pump for the baseline (unchanged) |
| sidecar/drain.py | Polls worker-decode /slots instead of llama.cpp directly. Adds a POST /drain call to notify the worker before polling; log messages now reference /slots. |
| workers/worker-decode.py | SQS consumer for the decode pool. Exposes /slots and /drain. True state owner. |
| workers/worker-prefill.py | SQS consumer for the stateless prefill pool on ECS T4. |
| controller/controller.py | Real autoscaler: reads Datadog TTFT p99 and CloudWatch SQS depth, drives ECS UpdateService, and orchestrates the graceful decode drain. |
| router/router.py | HTTP classifier: new prompt → prefill-queue, session continuation → decode-queue. |
| dashboard/dashboard-setup.py | Creates the Datadog demo dashboard via the API. Run once before judges arrive. |
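The controller's scaling decision boils down to a small policy over the two signals in the table. The SLO threshold, step size, and task cap below are illustrative assumptions, not the tuned values in controller/controller.py; the ECS call mirrors the UpdateService action mentioned above.

```python
def desired_prefill_tasks(current: int, sqs_depth: int, ttft_p99: float,
                          ttft_slo: float = 0.5, max_tasks: int = 4) -> int:
    """Scale out when work is queued or latency breaches the SLO;
    scale in (after a graceful drain) only when both signals are calm."""
    if sqs_depth > 0 or ttft_p99 > ttft_slo:
        return min(current + 1, max_tasks)
    if current > 1:
        return current - 1
    return current

def apply(count: int) -> None:
    import boto3  # lazy import keeps the policy function testable offline
    boto3.client("ecs").update_service(
        cluster="kvdrainguard",
        service="kvdrainguard-prefill-service",
        desiredCount=count,
    )
```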
```
kvdrainguard/
└── project/                          ← repo root; run all docker/infra commands from here
    │
    ├── Dockerfile.prefill            ← builds the ECS prefill image
    │     FROM vllm/vllm-openai:v0.8.4
    │     adds worker-prefill.py + supervisord
    │
    ├── sidecar/
    │   └── drain.py                  ← SIGTERM handler, polls worker-decode:9090/slots
    │
    ├── controller/
    │   └── controller.py             ← autoscaler: Datadog TTFT + CloudWatch SQS + ECS
    │
    ├── router/
    │   └── router.py                 ← HTTP :8090, classifies → SQS prefill or decode queue
    │
    ├── workers/
    │   ├── worker-decode.py          ← runs LOCAL on your 4080
    │   │     SQS consumer, llama.cpp proxy
    │   │     exposes :9090/slots and :9090/drain
    │   │     drain.py polls this, NOT llama.cpp directly
    │   │
    │   └── worker-prefill.py         ← runs INSIDE the ECS container on a g4dn T4
    │         SQS consumer, vLLM proxy
    │         exposes :9091/slots
    │
    ├── load-gen/
    │   ├── burst.py                  ← Scenario A generic load gen
    │   ├── add-demo-burst.py         ← Scenario B disaggregated prefill/decode burst
    │   └── add-bedrock-baseline.py   ← Bedrock baseline + Datadog metric pump
    │
    ├── dashboard/
    │   └── dashboard-setup.py        ← creates the Datadog demo dashboard via API (run once)
    │
    └── infra/
        ├── setup-infra.sh            ← ONE-COMMAND bootstrap:
        │     quota check → key pair → security group
        │     → IAM roles → SSM secrets → ECR repo
        │     → docker build+push → ECS cluster
        │     → EC2 g4dn.2xlarge launch → task def
        │     → ECS service create → CloudWatch log group
        │
        ├── task-definition-prefill.json
        │     ← ECS task def: 8 vCPU, 28 GB RAM, 1 GPU
        │       pulls secrets from SSM Parameter Store
        │       mounts the /mnt/model-cache EBS volume
        │
        ├── supervisord.prefill.conf  ← process manager inside the container
        │     vLLM flags for the T4 (SM75):
        │       --enforce-eager        (no CUDA graphs)
        │       --dtype float16        (no bfloat16 on T4)
        │       --attention-backend xformers
        │
        └── .env.prefill.template     ← env vars for local Docker testing
```
```
Port map
────────────────────────────────────────────────────────────
Component            Host               Port   Endpoint
────────────────────────────────────────────────────────────
llama.cpp (4080)     localhost (WSL)    8080   /v1/completions
worker-decode        localhost (WSL)    9090   /slots, /drain
router               localhost (WSL)    8090   /v1/completions
cloudflare tunnel    localhost (WSL)    n/a    exposes 8080 to the internet
vLLM (T4 ECS)        EC2 public IP      8000   /v1/completions
worker-prefill       EC2 ECS sidecar    9091   /slots
────────────────────────────────────────────────────────────

What runs where
────────────────────────────────────────────────────────────
LOCAL (your WSL machine / RTX 4080):
  llama.cpp server     :8080
  worker-decode.py     :9090
  drain.py             (no port; polls worker-decode)
  router.py            :8090
  cloudflared tunnel   (exposes :8080)
  controller.py        (CPU only)

REMOTE (g4dn.2xlarge ECS container):
  vLLM server          :8000   (inside container, via supervisord)
  worker-prefill.py    :9091   (inside container, via supervisord)
────────────────────────────────────────────────────────────
```
| Component | Port | Endpoint |
|---|---|---|
| worker-decode | 9090 | GET /slots, POST /drain |
| worker-prefill | 9091 | GET /slots, GET /health |
| router | 8090 | POST /v1/completions, GET /health |
| llama.cpp (local 4080) | 8080 | POST /v1/completions (via Cloudflare Tunnel) |
| vLLM (EC2 T4) | 8000 | POST /v1/completions |
| drain sidecar | n/a | no port; polls worker-decode:9090 |
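A minimal sketch of the /slots + /drain contract that the drain sidecar relies on, using only the standard library. The JSON field names (`is_draining`, `is_processing`) are assumptions; check workers/worker-decode.py for the real schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Worker state; in the real worker this reflects the SQS loop and llama.cpp calls
STATE = {"is_draining": False, "is_processing": False}

class WorkerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/slots":
            body = json.dumps(STATE).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def do_POST(self):
        if self.path == "/drain":
            # Stop pulling new SQS messages; in-flight generation keeps running
            STATE["is_draining"] = True
            self.send_response(202)
            self.end_headers()
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep demo output quiet
        pass

# Standalone: HTTPServer(("0.0.0.0", 9090), WorkerHandler).serve_forever()
```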
```bash
# workers/worker-decode.py
SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue
LLAMACPP_URL=http://localhost:8080
DD_API_KEY=<your key>

# workers/worker-prefill.py
SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
VLLM_URL=http://localhost:8000
DD_API_KEY=<your key>

# controller/controller.py
DD_API_KEY=<your key>
DD_APP_KEY=<your app key>
ECS_CLUSTER=kvdrainguard
PREFILL_SERVICE=kvdrainguard-prefill-service
DECODE_WORKER_URL=http://localhost:9090
PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue

# sidecar/drain.py
WORKER_URL=http://localhost:9090

# router/router.py
PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue
```
```bash
# Terminal 1: llama.cpp already running on the 4080 at :8080

# Terminal 2: Cloudflare tunnel
cloudflared tunnel --url http://localhost:8080

# Terminal 3: decode worker (exposes /slots and /drain to drain.py)
SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue \
python3 workers/worker-decode.py

# Terminal 4: drain sidecar (polls worker-decode, not llama.cpp)
WORKER_URL=http://localhost:9090 python3 sidecar/drain.py &

# Terminal 5: router
PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue \
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue \
python3 router/router.py

# Terminal 6: controller
DD_API_KEY=... DD_APP_KEY=... ECS_CLUSTER=kvdrainguard \
PREFILL_SERVICE=kvdrainguard-prefill-service \
DECODE_WORKER_URL=http://localhost:9090 \
python3 controller/controller.py

# One-time: create the Datadog dashboard
DD_API_KEY=... DD_APP_KEY=... python3 dashboard/dashboard-setup.py
```

```
Controller detects: decode_depth=0, ttft < 0.5s
  → POST /drain to worker-decode:9090
  → worker-decode sets is_draining=True, stops SQS long-poll
  → in-flight llama.cpp request completes
  → worker-decode reports is_processing=False
  → drain.py polls /slots and sees 0 in-flight
  → drain.py exits cleanly (sys.exit(0))
  → ECS terminates the task (no dropped KV cache)
  → controller.drain_succeeded metric → 1 on the Datadog dashboard
```
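The sidecar's side of this sequence reduces to a SIGTERM handler plus a polling loop. The sketch below follows the flow above but is a hedged approximation, not the actual drain.py: the `is_processing` field name and the exit codes are assumptions.

```python
import json
import signal
import sys
import time
import urllib.request

WORKER_URL = "http://localhost:9090"

def drained(slots: dict) -> bool:
    # Safe default: a missing field means "assume still busy"
    return slots.get("is_processing", True) is False

def drain(poll_interval: float = 1.0, timeout: float = 300.0) -> int:
    """Poll /slots until nothing is in flight, then return 0 so the
    orchestrator can stop the task without dropping a streamed response."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{WORKER_URL}/slots") as resp:
            if drained(json.load(resp)):
                return 0  # all in-flight generation finished; safe to exit
        time.sleep(poll_interval)
    return 1  # timed out with work still in flight

def on_sigterm(signum, frame):
    sys.exit(drain())

signal.signal(signal.SIGTERM, on_sigterm)
```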