KVDrainGuard: Disaggregated Inference & Bedrock Decoupling

An LLM infrastructure toolkit designed to decouple inference workloads from AWS Bedrock by providing automated baseline benchmarking, multi-region failover, and graceful connection draining for self-hosted, disaggregated inference clusters.

📖 Overview

Relying solely on managed services like AWS Bedrock for LLM inference is great for getting off the ground, but at scale, rate limits and vendor lock-in quickly become bottlenecks. This repository provides a complete infrastructure toolkit for running resilient, disaggregated LLM workloads independently of AWS Bedrock, while keeping it as a reliable fallback.

By breaking the AI generation process into specialized phases handled via message queues (AWS SQS), you can run compute-heavy "prefill" tasks on one pool of GPUs and memory-heavy "decode" tasks on another. To solve the problem of dropped connections during auto-scaling, the toolkit includes a dedicated "smart drain" sidecar that lets in-flight streamed responses finish before a container exits.

✨ Key Features

  • Bedrock Decoupling & Fallback: Tools to benchmark self-hosted clusters against Bedrock baselines, treating managed APIs as a failover rather than a hard dependency. Multi-region search scripts ensure you never hit a hard quota wall.
  • Disaggregated SQS Routing: Splits the monolithic LLM request lifecycle. Uses SQS queues to independently scale compute-bound prefill nodes (e.g., EC2 T4s) and memory-bound decode nodes (e.g., Local RTX 4080s).
  • Graceful Drain Sidecar (drain.py): A lightweight utility that solves the "lost token" problem during ECS/Kubernetes scale-in events. It intercepts cluster termination signals (SIGTERM) and polls the inference engine's active slots, ensuring all streamed responses complete successfully before the container exits.
  • Datadog Integration: Built-in scripts to push custom LLM metrics (TTFT, TPS, in-flight requests) directly to Datadog for automated scaling control.
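Pushing a custom gauge such as TTFT to Datadog can be sketched with the v1 series API. The metric name and tags below are examples, not the exact ones the load-gen scripts emit.

```python
# Hedged sketch: submit one gauge point to Datadog's v1 series API
# (POST /api/v1/series). Metric names/tags here are illustrative.
import json
import time
import urllib.request


def build_series(metric: str, value: float, tags: list[str]) -> dict:
    """Build a one-point gauge payload for Datadog's v1 series API."""
    return {
        "series": [
            {
                "metric": metric,
                "points": [[int(time.time()), value]],
                "type": "gauge",
                "tags": tags,
            }
        ]
    }


def push(api_key: str, payload: dict) -> int:
    """POST the payload to Datadog; returns the HTTP status code."""
    req = urllib.request.Request(
        "https://api.datadoghq.com/api/v1/series",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The controller can then query these same series (e.g., TTFT p99) to drive scaling decisions.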

🚀 Getting Started

Prerequisites

  • Python 3.11
  • AWS CLI configured with appropriate profiles (aws-datadog-hack, kernelpro)
  • Access to AWS Bedrock (GLM-4.7 models enabled)
  • (Optional) Datadog API keys for metric ingestion

Running the Graceful Drain Sidecar

Deploy drain.py alongside your inference container (e.g., vLLM or llama.cpp). Note that it polls the decode worker's /slots endpoint rather than the inference engine directly, so worker-decode.py must be running and exposing /slots.

export WORKER_URL="http://localhost:9090"
export POLL_INTERVAL_SEC="1"
export DRAIN_TIMEOUT_SEC="300"

python3 project/sidecar/drain.py
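The sidecar's core behavior can be sketched as below: on SIGTERM, poll the worker until nothing is in flight (or a timeout expires), then exit. This is a minimal sketch of the assumed behavior, not drain.py's actual implementation, and the polling callable stands in for an HTTP GET of /slots.

```python
# Minimal sketch of a graceful-drain loop: on SIGTERM, poll an
# in-flight counter until it reaches zero, then exit cleanly.
import signal
import time


def wait_for_drain(in_flight, timeout_sec=300.0, interval_sec=1.0) -> bool:
    """Poll `in_flight()` (a callable returning the number of active
    requests) until it returns 0 or the timeout expires."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if in_flight() == 0:
            return True  # safe to let the container exit
        time.sleep(interval_sec)
    return False  # timed out with requests still active


def install_sigterm_handler(in_flight):
    """Register a SIGTERM handler that blocks until drained, then exits."""

    def _handler(signum, frame):
        ok = wait_for_drain(in_flight)
        raise SystemExit(0 if ok else 1)

    signal.signal(signal.SIGTERM, _handler)
```

In the real deployment, `in_flight` would issue `GET {WORKER_URL}/slots` and read the number of active requests from the JSON response.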

Benchmarking Bedrock vs. Self-Hosted

To establish your baseline and send metrics to Datadog:

python3 project/load-gen/add-bedrock-baseline.py

To simulate a burst load across your disaggregated prefill and decode architecture:

python3 project/load-gen/add-demo-burst.py --prefill <PREFILL_URL> --decode <DECODE_URL>

Contains:

| File | Notes |
| --- | --- |
| load-gen/burst.py | Scenario A load generator (unchanged) |
| load-gen/add-demo-burst.py | Scenario B disaggregated burst (unchanged) |
| load-gen/add-bedrock-baseline.py | Datadog metric pump for the Bedrock baseline (unchanged) |
| sidecar/drain.py | Polls worker-decode /slots instead of llama.cpp directly; adds a POST /drain call to notify the worker before polling; log messages now reference /slots |
| workers/worker-decode.py | SQS consumer for the decode pool; exposes /slots and /drain; the true state owner |
| workers/worker-prefill.py | SQS consumer for the stateless prefill pool on ECS T4 |
| controller/controller.py | Real autoscaler: reads Datadog TTFT p99 + CloudWatch SQS depth, drives ECS UpdateService, orchestrates the graceful decode drain |
| router/router.py | HTTP classifier: new prompt → prefill-queue, session continuation → decode-queue |
| dashboard/dashboard-setup.py | Creates the Datadog demo dashboard via API; run once before judges arrive |
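The router's classification rule (new prompt → prefill-queue, session continuation → decode-queue) can be sketched as a tiny pure function. The request field name `session_id` is an assumption; router.py may key off a different field.

```python
# Hedged sketch of the routing heuristic: requests carrying a session
# id are continuations whose KV cache lives on a decode node; fresh
# prompts go to the compute-heavy prefill pool. Field names assumed.
def pick_queue(request: dict) -> str:
    """Return the logical SQS queue a request should be routed to."""
    if request.get("session_id"):  # continuation -> decode pool
        return "decode-queue"
    return "prefill-queue"  # new prompt -> prefill pool
```

Keeping this decision in one stateless function is what lets the router itself scale horizontally.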

Full directory structure


kvdrainguard/
└── project/                          ← your repo root, run all docker/infra commands from here
    │
    ├── Dockerfile.prefill            ← builds the ECS prefill image
    │                                    FROM vllm/vllm-openai:v0.8.4
    │                                    adds worker-prefill.py + supervisord
    │
    ├── sidecar/
    │   └── drain.py                  ← SIGTERM handler, polls worker-decode:9090/slots
    │
    ├── controller/
    │   └── controller.py             ← autoscaler: Datadog TTFT + CloudWatch SQS + ECS
    │
    ├── router/
    │   └── router.py                 ← HTTP :8090, classifies → SQS prefill or decode queue
    │
    ├── workers/
    │   ├── worker-decode.py          ← runs LOCAL on your 4080
    │   │                                SQS consumer, llama.cpp proxy
    │   │                                exposes :9090/slots and :9090/drain
    │   │                                drain.py polls this, NOT llama.cpp directly
    │   │
    │   └── worker-prefill.py         ← runs INSIDE the ECS container on g4dn T4
    │                                    SQS consumer, vLLM proxy
    │                                    exposes :9091/slots
    │
    ├── load-gen/
    │   ├── burst.py                  ← Scenario A generic load gen
    │   ├── add-demo-burst.py         ← Scenario B disaggregated prefill/decode burst
    │   └── add-bedrock-baseline.py   ← Bedrock baseline + Datadog metric pump
    │
    ├── dashboard/
    │   └── dashboard-setup.py        ← creates Datadog demo dashboard via API (run once)
    │
    └── infra/
        ├── setup-infra.sh            ← ONE COMMAND bootstrap:
        │                                quota check → key pair → security group
        │                                → IAM roles → SSM secrets → ECR repo
        │                                → docker build+push → ECS cluster
        │                                → EC2 g4dn.2xlarge launch → task def
        │                                → ECS service create → CloudWatch log group
        │
        ├── task-definition-prefill.json
        │                             ← ECS task def: 8 vCPU, 28GB RAM, 1 GPU
        │                                pulls secrets from SSM Parameter Store
        │                                mounts /mnt/model-cache EBS volume
        │
        ├── supervisord.prefill.conf  ← process manager inside the container
        │                                vLLM flags for T4 SM75:
        │                                  --enforce-eager (no CUDA graphs)
        │                                  --dtype float16 (no bfloat16 on T4)
        │                                  --attention-backend xformers
        │
        └── .env.prefill.template     ← env vars for local Docker testing

Port map
────────────────────────────────────────────────────────────
Component            Host              Port   Endpoint
────────────────────────────────────────────────────────────
llama.cpp (4080)     localhost (WSL)   8080   /v1/completions
worker-decode        localhost (WSL)   9090   /slots  /drain
router               localhost (WSL)   8090   /v1/completions
cloudflare tunnel    →                        exposes 8080 to internet
vLLM (T4 ECS)        EC2 public IP     8000   /v1/completions
worker-prefill       EC2 ECS sidecar   9091   /slots
────────────────────────────────────────────────────────────


What runs where
────────────────────────────────────────────────────────────
LOCAL (your WSL machine / RTX 4080):
  llama.cpp server          :8080
  worker-decode.py          :9090
  drain.py                  (no port, polls worker-decode)
  router.py                 :8090
  cloudflared tunnel        (exposes :8080)
  controller.py             (CPU only)

REMOTE (g4dn.2xlarge ECS container):
  vLLM server               :8000  (inside container, via supervisord)
  worker-prefill.py         :9091  (inside container, via supervisord)
────────────────────────────────────────────────────────────

Endpoint reference

| Component | Port | Endpoint |
| --- | --- | --- |
| worker-decode | 9090 | GET /slots, POST /drain |
| worker-prefill | 9091 | GET /slots, GET /health |
| router | 8090 | POST /v1/completions, GET /health |
| llama.cpp (local 4080) | 8080 | POST /v1/completions (via Cloudflare Tunnel) |
| vLLM (EC2 T4) | 8000 | POST /v1/completions |
| drain sidecar | n/a | no port; polls worker-decode:9090 |

Required ENV vars

worker-decode

SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue
LLAMACPP_URL=http://localhost:8080
DD_API_KEY=<your key>

worker-prefill

SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
VLLM_URL=http://localhost:8000
DD_API_KEY=<your key>

controller

DD_API_KEY=<your key>
DD_APP_KEY=<your app key>
ECS_CLUSTER=kvdrainguard
PREFILL_SERVICE=kvdrainguard-prefill-service
DECODE_WORKER_URL=http://localhost:9090
PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue

drain sidecar

WORKER_URL=http://localhost:9090

router

PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue

Startup order (local demo)

# Terminal 1 - llama.cpp already running on 4080 at :8080
# Terminal 2 - Cloudflare tunnel
cloudflared tunnel --url http://localhost:8080

# Terminal 3 - decode worker (exposes /slots and /drain to drain.py)
SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue \
  python3 workers/worker-decode.py

# Terminal 4 - drain sidecar (polls worker-decode, not llama.cpp)
WORKER_URL=http://localhost:9090 python3 sidecar/drain.py &

# Terminal 5 - router
PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue \
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue \
  python3 router/router.py

# Terminal 6 - controller
DD_API_KEY=... DD_APP_KEY=... ECS_CLUSTER=kvdrainguard \
PREFILL_SERVICE=kvdrainguard-prefill-service \
DECODE_WORKER_URL=http://localhost:9090 \
  python3 controller/controller.py

# One-time: create Datadog dashboard
DD_API_KEY=... DD_APP_KEY=... python3 dashboard/dashboard-setup.py

Drain sequence (what judges will see)

Controller detects: decode_depth=0, ttft < 0.5s
  → POST /drain to worker-decode:9090
  → worker-decode sets is_draining=True, stops SQS long-poll
  → in-flight llama.cpp request completes
  → worker-decode reports is_processing=False
  → drain.py polls /slots → sees 0 in-flight
  → drain.py exits cleanly (sys.exit(0))
  → ECS terminates task (no dropped KV cache)
  → controller.drain_succeeded metric → 1 on Datadog dashboard
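The worker-side state that makes this sequence possible can be sketched as below. The field names mirror the sequence (is_draining, is_processing), but the real worker-decode.py may structure its state and its /slots response differently.

```python
# Hedged sketch of the state worker-decode keeps during a drain:
# POST /drain flips is_draining (stopping SQS long-polls), and
# GET /slots reports in-flight work for drain.py to poll.
import json


class WorkerState:
    def __init__(self):
        self.is_draining = False    # set by POST /drain
        self.is_processing = False  # True while a llama.cpp call is in flight

    def drain(self):
        """Handle POST /drain: stop pulling new SQS messages."""
        self.is_draining = True

    def should_poll_sqs(self) -> bool:
        """The SQS consumer loop checks this before each long-poll."""
        return not self.is_draining

    def slots(self) -> str:
        """Handle GET /slots: JSON consumed by the drain sidecar."""
        return json.dumps(
            {
                "is_draining": self.is_draining,
                "in_flight": 1 if self.is_processing else 0,
            }
        )
```

Keeping the worker as the single state owner is what lets drain.py stay a dumb poller: it never has to talk to llama.cpp itself.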
