KVDrainGuard: Disaggregated Inference & Bedrock Decoupling

An LLM infrastructure toolkit designed to decouple inference workloads from AWS Bedrock by providing automated baseline benchmarking, multi-region failover, and graceful connection draining for self-hosted, disaggregated inference clusters.

📖 Overview

Relying solely on managed services like AWS Bedrock for LLM inference is great for getting off the ground, but at scale, rate limits and vendor lock-in quickly become bottlenecks. This repository provides a complete infrastructure toolkit for running resilient, disaggregated LLM workloads independently of AWS Bedrock, while keeping it as a reliable fallback.

By breaking the AI generation process into specialized phases handled via message queues (AWS SQS), you can run compute-heavy "prefill" tasks on one pool of GPUs and memory-heavy "decode" tasks on another. To solve the problem of dropped connections during auto-scaling, the toolkit includes a dedicated "smart drain" sidecar that lets in-flight streamed responses finish before a container exits.

✨ Key Features

  • Bedrock Decoupling & Fallback: Tools to benchmark self-hosted clusters against Bedrock baselines, treating managed APIs as a failover rather than a hard dependency. Multi-region search scripts ensure you never hit a hard quota wall.
  • Disaggregated SQS Routing: Splits the monolithic LLM request lifecycle. Uses SQS queues to independently scale compute-bound prefill nodes (e.g., EC2 T4s) and memory-bound decode nodes (e.g., Local RTX 4080s).
  • Graceful Drain Sidecar (drain.py): A lightweight utility that solves the "lost token" problem during ECS/Kubernetes scale-in events. It intercepts cluster termination signals (SIGTERM) and polls the inference engine's active slots, ensuring all streamed responses complete successfully before the container exits.
  • Datadog Integration: Built-in scripts to push custom LLM metrics (TTFT, TPS, in-flight requests) directly to Datadog for automated scaling control.
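Pushing a custom gauge such as TTFT to Datadog can be sketched with the v1 series API. The metric name and tags below are examples, not the exact ones the load-gen scripts emit.

```python
# Hedged sketch: submit one gauge point to Datadog's v1 series API
# (POST /api/v1/series). Metric names/tags here are illustrative.
import json
import time
import urllib.request


def build_series(metric: str, value: float, tags: list[str]) -> dict:
    """Build a one-point gauge payload for Datadog's v1 series API."""
    return {
        "series": [
            {
                "metric": metric,
                "points": [[int(time.time()), value]],
                "type": "gauge",
                "tags": tags,
            }
        ]
    }


def push(api_key: str, payload: dict) -> int:
    """POST the payload to Datadog; returns the HTTP status code."""
    req = urllib.request.Request(
        "https://api.datadoghq.com/api/v1/series",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The controller can then query these same series (e.g., TTFT p99) to drive scaling decisions.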

🚀 Getting Started

Prerequisites

  • Python 3.11
  • AWS CLI configured with appropriate profiles (aws-datadog-hack, kernelpro)
  • Access to AWS Bedrock (GLM-4.7 models enabled)
  • (Optional) Datadog API keys for metric ingestion

Running the Graceful Drain Sidecar

Deploy drain.py alongside your inference container (e.g., vLLM or llama.cpp). Note that it polls the decode worker's /slots endpoint rather than the inference engine directly, so worker-decode.py must be running and exposing /slots.

export WORKER_URL="http://localhost:9090"
export POLL_INTERVAL_SEC="1"
export DRAIN_TIMEOUT_SEC="300"

python3 project/sidecar/drain.py
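The sidecar's core behavior can be sketched as below: on SIGTERM, poll the worker until nothing is in flight (or a timeout expires), then exit. This is a minimal sketch of the assumed behavior, not drain.py's actual implementation, and the polling callable stands in for an HTTP GET of /slots.

```python
# Minimal sketch of a graceful-drain loop: on SIGTERM, poll an
# in-flight counter until it reaches zero, then exit cleanly.
import signal
import time


def wait_for_drain(in_flight, timeout_sec=300.0, interval_sec=1.0) -> bool:
    """Poll `in_flight()` (a callable returning the number of active
    requests) until it returns 0 or the timeout expires."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if in_flight() == 0:
            return True  # safe to let the container exit
        time.sleep(interval_sec)
    return False  # timed out with requests still active


def install_sigterm_handler(in_flight):
    """Register a SIGTERM handler that blocks until drained, then exits."""

    def _handler(signum, frame):
        ok = wait_for_drain(in_flight)
        raise SystemExit(0 if ok else 1)

    signal.signal(signal.SIGTERM, _handler)
```

In the real deployment, `in_flight` would issue `GET {WORKER_URL}/slots` and read the number of active requests from the JSON response.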

Benchmarking Bedrock vs. Self-Hosted

To establish your baseline and send metrics to Datadog:

python3 project/load-gen/add-bedrock-baseline.py

To simulate a burst load across your disaggregated prefill and decode architecture:

python3 project/load-gen/add-demo-burst.py --prefill <PREFILL_URL> --decode <DECODE_URL>

Contains:

| File | Notes |
| --- | --- |
| load-gen/burst.py | Scenario A load generator (unchanged) |
| load-gen/add-demo-burst.py | Scenario B disaggregated burst (unchanged) |
| load-gen/add-bedrock-baseline.py | Datadog metric pump for the Bedrock baseline (unchanged) |
| sidecar/drain.py | Polls worker-decode /slots instead of llama.cpp directly; adds a POST /drain call to notify the worker before polling; log messages now reference /slots |
| workers/worker-decode.py | SQS consumer for the decode pool; exposes /slots and /drain; the true state owner |
| workers/worker-prefill.py | SQS consumer for the stateless prefill pool on ECS T4 |
| controller/controller.py | Real autoscaler: reads Datadog TTFT p99 + CloudWatch SQS depth, drives ECS UpdateService, orchestrates the graceful decode drain |
| router/router.py | HTTP classifier: new prompt → prefill-queue, session continuation → decode-queue |
| dashboard/dashboard-setup.py | Creates the Datadog demo dashboard via API; run once before judges arrive |
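The router's classification rule (new prompt → prefill-queue, session continuation → decode-queue) can be sketched as a tiny pure function. The request field name `session_id` is an assumption; router.py may key off a different field.

```python
# Hedged sketch of the routing heuristic: requests carrying a session
# id are continuations whose KV cache lives on a decode node; fresh
# prompts go to the compute-heavy prefill pool. Field names assumed.
def pick_queue(request: dict) -> str:
    """Return the logical SQS queue a request should be routed to."""
    if request.get("session_id"):  # continuation -> decode pool
        return "decode-queue"
    return "prefill-queue"  # new prompt -> prefill pool
```

Keeping this decision in one stateless function is what lets the router itself scale horizontally.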

Full directory structure


kvdrainguard/
└── project/                          ← your repo root, run all docker/infra commands from here
    │
    ├── Dockerfile.prefill            ← builds the ECS prefill image
    │                                    FROM vllm/vllm-openai:v0.8.4
    │                                    adds worker-prefill.py + supervisord
    │
    ├── sidecar/
    │   └── drain.py                  ← SIGTERM handler, polls worker-decode:9090/slots
    │
    ├── controller/
    │   └── controller.py             ← autoscaler: Datadog TTFT + CloudWatch SQS + ECS
    │
    ├── router/
    │   └── router.py                 ← HTTP :8090, classifies → SQS prefill or decode queue
    │
    ├── workers/
    │   ├── worker-decode.py          ← runs LOCAL on your 4080
    │   │                                SQS consumer, llama.cpp proxy
    │   │                                exposes :9090/slots and :9090/drain
    │   │                                drain.py polls this, NOT llama.cpp directly
    │   │
    │   └── worker-prefill.py         ← runs INSIDE the ECS container on g4dn T4
    │                                    SQS consumer, vLLM proxy
    │                                    exposes :9091/slots
    │
    ├── load-gen/
    │   ├── burst.py                  ← Scenario A generic load gen
    │   ├── add-demo-burst.py         ← Scenario B disaggregated prefill/decode burst
    │   └── add-bedrock-baseline.py   ← Bedrock baseline + Datadog metric pump
    │
    ├── dashboard/
    │   └── dashboard-setup.py        ← creates Datadog demo dashboard via API (run once)
    │
    └── infra/
        ├── setup-infra.sh            ← ONE COMMAND bootstrap:
        │                                quota check → key pair → security group
        │                                → IAM roles → SSM secrets → ECR repo
        │                                → docker build+push → ECS cluster
        │                                → EC2 g4dn.2xlarge launch → task def
        │                                → ECS service create → CloudWatch log group
        │
        ├── task-definition-prefill.json
        │                             ← ECS task def: 8 vCPU, 28GB RAM, 1 GPU
        │                                pulls secrets from SSM Parameter Store
        │                                mounts /mnt/model-cache EBS volume
        │
        ├── supervisord.prefill.conf  ← process manager inside the container
        │                                vLLM flags for T4 SM75:
        │                                  --enforce-eager (no CUDA graphs)
        │                                  --dtype float16 (no bfloat16 on T4)
        │                                  --attention-backend xformers
        │
        └── .env.prefill.template     ← env vars for local Docker testing

Port map
────────────────────────────────────────────────────────────
Component            Host              Port   Endpoint
────────────────────────────────────────────────────────────
llama.cpp (4080)     localhost (WSL)   8080   /v1/completions
worker-decode        localhost (WSL)   9090   /slots  /drain
router               localhost (WSL)   8090   /v1/completions
cloudflare tunnel    →                        exposes 8080 to internet
vLLM (T4 ECS)        EC2 public IP     8000   /v1/completions
worker-prefill       EC2 ECS sidecar   9091   /slots
────────────────────────────────────────────────────────────


What runs where
────────────────────────────────────────────────────────────
LOCAL (your WSL machine / RTX 4080):
  llama.cpp server          :8080
  worker-decode.py          :9090
  drain.py                  (no port, polls worker-decode)
  router.py                 :8090
  cloudflared tunnel        (exposes :8080)
  controller.py             (CPU only)

REMOTE (g4dn.2xlarge ECS container):
  vLLM server               :8000  (inside container, via supervisord)
  worker-prefill.py         :9091  (inside container, via supervisord)
────────────────────────────────────────────────────────────

Endpoint reference

| Component | Port | Endpoint |
| --- | --- | --- |
| worker-decode | 9090 | GET /slots, POST /drain |
| worker-prefill | 9091 | GET /slots, GET /health |
| router | 8090 | POST /v1/completions, GET /health |
| llama.cpp (local 4080) | 8080 | POST /v1/completions (via Cloudflare Tunnel) |
| vLLM (EC2 T4) | 8000 | POST /v1/completions |
| drain sidecar | n/a | no port; polls worker-decode:9090 |

Required ENV vars

worker-decode

SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue
LLAMACPP_URL=http://localhost:8080
DD_API_KEY=<your key>

worker-prefill

SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
VLLM_URL=http://localhost:8000
DD_API_KEY=<your key>

controller

DD_API_KEY=<your key>
DD_APP_KEY=<your app key>
ECS_CLUSTER=kvdrainguard
PREFILL_SERVICE=kvdrainguard-prefill-service
DECODE_WORKER_URL=http://localhost:9090
PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue

drain sidecar

WORKER_URL=http://localhost:9090

router

PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue

Startup order (local demo)

# Terminal 1 - llama.cpp already running on 4080 at :8080
# Terminal 2 - Cloudflare tunnel
cloudflared tunnel --url http://localhost:8080

# Terminal 3 - decode worker (exposes /slots and /drain to drain.py)
SQS_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue \
  python3 workers/worker-decode.py

# Terminal 4 - drain sidecar (polls worker-decode, not llama.cpp)
WORKER_URL=http://localhost:9090 python3 sidecar/drain.py &

# Terminal 5 - router
PREFILL_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/prefill-queue \
DECODE_QUEUE_URL=https://sqs.us-west-2.amazonaws.com/264263332155/decode-queue \
  python3 router/router.py

# Terminal 6 - controller
DD_API_KEY=... DD_APP_KEY=... ECS_CLUSTER=kvdrainguard \
PREFILL_SERVICE=kvdrainguard-prefill-service \
DECODE_WORKER_URL=http://localhost:9090 \
  python3 controller/controller.py

# One-time: create Datadog dashboard
DD_API_KEY=... DD_APP_KEY=... python3 dashboard/dashboard-setup.py

Drain sequence (what judges will see)

Controller detects: decode_depth=0, ttft < 0.5s
  → POST /drain to worker-decode:9090
  → worker-decode sets is_draining=True, stops SQS long-poll
  → in-flight llama.cpp request completes
  → worker-decode reports is_processing=False
  → drain.py polls /slots → sees 0 in-flight
  → drain.py exits cleanly (sys.exit(0))
  → ECS terminates task (no dropped KV cache)
  → controller.drain_succeeded metric → 1 on Datadog dashboard
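The worker-side state that makes this sequence possible can be sketched as below. The field names mirror the sequence (is_draining, is_processing), but the real worker-decode.py may structure its state and its /slots response differently.

```python
# Hedged sketch of the state worker-decode keeps during a drain:
# POST /drain flips is_draining (stopping SQS long-polls), and
# GET /slots reports in-flight work for drain.py to poll.
import json


class WorkerState:
    def __init__(self):
        self.is_draining = False    # set by POST /drain
        self.is_processing = False  # True while a llama.cpp call is in flight

    def drain(self):
        """Handle POST /drain: stop pulling new SQS messages."""
        self.is_draining = True

    def should_poll_sqs(self) -> bool:
        """The SQS consumer loop checks this before each long-poll."""
        return not self.is_draining

    def slots(self) -> str:
        """Handle GET /slots: JSON consumed by the drain sidecar."""
        return json.dumps(
            {
                "is_draining": self.is_draining,
                "in_flight": 1 if self.is_processing else 0,
            }
        )
```

Keeping the worker as the single state owner is what lets drain.py stay a dumb poller: it never has to talk to llama.cpp itself.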
