A script to replay Mooncake traces (https://github.com/kvcache-ai/Mooncake/blob/main/mooncake_trace.jsonl) against vLLM servers for performance testing and benchmarking.
- vLLM (installed via pip)
- Required Python packages:
pip install vllm transformers aiohttp
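A quick way to confirm the dependencies are importable before running anything (a minimal sketch; the package names are taken from the pip command above):

```python
import importlib.util

# Check that each required package can be located without fully importing it.
required = ("vllm", "transformers", "aiohttp")
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]

if missing:
    print("missing packages:", ", ".join(missing))
else:
    print("all dependencies found")
```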
# Run from outside the vLLM source directory to avoid import conflicts
cd /home/ie-user
source kobe/vllm/.venv/bin/activate
vllm serve NousResearch/Llama-3.2-1B

cd /path/to/vllm/source
bash run_mooncake_replay.sh

Modify these environment variables in run_mooncake_replay.sh, or set them in your shell before running:
MODEL="NousResearch/Llama-3.2-1B" # Model to test
HOST="localhost" # Server host
PORT="8000" # Server port
BACKEND="vllm" # Backend type
DURATION="60" # Test duration (seconds)
TIME_SCALE="1.0" # Speed up/slow down replay
PRESERVE_TIMING="true" # Keep original request timing

Example output:

Successful requests: 313
Failed requests: 0
Total duration: 60.48s
Mean TTFT: 2187.98ms
Mean TPOT: 26.59ms
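For reference, TTFT is time to first token and TPOT is time per output token; TPOT is commonly derived from end-to-end latency as decode time divided by the number of tokens after the first (a sketch of that common formula; the replay script's exact computation may differ):

```python
def mean_tpot_ms(total_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token: decode time spread over tokens after the first."""
    if output_tokens <= 1:
        return 0.0  # no decode steps to average over
    return (total_latency_ms - ttft_ms) / (output_tokens - 1)

# Hypothetical request: 5000 ms end to end, 2187.98 ms TTFT, 100 output tokens.
print(round(mean_tpot_ms(5000.0, 2187.98, 100), 2))
```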
Results are saved to mooncake_replay_results.json with detailed metrics.
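A minimal sketch for inspecting the results file afterwards (the field names inside the JSON are not documented above and are assumptions here, so this simply prints whatever scalar metrics appear at the top level):

```python
import json

def summarize(path: str) -> dict:
    """Load a replay results file and print its top-level scalar metrics."""
    with open(path) as f:
        results = json.load(f)
    for key, value in results.items():
        # Skip nested per-request data; show only scalar summary fields.
        if isinstance(value, (int, float, str)):
            print(f"{key}: {value}")
    return results
```

Call `summarize("mooncake_replay_results.json")` after a run to get a quick summary.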
- The script preserves original request timing for realistic load testing
- Multiple requests run concurrently to simulate real traffic patterns
- Ensure the vLLM server is running and accessible before starting the replay
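The last point can be automated: vLLM's OpenAI-compatible server exposes a `/health` endpoint. A minimal sketch, assuming the default host and port from the configuration above:

```python
import urllib.error
import urllib.request

def server_ready(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if the vLLM server answers its /health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Run `server_ready()` before launching the replay, and abort (or retry with backoff) if it returns False.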