Refactor Prometheus and Add Request Level Metrics #2316
Conversation
…asing values. Computing averages on the client is an anti-pattern for Prometheus metrics; averages should be computed on the Prometheus server
… Grafana dashboard
…rything a lot simpler, so squished everything back to a single file.
Hi @rib-2, thank you for your contribution. This PR is definitely in the right direction. A few things to start:
@simon-mo Sounds good. Thanks for the feedback.
For the existing logging message --> are you referring to the
Yes!
I like the idea of doing it in another PR.
…s to be compatible with prior versions (and adds back the gauges that compute avg tput for backwards compatibility)
The only outstanding item I think is the
@simon-mo requesting re-review
NikolaBorisov left a comment:
@simon-mo I think this is good, and should be merged
simon-mo left a comment:
Thank you for the great work here. And thanks @NikolaBorisov for the review.
Summary
This PR does three things:
A) Addresses the open feature request (#1870) by refactoring and extending the initial metrics implementation (#1890) to add request-level metrics
B) Creates an end-to-end example showing how to monitor vLLM with Prometheus and Grafana
C) Updates the existing metric implementations to follow Prometheus best practices, namely:
- `vllm:num_requests_running` should be `vllm_num_requests_running_total`
- Gauges that compute averages on the client should instead be Counters queried with PromQL `rate()` (Prom Docs) -> `vllm:avg_generation_throughput_toks_per_sec` should be a `Counter` called `vllm_generation_tokens_total`, and dashboards should use PromQL `rate(vllm_generation_tokens_total[5s])` to calc tokens / second.
A) Implementation
Created / updated the following classes:
- `SequenceGroup`: added a `last_token_time` variable and `get_last_latency` / `get_e2e_latency` methods, which enables us to capture the request-level latencies if logging is enabled (a timing sketch follows this list).
- `LLMEngine`: added a `PrometheusLogger` and logic to create `Stats`, making a cleaner interface between the `LLMEngine` and logging-related functionality. In `_process_model_outputs`, we call `LLMEngine._get_stats` to generate `Stats` that are passed to `PrometheusLogger.log`.
- `PrometheusLogger`: holds a list of `PrometheusMetrics` and passes the `Stats` generated by the `LLMEngine` to each.
- `PrometheusMetric`: holds a metric (an `aioprometheus` collector: `Counter`, `Gauge`, or `Histogram`) and a function to extract the appropriate data from `Stats`.
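To make the per-request timing concrete, here is a minimal sketch of the latency bookkeeping described in the `SequenceGroup` bullet above. The method names mirror the description, but the body is illustrative rather than the actual vLLM code.

```python
import time


class TimedSequenceGroup:
    """Illustrative stand-in for a SequenceGroup with request-level timing."""

    def __init__(self) -> None:
        self.arrival_time = time.monotonic()
        # Updated whenever a new token is produced for this request.
        self.last_token_time = self.arrival_time

    def get_last_latency(self, now: float) -> float:
        # First call yields time-to-first-token; later calls yield
        # inter-token latency.
        latency = now - self.last_token_time
        self.last_token_time = now
        return latency

    def get_e2e_latency(self, now: float) -> float:
        # Total wall-clock time the request has spent in the engine.
        return now - self.arrival_time
```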
PrometheusMetrics:Currently Supported Include:
counter_prompt_tokens--> used with rate() to calculate prompt token throughputcounter_generation_tokens--> used with rate() to calculate generation token throughputgauge_scheduler_runninggauge_scheduler_swappedgauge_scheduler_waitinggauge_gpu_cache_usagegauge_cpu_cache_usagehistogram_time_to_first_token--> exposes counters needed to calculate avg ttft, P50, P90, P95, P99histogram_inter_token_latency--> exposes counters needed to calculate avg itl, P50, P90, P95, P99histogram_e2e_request_latency--> exposes counters needed to calculate e2e request latency, P50, P90, P95, P99See the Example for a dashboard that shows how these exposed metrics should be monitored
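Below is a minimal sketch of how the `Stats` -> `PrometheusLogger` -> `PrometheusMetric` flow could be wired with `aioprometheus` collectors. The `Stats` fields, metric names, and exact method signatures here are assumptions for illustration, not the PR's literal code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

from aioprometheus import Counter, Gauge, Histogram


@dataclass
class Stats:
    """Snapshot assembled by the engine each logging interval (fields assumed)."""
    num_running: int = 0
    gpu_cache_usage: float = 0.0
    num_generation_tokens: int = 0
    time_to_first_tokens: List[float] = field(default_factory=list)


class PrometheusMetric:
    """Pairs one collector with a function that extracts its value(s) from Stats."""

    def __init__(self, collector: Union[Counter, Gauge, Histogram],
                 extract: Callable[[Stats], Union[float, List[float]]]) -> None:
        self.collector = collector
        self.extract = extract

    def log(self, stats: Stats, labels: Dict[str, str]) -> None:
        value = self.extract(stats)
        if isinstance(self.collector, Counter):
            self.collector.add(labels, value)
        elif isinstance(self.collector, Gauge):
            self.collector.set(labels, value)
        else:  # Histogram: observe each sample gathered during the interval.
            for sample in value:
                self.collector.observe(labels, sample)


class PrometheusLogger:
    """Holds the metric registry and fans each Stats snapshot out to it."""

    def __init__(self, labels: Dict[str, str]) -> None:
        self.labels = labels
        self.metrics = [
            PrometheusMetric(
                Counter("vllm_generation_tokens_total", "Generation tokens processed."),
                lambda s: s.num_generation_tokens),
            PrometheusMetric(
                Gauge("vllm_num_requests_running", "Requests currently running."),
                lambda s: s.num_running),
            PrometheusMetric(
                Gauge("vllm_gpu_cache_usage", "Fraction of GPU KV cache in use."),
                lambda s: s.gpu_cache_usage),
            PrometheusMetric(
                Histogram("vllm_time_to_first_token_seconds", "Time to first token."),
                lambda s: s.time_to_first_tokens),
        ]

    def log(self, stats: Stats) -> None:
        for metric in self.metrics:
            metric.log(stats, self.labels)
```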
B) Example
See examples/production_monitoring for an end-to-end example. I included a Grafana dashboard configuration which shows how these metrics should be monitored.
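As a sketch of the latency panels in that dashboard, percentiles and averages can be derived from the exposed histograms with standard PromQL. The `_seconds`-style metric names below are assumptions for illustration; use whatever names your `/metrics` endpoint actually exposes.

```promql
# P50 / P95 / P99 time-to-first-token
histogram_quantile(0.50, rate(vllm_time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.95, rate(vllm_time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))

# Average end-to-end request latency over the window
rate(vllm_e2e_request_latency_seconds_sum[5m])
  / rate(vllm_e2e_request_latency_seconds_count[5m])
```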
C) Best Practices
I recognize these changes have breaking impacts on the metrics exposed to users.
Key changes include:
- Renamed metrics (e.g., `vllm:num_requests_swapped` --> `vllm_requests_stopped_total`)
- Converted the average throughput gauges (`vllm:avg_prompt_throughput_toks_per_s` / `vllm:avg_generation_throughput_toks_per_s`) to be total tokens processed counters (`vllm_prompt_tokens_total` / `vllm_generation_tokens_total`), with throughput computed on the dashboard via `rate(vllm_prompt_tokens_total[30s])`

My sense is that this is a very new feature, so I'm not sure how much user impact there is. However, I think the changes I am suggesting are justified. I am happy to revert these if requested.
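For any dashboards or alerts that relied on the removed average-throughput gauges, a rough PromQL equivalent over the new counters (names as listed above) would be:

```promql
# Replaces vllm:avg_prompt_throughput_toks_per_s
rate(vllm_prompt_tokens_total[30s])

# Replaces vllm:avg_generation_throughput_toks_per_s
rate(vllm_generation_tokens_total[30s])
```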
Overhead
I used the benchmarking scripts to test performance with and without the logger on an L4 GPU. The added latency is very minor.
`benchmark_serving.py` Client:
`python3 benchmark_serving.py --backend vllm --tokenizer mistralai/Mistral-7B-v0.1 --dataset /home/robertgshaw/vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1.0 --num-prompts 200`

Launch with System Logging:
`python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-requests`

Launch without System Logging:
`python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-stats --disable-log-requests`

Next Steps
Next steps to finalize the PR are:
Questions
Are there any other things I need to do?