add num_concurrency_requests metric to track concurrent requests running/waiting #13799
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
vllm/v1/metrics/loggers.py
Outdated
    iteration_stats: IterationStats):
        """Log to prometheus."""
        self.gauge_scheduler_running.set(scheduler_stats.num_running_reqs)
        self.gauge_scheduler_total.set(scheduler_stats.num_running_reqs + \
It seems scheduler_stats are collected asynchronously, especially when multiprocessing is involved. For the concurrent request counter to perform effectively as a load-balancing signal, recency matters a lot. Would it be better to track the concurrency count (how many active HTTP requests) directly at the HTTP service level?
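To illustrate the reviewer's suggestion, here is a minimal sketch of counting in-flight requests at the service level rather than reading scheduler stats. The `ConcurrencyTracker` class and `handle_request` function are hypothetical names for illustration, not vLLM's actual implementation; a real server would wire this into its request middleware and export the value as a Prometheus gauge.

```python
import threading

class ConcurrencyTracker:
    """Hypothetical sketch: count active HTTP requests directly.

    Increment on request entry, decrement on exit, so the value is
    always current instead of lagging behind an async stats loop.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0

    def __enter__(self):
        with self._lock:
            self._active += 1
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._active -= 1

    @property
    def active(self):
        with self._lock:
            return self._active

tracker = ConcurrencyTracker()

def handle_request():
    # Wrap each request handler; the counter reflects in-flight
    # requests in real time, independent of scheduler stat reporting.
    with tracker:
        return tracker.active
```

The trade-off is that this counts requests the HTTP layer has accepted, which may differ slightly from what the scheduler considers running or waiting.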
This pull request has merge conflicts that must be resolved before it can be merged.
…ng/waiting Signed-off-by: Daniel Salib <[email protected]>
This approach uses AsyncLLM._log_stats to trigger metrics logging, which introduces a delay in reporting an accurate concurrent request count. For this load-balancing counter to work well, we need the value to be reflected in real time.
Makes sense! I adopted that approach in a new PR.
Add "vllm:num_requests_total" Metric for Scheduler State, which combines the total from "vllm:num_requests_running" and "vllm:num_requests_waiting" into a single metric.
This PR introduces a new metric, vllm:num_requests_total, which tracks the total number of requests running or waiting in the scheduler. This metric provides a more comprehensive view of the scheduler's state and helps with monitoring and debugging.
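The logic described above is simply the sum of the two existing scheduler gauges. A minimal sketch, using a stand-in `Gauge` class (assumed to mimic the `set` method of a `prometheus_client.Gauge`) and a hypothetical `SchedulerStats` holder for the `num_running_reqs` and `num_waiting_reqs` fields referenced in the discussion:

```python
class Gauge:
    # Minimal stand-in for a prometheus_client.Gauge (assumed interface).
    def __init__(self, name):
        self.name = name
        self.value = 0.0

    def set(self, v):
        self.value = float(v)

class SchedulerStats:
    # Hypothetical container mirroring the stats fields used in the PR.
    def __init__(self, num_running_reqs, num_waiting_reqs):
        self.num_running_reqs = num_running_reqs
        self.num_waiting_reqs = num_waiting_reqs

gauge_running = Gauge("vllm:num_requests_running")
gauge_waiting = Gauge("vllm:num_requests_waiting")
gauge_total = Gauge("vllm:num_requests_total")

def log_stats(stats):
    # The new metric is the sum of running and waiting requests,
    # exported alongside the two existing per-state gauges.
    gauge_running.set(stats.num_running_reqs)
    gauge_waiting.set(stats.num_waiting_reqs)
    gauge_total.set(stats.num_running_reqs + stats.num_waiting_reqs)
```

Exporting the sum as its own metric saves dashboards and load balancers from computing `running + waiting` at query time, at the cost of one redundant series.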