-
-
Notifications
You must be signed in to change notification settings - Fork 11.6k
[V1][Minor] Print KV cache size in token counts #13596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
| logger.info("GPU KV cache size: %s tokens", num_tokens_str) | ||
| max_model_len_str = f"{vllm_config.model_config.max_model_len:,}" | ||
| max_concurrency = num_tokens / vllm_config.model_config.max_model_len | ||
| logger.info("Maximum concurrency for %s tokens per request: %.2fx", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider putting these lines together in one log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd personally prefer them on separate lines because the first line is more important and needs to be checked every time, while the second line is less critical.
| logger.info("# GPU blocks: %d", num_blocks) | ||
| max_concurrency = (num_blocks * vllm_config.cache_config.block_size / | ||
| vllm_config.model_config.max_model_len) | ||
| num_tokens = num_blocks * vllm_config.cache_config.block_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be accurate with hybrid of cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, but I think num_blocks is already inaccurate? We should reconsider this for hybrid memory models.
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Change