
Conversation

@comaniac (Collaborator) commented Aug 16, 2024

This PR adds the prefix cache hit rate to the logged metrics. The metric is logged only when prefix caching is enabled. Here is an example:

[INFO 08-16 11:53:40 metrics.py:418] Avg prompt throughput: 2876.7 tokens/s, Avg generation throughput: 384.8 tokens/s, Running: 91 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 95.2%, CPU KV cache usage: 0.0%.
[INFO 08-16 11:53:40 metrics.py:434] Prefix cache hit rate: GPU: 22.16%, CPU: 0.00%

This PR also makes a minor improvement after #7193. Specifically, in the evictor v2 we no longer need to call .move_to_end after updating the last access time, because a hit block is always removed from the evictor and added back when it is freed. Since free_table is an ordered dict, this remove-and-reinsert cycle already keeps the blocks sorted by access time. The evictor v1 also relies on this property.
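For reference, here is a minimal sketch of that reasoning (illustrative only; the class and method names are not the actual vLLM evictor code): because free_table is an OrderedDict and a hit block is popped from the evictor and re-added when freed, the remove-and-reinsert cycle alone keeps the table ordered by last access, so update only needs to refresh the timestamp.

from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class BlockMeta:
    last_accessed: float

class LRUEvictorSketch:
    """Illustrative only; not the vLLM Evictor implementation."""

    def __init__(self):
        # block_id -> metadata, ordered by insertion (i.e., free) time.
        self.free_table: "OrderedDict[int, BlockMeta]" = OrderedDict()

    def add(self, block_id: int, last_accessed: float):
        # Called when a block is freed: it lands at the end of the table.
        self.free_table[block_id] = BlockMeta(last_accessed)

    def remove(self, block_id: int):
        # Called on a prefix-cache hit: the block leaves the evictor.
        self.free_table.pop(block_id)

    def update(self, block_id: int, last_accessed: float):
        # Only refresh the timestamp; no move_to_end is needed because a
        # hit block goes through remove() and later add(), which already
        # places it at the end of the OrderedDict.
        self.free_table[block_id].last_accessed = last_accessed

    def evict(self) -> int:
        # The first entry is the least-recently-freed block.
        block_id, _ = self.free_table.popitem(last=False)
        return block_id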

Here are some results based on my downstream task for Llama-3-8B on an L4 GPU:

| Block Manager | Hit Rate | Throughput  |
|---------------|----------|-------------|
| v1            | 18.93%   | 3614 toks/s |
| v2 (main)     | 22.16%   | 3184 toks/s |
| v2 (this PR)  | 22.16%   | 3208 toks/s |

The gap between v1 and v2 (this PR) is still under investigation and is out of scope of this PR.

cc @cadedaniel @xiaobochen123

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@cadedaniel (Collaborator) left a comment

small comments. can we add a test for at least the block manager v2 case? should be pretty easy to add at the block allocator level

class TestPrefixCachingBlockAllocator:

@comaniac (Collaborator, Author) commented

@cadedaniel comments addressed with test added. Please let me know if there's still anything missing.

Collaborator

nit: test overflow case

@comaniac (Collaborator, Author) commented Aug 16, 2024

I improved the way overflow is handled so there won't be any overflow anymore. Specifically, we keep a grouped hit rate over completed batches of 1000 queries (n*1000 queries in total, where n is an integer), and we separately maintain hit_count and query_count for the remaining fewer-than-1000 queries. We can then combine the two to get the overall hit rate:

incomplete_ratio = query_count / 1000
hit_rate = (grouped_hit_rate * n + (hit_count / query_count) * incomplete_ratio) / (n + incomplete_ratio)

Also improved the test to cover this case.
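For clarity, here is a minimal sketch of the bucketed bookkeeping described above (names are illustrative and not necessarily the actual vLLM code): completed buckets of 1000 queries are folded into a running average, and the partial bucket is combined with it in proportion to its size, so the raw counters never grow without bound.

class CacheMetricSketch:
    """Illustrative only; tracks a hit rate without unbounded counters."""

    BUCKET_SIZE = 1000

    def __init__(self):
        self.num_completed_buckets = 0   # n in the formula above
        self.completed_hit_rate = 0.0    # grouped_hit_rate
        self.hit_count = 0               # hits in the partial bucket
        self.query_count = 0             # queries in the partial bucket

    def query(self, hit: bool):
        self.query_count += 1
        self.hit_count += int(hit)
        if self.query_count >= self.BUCKET_SIZE:
            # Fold the completed bucket into the running average and
            # reset the raw counters.
            bucket_rate = self.hit_count / self.query_count
            n = self.num_completed_buckets
            self.completed_hit_rate = (self.completed_hit_rate * n +
                                       bucket_rate) / (n + 1)
            self.num_completed_buckets += 1
            self.hit_count = 0
            self.query_count = 0

    def get_hit_rate(self) -> float:
        n = self.num_completed_buckets
        if self.query_count == 0:
            return self.completed_hit_rate
        incomplete_ratio = self.query_count / self.BUCKET_SIZE
        partial_rate = self.hit_count / self.query_count
        return ((self.completed_hit_rate * n + partial_rate * incomplete_ratio)
                / (n + incomplete_ratio))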

Collaborator

SG. BTW I don't think we need this, since Python ints won't overflow.

Collaborator Author

That's true. I'm just afraid that if we host an endpoint for months, the counter will grow to a huge number, which might hurt performance.

Collaborator

I feel there will be many other performance issues in such a case in vLLM. But I don't mind this code being here, as long as it's well tested.

@comaniac added the ready label (ONLY add when PR is ready to merge / full CI is needed) Aug 16, 2024
@comaniac merged commit 3ac50b4 into vllm-project:main Aug 19, 2024
@comaniac deleted the prefix-hit-rate branch Aug 19, 2024 18:52
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

def update(self, block_id: int, last_accessed: float):
self.free_table[block_id].last_accessed = last_accessed
self.free_table.move_to_end(block_id)

Contributor

Why remove this line? The free_table will be unordered if an update op happens.

LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
