[MISC] Add prefix cache hit rate to metrics #7606
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
cadedaniel
left a comment
small comments. can we add a test for at least the block manager v2 case? should be pretty easy to add at the block allocator level
> `class TestPrefixCachingBlockAllocator:`

@cadedaniel comments addressed with test added. Please let me know if there's still anything missing.
nit: test overflow case
I improved the overflow handling so overflow can no longer occur. Specifically, we aggregate the hit rate over complete groups of 1000 queries, where n is the number of complete groups. Additionally, we maintain hit_count and query_count for the current incomplete group (fewer than 1000 queries). We can then combine the two to recover the overall hit rate:

incomplete_ratio = query_count / 1000
hit_rate = (grouped_hit_rate * n + (hit_count / query_count) * incomplete_ratio) / (n + incomplete_ratio)
Also improved the test to cover this case.
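The grouping scheme above can be sketched as follows. This is a minimal illustration of the idea, not the PR's actual implementation; the class and attribute names are hypothetical:

```python
class GroupedHitRate:
    """Sketch of the grouped hit-rate scheme described above.

    Hits are folded into a running average every GROUP_SIZE queries,
    so the raw counters stay bounded no matter how long the server runs.
    Names are illustrative, not vLLM's actual code.
    """

    GROUP_SIZE = 1000

    def __init__(self) -> None:
        self.grouped_hit_rate = 0.0  # average hit rate over complete groups
        self.num_groups = 0          # n: number of complete groups
        self.hit_count = 0           # hits in the current, incomplete group
        self.query_count = 0         # queries in the current, incomplete group

    def record(self, hit: bool) -> None:
        self.hit_count += int(hit)
        self.query_count += 1
        if self.query_count == self.GROUP_SIZE:
            # Fold the completed group into the running average.
            group_rate = self.hit_count / self.GROUP_SIZE
            self.grouped_hit_rate = (
                self.grouped_hit_rate * self.num_groups + group_rate
            ) / (self.num_groups + 1)
            self.num_groups += 1
            self.hit_count = 0
            self.query_count = 0

    def hit_rate(self) -> float:
        # Combine complete groups with the incomplete tail, weighting the
        # tail by its fraction of a full group (the formula above).
        incomplete_ratio = self.query_count / self.GROUP_SIZE
        denom = self.num_groups + incomplete_ratio
        if denom == 0:
            return 0.0
        partial = self.hit_count / self.query_count if self.query_count else 0.0
        return (self.grouped_hit_rate * self.num_groups
                + partial * incomplete_ratio) / denom
```

For example, recording 1000 hits followed by 500 misses yields a rate of 1000/1500, matching what plain counters would report, while only ever dividing by bounded quantities.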
SG. btw i don't think we need this since python int won't overflow
That's true. I'm just afraid that if we host an endpoint for months, the counters will grow to huge numbers, which might hurt performance.
I feel there will be many other performance issues in such a case in vLLM. But I don't mind this code being here, as long as it's well tested.
```python
def update(self, block_id: int, last_accessed: float):
    self.free_table[block_id].last_accessed = last_accessed
    self.free_table.move_to_end(block_id)
```
why remove this line?
The free_table will become unordered if an update happens.
This PR adds the prefix cache hit rate to the logged metrics. The metric is logged only when prefix caching is enabled. Here is an example:
This PR also makes a minor improvement after #7193. Specifically, in the evictor v2 we don't have to call `.move_to_end` after updating the last access time, because a hit block is always removed from the evictor and added back when it is freed. Since the `free_table` is an ordered dict, this process already guarantees that the blocks are sorted by access time. The evictor v1 also leverages this characteristic.

Here are some results based on my downstream task for Llama-3-8B on L4:
The gap between v1 and v2 (this PR) is still under investigation and is out of scope of this PR.
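The ordering invariant behind dropping `.move_to_end` can be sketched like this. This is an illustrative toy, not vLLM's actual evictor classes; the function names and the dict layout are assumptions:

```python
from collections import OrderedDict

# Sketch of the evictor invariant described above: a cached block is
# popped from the free table on a hit and re-appended when it is freed
# again, so the OrderedDict's insertion order already matches
# last-access order and no move_to_end is needed on update.
free_table: "OrderedDict[int, float]" = OrderedDict()

def free_block(block_id: int, access_time: float) -> None:
    # Freed blocks are appended; the most recently used block ends up last.
    free_table[block_id] = access_time

def hit_block(block_id: int) -> None:
    # A prefix-cache hit removes the block from the free table entirely.
    free_table.pop(block_id)

def evict() -> int:
    # Evict the least recently used block: the first (oldest) entry.
    block_id, _ = next(iter(free_table.items()))
    del free_table[block_id]
    return block_id

# Blocks freed in order 1, 2, 3; block 1 is then hit and freed again.
free_block(1, 0.0); free_block(2, 1.0); free_block(3, 2.0)
hit_block(1)
free_block(1, 3.0)   # re-appended: block 1 is now the most recent
assert evict() == 2  # the oldest remaining block is evicted first
```

Because every hit removes the entry and every free re-inserts it at the end, insertion order and access order never diverge, which is why the explicit reordering in `update` is redundant in v2.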
cc @cadedaniel @xiaobochen123