[MISC] Add prefix cache hit rate to metrics #7606
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
cadedaniel
left a comment
small comments. can we add a test for at least the block manager v2 case? should be pretty easy to add at the block allocator level
> `class TestPrefixCachingBlockAllocator:`

@cadedaniel comments addressed with test added. Please let me know if there's still anything missing.
nit: test overflow case
I improved the overflow handling so overflow can no longer occur. Specifically, we aggregate the hit rate over complete groups of 1000 queries, where n is the number of complete groups. Additionally, we maintain hit_count and query_count for the current incomplete group (fewer than 1000 queries). We can then combine the two to recover the overall hit rate:

incomplete_ratio = query_count / 1000
hit_rate = (grouped_hit_rate * n + (hit_count / query_count) * incomplete_ratio) / (n + incomplete_ratio)
Also improved the test to cover this case.
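The grouping scheme above can be sketched as follows. This is a minimal illustration of the idea, not the PR's actual implementation; the class and attribute names are hypothetical:

```python
class GroupedHitRate:
    """Sketch of the grouped hit-rate scheme described above.

    Hits are folded into a running average every GROUP_SIZE queries,
    so the raw counters stay bounded no matter how long the server runs.
    Names are illustrative, not vLLM's actual code.
    """

    GROUP_SIZE = 1000

    def __init__(self) -> None:
        self.grouped_hit_rate = 0.0  # average hit rate over complete groups
        self.num_groups = 0          # n: number of complete groups
        self.hit_count = 0           # hits in the current, incomplete group
        self.query_count = 0         # queries in the current, incomplete group

    def record(self, hit: bool) -> None:
        self.hit_count += int(hit)
        self.query_count += 1
        if self.query_count == self.GROUP_SIZE:
            # Fold the completed group into the running average.
            group_rate = self.hit_count / self.GROUP_SIZE
            self.grouped_hit_rate = (
                self.grouped_hit_rate * self.num_groups + group_rate
            ) / (self.num_groups + 1)
            self.num_groups += 1
            self.hit_count = 0
            self.query_count = 0

    def hit_rate(self) -> float:
        # Combine complete groups with the incomplete tail, weighting the
        # tail by its fraction of a full group (the formula above).
        incomplete_ratio = self.query_count / self.GROUP_SIZE
        denom = self.num_groups + incomplete_ratio
        if denom == 0:
            return 0.0
        partial = self.hit_count / self.query_count if self.query_count else 0.0
        return (self.grouped_hit_rate * self.num_groups
                + partial * incomplete_ratio) / denom
```

For example, recording 1000 hits followed by 500 misses yields a rate of 1000/1500, matching what plain counters would report, while only ever dividing by bounded quantities.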
SG. btw i don't think we need this since python int won't overflow
That's true. I'm just afraid that if we host an endpoint for months, the counters will grow to huge numbers, which might hurt performance.
I feel there will be many other performance issues in such a case in vLLM. But I don't mind this code being here, as long as it's well tested.
```python
def update(self, block_id: int, last_accessed: float):
    self.free_table[block_id].last_accessed = last_accessed
    self.free_table.move_to_end(block_id)
```
why remove this line?
The free_table will become unordered if an update happens.
This PR adds the prefix cache hit rate to the logged metrics. The metric is logged only when prefix caching is enabled. Here is an example:
This PR also makes a minor improvement after #7193. Specifically, in the evictor v2 we don't have to call `.move_to_end` after updating the last access time, because a hit block is always removed from the evictor and added back when it is freed. Since the `free_table` is an ordered dict, this process already guarantees that the blocks are sorted by access time. The evictor v1 also leverages this characteristic.

Here are some results based on my downstream task for Llama-3-8B on L4:
The gap between v1 and v2 (this PR) is still under investigation and is out of scope of this PR.
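The ordering invariant behind dropping `.move_to_end` can be sketched like this. This is an illustrative toy, not vLLM's actual evictor classes; the function names and the dict layout are assumptions:

```python
from collections import OrderedDict

# Sketch of the evictor invariant described above: a cached block is
# popped from the free table on a hit and re-appended when it is freed
# again, so the OrderedDict's insertion order already matches
# last-access order and no move_to_end is needed on update.
free_table: "OrderedDict[int, float]" = OrderedDict()

def free_block(block_id: int, access_time: float) -> None:
    # Freed blocks are appended; the most recently used block ends up last.
    free_table[block_id] = access_time

def hit_block(block_id: int) -> None:
    # A prefix-cache hit removes the block from the free table entirely.
    free_table.pop(block_id)

def evict() -> int:
    # Evict the least recently used block: the first (oldest) entry.
    block_id, _ = next(iter(free_table.items()))
    del free_table[block_id]
    return block_id

# Blocks freed in order 1, 2, 3; block 1 is then hit and freed again.
free_block(1, 0.0); free_block(2, 1.0); free_block(3, 2.0)
hit_block(1)
free_block(1, 3.0)   # re-appended: block 1 is now the most recent
assert evict() == 2  # the oldest remaining block is evicted first
```

Because every hit removes the entry and every free re-inserts it at the end, insertion order and access order never diverge, which is why the explicit reordering in `update` is redundant in v2.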
cc @cadedaniel @xiaobochen123