-
-
Notifications
You must be signed in to change notification settings - Fork 12k
[Core] Don't count preempted tokens in prefix cache hit rate #25787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Zhuohan Li <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly modifies the prefix cache hit rate calculation to exclude tokens from preempted requests. The changes are implemented by adding a preemption counter to requests, which is then used to separate statistics for new and preempted requests. The implementation is clean and directly addresses the issue. The associated refactoring in the scheduler improves code readability. I have reviewed the changes and found no issues.
| self.encoder_cache_manager.free(preempted_req) | ||
| preempted_req.status = RequestStatus.PREEMPTED | ||
| preempted_req.num_computed_tokens = 0 | ||
| preempted_req.num_preemptions += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this line is the only actual change in this file, all other changes are code style change.
WoosukKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for doing this.
…oject#25787) Signed-off-by: Zhuohan Li <[email protected]>
Signed-off-by: Zhuohan Li <[email protected]> Signed-off-by: yewentao256 <[email protected]>
…oject#25787) Signed-off-by: Zhuohan Li <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…oject#25787) Signed-off-by: Zhuohan Li <[email protected]>
…oject#25787) Signed-off-by: Zhuohan Li <[email protected]>
…oject#25787) Signed-off-by: Zhuohan Li <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…oject#25787) Signed-off-by: Zhuohan Li <[email protected]>
…oject#25787) Signed-off-by: Zhuohan Li <[email protected]>
Purpose
Do not count preempted tokens in prefix cache hit rate. See #25780 for details.
Test
Existing test should pass, and now when we run:
We see
Prefix cache hit ratedroped from 30% to 0.2%.Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.