[Performance] Cache loaded custom logitsprocs to avoid overheads #28462
Conversation
Signed-off-by: Isotr0py <[email protected]>
Code Review
This pull request introduces a performance optimization by caching the results of loading custom logits processors using lru_cache. This is a valuable improvement to reduce overhead. However, I've identified a critical issue where using lru_cache with a potentially unhashable list argument will lead to a TypeError at runtime. I have provided a review comment with a suggested fix to ensure the argument is always hashable.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```diff
+cached_load_custom_logitsprocs = lru_cache(_load_custom_logitsprocs)


 def validate_logits_processors_parameters(
     logits_processors: Sequence[str | type[LogitsProcessor]] | None,
     sampling_params: SamplingParams,
 ):
-    for logits_procs in _load_custom_logitsprocs(logits_processors):
+    for logits_procs in cached_load_custom_logitsprocs(logits_processors):
```
Cache assumes hashable logits processor sequences
The new lru_cache wrapper is invoked with logits_processors as the cache key, but the callers in the OpenAI serving paths pass the model’s logits_processors list directly. Lists are unhashable, so cached_load_custom_logitsprocs(self.logits_processors) will now raise TypeError: unhashable type: 'list' as soon as a request is validated when custom processors are configured (e.g. via --logits-processors). This previously worked because _load_custom_logitsprocs accepted any sequence. Consider normalizing the argument to a tuple (or similar) before caching to avoid runtime failures for users supplying custom logits processors.
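One shape the suggested normalization could take (a sketch under the reviewer's assumptions, not the PR's actual code — `_load_impl` is a hypothetical stand-in for the real loader):

```python
from functools import lru_cache


def _load_impl(processors):
    # Hypothetical stand-in for the real (expensive) loading logic.
    return list(processors or ())


@lru_cache
def _cached_load(processors_key):
    return _load_impl(processors_key)


def cached_load_custom_logitsprocs(logits_processors):
    # Lists are unhashable; normalize to a tuple (or None) so the
    # argument can serve as an lru_cache key.
    key = tuple(logits_processors) if logits_processors is not None else None
    return _cached_load(key)


print(cached_load_custom_logitsprocs(["a.b.MyProc"]))  # no TypeError now
```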
Signed-off-by: Isotr0py <[email protected]>
/gemini review
Code Review
This pull request introduces caching for loading custom logits processors to improve performance, which is a valuable optimization. The implementation correctly uses lru_cache and handles the non-hashable Sequence type by converting it to a tuple. My review includes one high-severity suggestion to further improve the caching efficiency by applying it more granularly to avoid redundant work when loading plugins.
```python
)


cached_load_custom_logitsprocs = lru_cache(_load_custom_logitsprocs)
```
While using lru_cache on _load_custom_logitsprocs is a good performance improvement, it can be made more efficient. _load_custom_logitsprocs internally calls _load_logitsprocs_plugins(), which does not depend on any arguments. With the current implementation, _load_logitsprocs_plugins() will be re-executed for every cache miss of cached_load_custom_logitsprocs (i.e., for each new logits_processors value).
To avoid this repeated work, _load_logitsprocs_plugins() should be cached independently. The ideal solution would be to apply @lru_cache directly to _load_logitsprocs_plugins and _load_logitsprocs_by_fqcns. This would require modifying those functions, which are outside the current diff.
For example:

```python
@lru_cache(maxsize=None)
def _load_logitsprocs_plugins() -> list[type[LogitsProcessor]]:
    # ... function body


@lru_cache(maxsize=None)
def _load_logitsprocs_by_fqcns(
    logits_processors: tuple[str | type[LogitsProcessor], ...] | None,
) -> list[type[LogitsProcessor]]:
    # ... function body
```

Then _load_custom_logitsprocs can call these cached functions, and validate_logits_processors_parameters can call _load_custom_logitsprocs directly without an extra caching layer. This would be the most efficient implementation.
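A self-contained sketch of this layering, using toy stand-ins rather than vLLM's real loaders, shows why it helps: plugin discovery runs once per process even as the per-request argument varies.

```python
from functools import lru_cache

calls = {"plugins": 0}


@lru_cache(maxsize=None)
def _load_logitsprocs_plugins():
    # Imagine expensive entry-point discovery here; cached, it runs once.
    calls["plugins"] += 1
    return ["PluginProc"]


@lru_cache(maxsize=None)
def _load_logitsprocs_by_fqcns(fqcns):
    # fqcns must be a hashable tuple (or None) to be a valid cache key.
    return [f"loaded:{f}" for f in (fqcns or ())]


def _load_custom_logitsprocs(logits_processors):
    key = tuple(logits_processors) if logits_processors else None
    return _load_logitsprocs_plugins() + _load_logitsprocs_by_fqcns(key)


_load_custom_logitsprocs(["a.B"])
_load_custom_logitsprocs(["c.D"])  # a new outer argument (cache miss)...
assert calls["plugins"] == 1       # ...yet plugin discovery ran only once
```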
…m-project#28462) Signed-off-by: Isotr0py <[email protected]>
…m-project#28462) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: George D. Torres <[email protected]>
) Signed-off-by: Isotr0py <[email protected]> (cherry picked from commit 3f770f4)
…m-project#28462) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Xingyu Liu <[email protected]>
Purpose
Discussion: https://vllm-dev.slack.com/archives/C08U97ZRC0J/p1762826798388249
_load_custom_logitsprocs can cause a significant performance regression when validating sampling params, because it tries to import the custom logitsprocs on every call. This PR uses lru_cache to cache the loaded custom logits processors and avoid the duplicated imports.
Test Plan
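The benchmark details in the original test plan did not survive extraction; a micro-benchmark in the same spirit (a toy loader and hypothetical processor name, not vLLM's code) might look like:

```python
import time
from functools import lru_cache


def slow_load(names):
    time.sleep(0.01)  # simulate the repeated-import overhead
    return list(names)


# Same wrapping pattern as the PR: cache the loader as-is.
cached_load = lru_cache(slow_load)

names = ("pkg.mod.MyLogitsProcessor",)

t0 = time.perf_counter()
for _ in range(100):
    slow_load(names)
t1 = time.perf_counter()
for _ in range(100):
    cached_load(names)  # only the first call pays the load cost
t2 = time.perf_counter()

print(f"uncached: {t1 - t0:.3f}s, cached: {t2 - t1:.3f}s")
```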
Test Result
Main
PR
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.