Add Multimodal Processor Benchmark #29105
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀

Documentation preview: https://vllm--29105.org.readthedocs.build/en/29105/
💡 Codex Review
https://github.com/vllm-project/vllm/blob/c4bab7adcbcd8c179e064ae9d8620426523c55d2/benchmarks/multimodal_processor.py#L335-L337
Clearing stats before collection wipes benchmark results
The multimodal benchmark clears the server’s MM processor stats registry right after the run and before fetching stats (should_clear_registry defaults to True). When running with VLLM_ENABLE_MM_PROCESSOR_STATS=1 and VLLM_SERVER_DEV_MODE=1 (required for the /clear_mm_processor_stats endpoint the client calls), this POST removes the just-recorded measurements so collect_mm_processor_stats immediately reads an empty registry and the benchmark prints zeros/warnings instead of the run’s timings. The registry should be cleared before starting a run to drop stale data, not immediately before retrieval.
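The ordering bug Codex describes can be modeled with a minimal in-memory sketch (the `MMStatsRegistry` class and function names below are hypothetical stand-ins for the server-side registry behind the `/clear_mm_processor_stats` endpoint, not vLLM's actual API):

```python
class MMStatsRegistry:
    """Hypothetical stand-in for the server-side MM processor stats registry."""

    def __init__(self) -> None:
        self.samples: list[float] = []

    def clear(self) -> None:
        self.samples.clear()

    def record(self, t: float) -> None:
        self.samples.append(t)


def run_benchmark_buggy(registry: MMStatsRegistry, timings: list[float]) -> list[float]:
    # Current behavior: record the run, then clear right before retrieval.
    for t in timings:
        registry.record(t)
    registry.clear()  # bug: wipes the just-recorded measurements
    return list(registry.samples)  # always empty


def run_benchmark_fixed(registry: MMStatsRegistry, timings: list[float]) -> list[float]:
    # Suggested ordering: clear stale data BEFORE the run starts...
    registry.clear()
    for t in timings:  # the run records its measurements
        registry.record(t)
    # ...and collect afterwards without clearing in between.
    return list(registry.samples)
```

With this ordering, `collect_mm_processor_stats` would read the run's actual timings instead of an empty registry.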
Isotr0py left a comment:
Thanks for implementing this feature! I've left some initial comments. PTAL!
```python
@dataclass
class MultiModalProcessorTimingStats:
    """Per-request timing statistics for multimodal processor stages."""
```
Let's move the stats objects to a new file multimodal/stats.py.
I feel that making the stats container a global is a bit hacky. Perhaps it's better to have it as a member of InputProcessingContext.
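Isotr0py's suggestion could look roughly like this minimal sketch; the field name `mm_processor_stats` and the trimmed-down `InputProcessingContext` are illustrative assumptions, not vLLM's actual definitions:

```python
from dataclasses import dataclass, field


@dataclass
class MultiModalProcessorTimingStats:
    """Per-request timing statistics (fields here are illustrative)."""

    hf_processor_time: float = 0.0
    total_time: float = 0.0


@dataclass
class InputProcessingContext:
    """Sketch: the context owns its stats instead of a module-level global."""

    mm_processor_stats: MultiModalProcessorTimingStats = field(
        default_factory=MultiModalProcessorTimingStats
    )
```

Because each context gets its own `MultiModalProcessorTimingStats` via `default_factory`, stats from concurrent or sequential contexts no longer share mutable global state.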
```python
"VLLM_ENABLE_MM_PROCESSOR_STATS": lambda: bool(
    int(os.getenv("VLLM_ENABLE_MM_PROCESSOR_STATS", "0"))
),
```
Hmmm, our environment variables are quite bloated. Perhaps we can move this to MultimodalConfig?
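One way the move could look — a config field that keeps the env var as a fallback default. The field name `enable_mm_processor_stats` and this pared-down `MultimodalConfig` are assumptions for illustration; vLLM's real config class has other fields:

```python
import os
from dataclasses import dataclass, field


def _stats_enabled_from_env() -> bool:
    # Same semantics as the env-var lambda above: unset or "0" -> False.
    return bool(int(os.getenv("VLLM_ENABLE_MM_PROCESSOR_STATS", "0")))


@dataclass
class MultimodalConfig:
    """Sketch only; the real MultimodalConfig carries other settings too."""

    # Evaluated at instantiation time, so explicit config overrides still work.
    enable_mm_processor_stats: bool = field(default_factory=_stats_enabled_from_env)
```

Callers would then read `config.enable_mm_processor_stats` instead of consulting `envs` directly, which keeps the env-var surface smaller.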
```python
def _get_instance_urls(
    args: argparse.Namespace | None,
    base_url: str | None,
) -> list[str]:
    """
    Get the instance URL for stats collection.

    Args:
        args: Command-line arguments (may be None)
        base_url: Base URL for the server (may be None)

    Returns:
        List containing a single instance URL
    """
    if base_url:
        return [base_url]

    host = getattr(args, "host", "127.0.0.1") if args else "127.0.0.1"
    port = getattr(args, "port", 8000) if args else 8000
    return [f"http://{host}:{port}"]
```
IIRC, all bench commands except vllm bench serve use offline inference?
I think we can extend the MM processor benchmark functionality in vllm bench serve for online serving, and leave vllm bench multimodal-processor to use offline inference. 🤔
```python
if "mm_processor_stats" in result:
    print("\nMM Processor Timing (ms):")
    selected_percentiles = [
        float(p) for p in getattr(args, "metric_percentiles", "99").split(",")
    ]
    for stage, metrics in result["mm_processor_stats"].items():
        print(f"  {stage}:")
        print(f"    Mean: {metrics['mean']:.2f}")
        print(f"    Median: {metrics['median']:.2f}")
        print(f"    Std: {metrics['std']:.2f}")
        for p in selected_percentiles:
            print(f"    P{p}: {metrics.get(f'p{p}', 0.0):.2f}")
```
I think we can format the output's display as a table using a pandas DataFrame. The current output is a bit long, with duplicated field labels:

```text
hf_processor_time:
  Mean: 204.14
  Median: 201.04
  Std: 48.78
  P99.0: 319.08
hashing_time:
  Mean: 1.58
  Median: 1.94
  Std: 1.25
  P99.0: 5.57
cache_lookup_time:
  Mean: 0.16
  Median: 0.14
  Std: 0.10
  P99.0: 0.44
prompt_update_time:
  Mean: 7.30
  Median: 7.02
  Std: 0.96
  P99.0: 8.80
total_time:
  Mean: 232.79
  Median: 211.02
  Std: 208.11
  P99.0: 349.57
```
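A tabular version of that output could be built roughly like this; the numbers are copied from the sample above, while the pandas dependency and the exact column layout are assumptions, not part of the PR:

```python
import pandas as pd

# Sample stats copied from the benchmark output above (ms).
stats = {
    "hf_processor_time": {"mean": 204.14, "median": 201.04, "std": 48.78, "p99": 319.08},
    "hashing_time": {"mean": 1.58, "median": 1.94, "std": 1.25, "p99": 5.57},
    "cache_lookup_time": {"mean": 0.16, "median": 0.14, "std": 0.10, "p99": 0.44},
    "prompt_update_time": {"mean": 7.30, "median": 7.02, "std": 0.96, "p99": 8.80},
    "total_time": {"mean": 232.79, "median": 211.02, "std": 208.11, "p99": 349.57},
}

# orient="index" makes each stage a row and each metric a column.
df = pd.DataFrame.from_dict(stats, orient="index")
print(df.round(2).to_string())
```

One row per stage with metrics as columns removes the repeated `Mean:`/`Median:`/`Std:` labels and keeps the output to a handful of lines regardless of how many percentiles are selected.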
Resolves #24171.
Purpose
This PR adds a benchmark for the multimodal processor. It benchmarks the MM processor on a single vLLM instance on a single GPU.
Originally, I planned to additionally test (1) DP on multiple GPUs and (2) multiple instances on a single GPU. On second thought, testing DP seems equivalent to just running the benchmark multiple times. For multiple instances on a single GPU, I would like direction on which type of load balancing to use: I could use nginx round-robin, or is the request router from the production_stack repo recommended? This also seems like a good stopping place, since the PR is already around 800 lines.
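If nginx round-robin turns out to be the preferred option for point (2), a minimal sketch could look like this — the ports, upstream name, and listen address are placeholders, assuming two vLLM instances on one machine:

```nginx
upstream vllm_instances {
    # nginx's default balancing method is round-robin; no extra directive needed.
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_instances;
    }
}
```

The benchmark client would then point at the nginx port, and requests would alternate between the two instances.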
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.