
Conversation


@reaganjlee reaganjlee commented Nov 20, 2025

Resolves #24171.

Purpose

This PR adds a benchmark for the multimodal processor. It measures the MM processor's performance on a single vLLM instance on a single GPU.

Originally, I planned to additionally test (1) DP on multiple GPUs and (2) multiple instances on a single GPU. On second thought, testing (1) DP seems equivalent to just running the benchmark multiple times. For (2) multiple instances on a single GPU, I would like direction on which type of load balancing to use: nginx round-robin, or is the request router from the production_stack repo recommended? This also seems like a good stopping point, since the PR is already around 800 lines.

Test Plan

# Using just HuggingFaceTB/SmolVLM-Instruct to save on compute during development, and the random-mm dataset for simplicity
vllm bench multimodal-processor \
    --model HuggingFaceTB/SmolVLM-Instruct \
    --dataset-name random-mm \
    --num-prompts 100 \
    --max-concurrency 10

# Higher concurrency
vllm bench multimodal-processor \
    --model HuggingFaceTB/SmolVLM-Instruct \
    --dataset-name random-mm \
    --num-prompts 100 \
    --max-concurrency 50

Test Result

# Example result
# vllm bench multimodal-processor --model HuggingFaceTB/SmolVLM-Instruct \
#     --dataset-name random-mm --num-prompts 100 --max-concurrency 10
================================================================================
Multimodal Processor Benchmark Results
================================================================================

MM Processor Timing (ms):
  hf_processor_time:
    Mean: 204.14
    Median: 201.04
    Std: 48.78
    P99.0: 319.08
  hashing_time:
    Mean: 1.58
    Median: 1.94
    Std: 1.25
    P99.0: 5.57
  cache_lookup_time:
    Mean: 0.16
    Median: 0.14
    Std: 0.10
    P99.0: 0.44
  prompt_update_time:
    Mean: 7.30
    Median: 7.02
    Std: 0.96
    P99.0: 8.80
  total_time:
    Mean: 232.79
    Median: 211.02
    Std: 208.11
    P99.0: 349.57

End-to-End Latency (ms):
  Mean: 2271.75
  Median: 2274.77
  Std: 656.52

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Nov 20, 2025

Documentation preview: https://vllm--29105.org.readthedocs.build/en/29105/

@mergify mergify bot added documentation Improvements or additions to documentation frontend multi-modality Related to multi-modality (#4194) performance Performance-related issues v1 labels Nov 20, 2025
@reaganjlee reaganjlee changed the title Add Multimodal Processor Benchmark [MM processor]: Add Multimodal Processor Benchmark Nov 21, 2025
@reaganjlee reaganjlee changed the title [MM processor]: Add Multimodal Processor Benchmark Add Multimodal Processor Benchmark Nov 26, 2025
@reaganjlee reaganjlee force-pushed the mm-processor branch 3 times, most recently from 14fbc67 to ff81b6c Compare November 26, 2025 23:22
Signed-off-by: Reagan Lee <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

https://github.com/vllm-project/vllm/blob/c4bab7adcbcd8c179e064ae9d8620426523c55d2/benchmarks/multimodal_processor.py#L335-L337
P1 Badge Clearing stats before collection wipes benchmark results

The multimodal benchmark clears the server’s MM processor stats registry right after the run and before fetching stats (should_clear_registry defaults to True). When running with VLLM_ENABLE_MM_PROCESSOR_STATS=1 and VLLM_SERVER_DEV_MODE=1 (required for the /clear_mm_processor_stats endpoint the client calls), this POST removes the just-recorded measurements so collect_mm_processor_stats immediately reads an empty registry and the benchmark prints zeros/warnings instead of the run’s timings. The registry should be cleared before starting a run to drop stale data, not immediately before retrieval.
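A minimal sketch of the ordering the review suggests, using hypothetical stand-in names (not the PR's actual registry or endpoint signatures): clear stale stats before the run starts, then collect without clearing.

```python
# Hypothetical stand-in for the server-side MM processor stats registry;
# this illustrates the clear-before-run ordering only, not the PR's code.
class FakeStatsRegistry:
    def __init__(self) -> None:
        self.records: list[float] = []

    def clear(self) -> None:
        self.records.clear()

    def record(self, ms: float) -> None:
        self.records.append(ms)


def run_benchmark(registry: FakeStatsRegistry) -> list[float]:
    # 1. Clear *before* the run, so stale data from earlier runs is dropped.
    registry.clear()
    # 2. The workload runs; the server records per-request timings.
    for ms in (204.1, 201.0, 319.1):
        registry.record(ms)
    # 3. Collect stats without clearing first, so the just-recorded
    #    measurements are still present when they are read back.
    return list(registry.records)
```

With this ordering, stale records injected before the run are discarded, while the run's own timings survive collection.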



@Isotr0py Isotr0py left a comment


Thanks for implementing this feature! I leave some initial comments. PTAL!

Comment on lines +77 to +80
@dataclass
class MultiModalProcessorTimingStats:
    """Per-request timing statistics for multimodal processor stages."""


Let's move the stats objects to a new file multimodal/stats.py.


I feel that making the stats container a global is a bit hacky. Perhaps it's better to have it as a member of InputProcessingContext.
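A rough sketch of that suggestion, with illustrative field names rather than the actual vLLM classes: each InputProcessingContext instance owns its own stats container instead of all requests sharing a module-level global.

```python
from dataclasses import dataclass, field


@dataclass
class MultiModalProcessorTimingStats:
    # Field names are illustrative; the PR tracks more stages than these.
    hf_processor_time_ms: float = 0.0
    hashing_time_ms: float = 0.0


@dataclass
class InputProcessingContext:
    # The stats container lives on the context, so separate contexts
    # (e.g. separate engine instances) do not share mutable global state.
    timing_stats: MultiModalProcessorTimingStats = field(
        default_factory=MultiModalProcessorTimingStats
    )
```

Using `default_factory` gives every context a fresh stats object, which avoids the shared-mutable-default pitfall a global container would reintroduce.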

Comment on lines +772 to +774
"VLLM_ENABLE_MM_PROCESSOR_STATS": lambda: bool(
    int(os.getenv("VLLM_ENABLE_MM_PROCESSOR_STATS", "0"))
),

Hmmm, our environment variables are quite bloated. Perhaps we can move this to MultimodalConfig?
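One possible shape for that migration, sketched with a hypothetical field name (not the actual MultimodalConfig API); the env var is kept only as a default-value fallback so existing workflows keep working:

```python
import os
from dataclasses import dataclass, field


def _stats_default() -> bool:
    # Fallback: honor the existing env var when no config value is given.
    return bool(int(os.getenv("VLLM_ENABLE_MM_PROCESSOR_STATS", "0")))


@dataclass
class MultimodalConfig:
    # Illustrative field; callers would set this via config instead of
    # exporting an environment variable.
    enable_mm_processor_stats: bool = field(default_factory=_stats_default)
```

Because the default is computed per instantiation, the flag can still be flipped via the environment during a transition period, then the env fallback can be dropped later.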

Comment on lines +53 to +72
def _get_instance_urls(
    args: argparse.Namespace | None,
    base_url: str | None,
) -> list[str]:
    """
    Get the instance URL for stats collection.

    Args:
        args: Command-line arguments (may be None)
        base_url: Base URL for the server (may be None)

    Returns:
        List containing a single instance URL
    """
    if base_url:
        return [base_url]

    host = getattr(args, "host", "127.0.0.1") if args else "127.0.0.1"
    port = getattr(args, "port", 8000) if args else 8000
    return [f"http://{host}:{port}"]

IIRC, all bench commands except vllm bench serve are using offline inference?

I think we can extend the mm processor benchmark functionality to vllm bench serve for online serving, and leave vllm bench multimodal-processor to use offline inference. 🤔

Comment on lines +420 to +431
if "mm_processor_stats" in result:
    print("\nMM Processor Timing (ms):")
    selected_percentiles = [
        float(p) for p in getattr(args, "metric_percentiles", "99").split(",")
    ]
    for stage, metrics in result["mm_processor_stats"].items():
        print(f"  {stage}:")
        print(f"    Mean: {metrics['mean']:.2f}")
        print(f"    Median: {metrics['median']:.2f}")
        print(f"    Std: {metrics['std']:.2f}")
        for p in selected_percentiles:
            print(f"    P{p}: {metrics.get(f'p{p}', 0.0):.2f}")

I think we can format the output's display as a table using pandas's dataframe. Current output is a bit long with duplicate fields:

  hf_processor_time:
    Mean: 204.14
    Median: 201.04
    Std: 48.78
    P99.0: 319.08
  hashing_time:
    Mean: 1.58
    Median: 1.94
    Std: 1.25
    P99.0: 5.57
  cache_lookup_time:
    Mean: 0.16
    Median: 0.14
    Std: 0.10
    P99.0: 0.44
  prompt_update_time:
    Mean: 7.30
    Median: 7.02
    Std: 0.96
    P99.0: 8.80
  total_time:
    Mean: 232.79
    Median: 211.02
    Std: 208.11
    P99.0: 349.57
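A sketch of the suggested tabular formatting, assuming pandas is available; the numbers are copied from the example output above, and the column names are illustrative:

```python
import pandas as pd

# Per-stage stats as the benchmark might collect them (values taken from
# the example run above; "p99" is a stand-in for the percentile columns).
stats = {
    "hf_processor_time": {"mean": 204.14, "median": 201.04, "std": 48.78, "p99": 319.08},
    "hashing_time": {"mean": 1.58, "median": 1.94, "std": 1.25, "p99": 5.57},
    "cache_lookup_time": {"mean": 0.16, "median": 0.14, "std": 0.10, "p99": 0.44},
    "prompt_update_time": {"mean": 7.30, "median": 7.02, "std": 0.96, "p99": 8.80},
    "total_time": {"mean": 232.79, "median": 211.02, "std": 208.11, "p99": 349.57},
}

# orient="index" makes each stage a row and each metric a column, so all
# stages print as one compact table instead of repeated key/value blocks.
df = pd.DataFrame.from_dict(stats, orient="index")
print(df.to_string(float_format="%.2f"))
```

This collapses the five repeated Mean/Median/Std/P99 blocks into a single five-row table with one column per metric.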
