[Feature] support multimodal inputs with multiple requests #76
tzhouam merged 6 commits into vllm-project:main from
Conversation
@@ -7,20 +7,15 @@
import numpy as np
Please refactor the models folder so each model has its own subfolder; all qwen2_5_omni-related files should be placed in an independent folder.
@@ -0,0 +1,23 @@
import torch
Could this vision.py be merged into utils.py?
The file aligns with the corresponding path in vllm main.
return {maybe_prefix(prefix, name) for name in weights}

def split_list_into_ranges(lst: torch.Tensor, interval: int) -> list[list[int]]:
def split_list_into_ranges_fast(lst: torch.Tensor, interval: int) -> list[list[int]]:
    if lst.numel() == 0:
        return []
    # Move to CPU and convert to a Python list once (large speedup):
    # calling .item() inside a loop is very slow.
    data_list = lst.detach().cpu().tolist()
    # Compute the max on the tensor (tensor max is fast enough).
    max_val = int(torch.max(lst).item())
    # Pre-allocate buckets.
    ranges: list[list[int]] = [[] for _ in range((max_val // interval) + 1)]
    for num in data_list:
        index = int(num // interval)
        ranges[index].append(num)
    return ranges
Fixed. Thanks for the advice.
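For reference, a quick standalone sanity check of the suggested bucketing helper (the function body is copied from the suggestion above; the sample tensor is illustrative):

```python
import torch


def split_list_into_ranges_fast(lst: torch.Tensor, interval: int) -> list[list[int]]:
    if lst.numel() == 0:
        return []
    # One bulk CPU transfer instead of per-element .item() calls.
    data_list = lst.detach().cpu().tolist()
    max_val = int(torch.max(lst).item())
    # Pre-allocate one bucket per interval up to the max value.
    ranges: list[list[int]] = [[] for _ in range((max_val // interval) + 1)]
    for num in data_list:
        ranges[int(num // interval)].append(num)
    return ranges


# Values 0-4 land in bucket 0, 5-9 in bucket 1, 10-14 in bucket 2.
print(split_list_into_ranges_fast(torch.tensor([0, 3, 5, 9, 12]), 5))
# → [[0, 3], [5, 9], [12]]
```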
return OmniAsyncMultiModalContentParser(self)

class OmniAsyncMultiModalContentParser(AsyncMultiModalContentParser):
Comments from Gemini:
There is a critical performance issue here: your _extract_audio_from_video_async method is defined as async, but it performs blocking synchronous I/O (file downloads, librosa.load, file writes).
In an async framework like vLLM, this will freeze the entire inference engine (heartbeats, other requests, token generation) while librosa processes the video.
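One common fix is to offload the blocking work to a worker thread with asyncio.to_thread so the event loop stays responsive. A minimal sketch (not the PR's actual code; load_audio_blocking is a hypothetical stand-in for the download/librosa work the reviewer describes):

```python
import asyncio


def load_audio_blocking(video_path: str) -> bytes:
    # Hypothetical stand-in for the blocking steps the reviewer lists:
    # downloading the file, librosa.load, writing temp files, etc.
    return b"audio-bytes-for:" + video_path.encode()


async def extract_audio_from_video_async(video_path: str) -> bytes:
    # Run the synchronous I/O in a thread so other coroutines
    # (heartbeats, other requests, token generation) keep running.
    return await asyncio.to_thread(load_audio_blocking, video_path)


print(asyncio.run(extract_audio_from_video_async("clip.mp4")))
# → b'audio-bytes-for:clip.mp4'
```

The same pattern applies via loop.run_in_executor on older Python versions; either way, the async method never calls librosa directly on the event loop.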
Signed-off-by: Gaohan123 <[email protected]>
Force-pushed from 04898c5 to fb992d3
lgtm, merging
Signed-off-by: zjy0516 <[email protected]>
Purpose
This PR adds multimodal input support for Qwen2.5-Omni. Supported modalities include text, image, audio, video, and their combinations. The implementation partly follows PR #57 in vllm-omni and PR #26634 in the vllm main branch.
Test Plan
Both offline and online inference were tested. First, follow the README.md in the repo root to complete the installation.
Online inference
Follow README.md in examples/online_serving
Offline inference
Follow README.md in examples/offline_inference/qwen2_5_omni
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.