
[Feature] support multimodal inputs with multiple requests#76

Merged
tzhouam merged 6 commits into vllm-project:main from Gaohan123:multimodal_input
Nov 24, 2025

Conversation

@Gaohan123 (Collaborator)

Purpose

This PR supports multimodal inputs for Qwen2.5-Omni. The supported modalities include text, image, audio, video, and their combinations. The implementation partly follows PR #57 in vllm-omni and PR #26634 in the vllm main branch.

Test Plan

Here I test both offline and online inference. First, follow the README.md in the repo root to finish installation.

Online inference

vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091

Follow README.md in examples/online_serving
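For reference, a minimal sketch of what a multimodal request to the server above might look like through the OpenAI-compatible `/v1/chat/completions` endpoint. The content-part field names (`text`, `image_url`) follow the common convention for vLLM's OpenAI-compatible server, but the authoritative request shape is in examples/online_serving; the image URL here is a placeholder.

```python
import json

# Hypothetical request body for the OpenAI-compatible chat endpoint;
# field names are assumptions based on the usual multimodal convention.
payload = {
    "model": "Qwen/Qwen2.5-Omni-7B",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.jpg"},
                },
            ],
        }
    ],
}

# Serialize for an HTTP POST to http://localhost:8091/v1/chat/completions
body = json.dumps(payload)
```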

Offline inference

Follow README.md in examples/offline_inference/qwen2_5_omni

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@Gaohan123 Gaohan123 requested review from ZJY0516 and tzhouam and removed request for tzhouam November 21, 2025 03:18
@@ -7,20 +7,15 @@
import numpy as np
Collaborator

please refactor the models folder for each model, all qwen2_5_omni related files should be placed in an independent folder

Collaborator Author

fixed

@@ -0,0 +1,23 @@
import torch
Collaborator

could this vision.py be merged into utils.py?

Collaborator Author

The file placement aligns with the corresponding path in the vllm main branch.

return {maybe_prefix(prefix, name) for name in weights}


def split_list_into_ranges(lst: torch.Tensor, interval: int) -> list[list[int]]:
Collaborator

@hsliuustc0106 Nov 23, 2025

def split_list_into_ranges_fast(lst: torch.Tensor, interval: int) -> list[list[int]]:
    if lst.numel() == 0:
        return []
    
    # Move to CPU and convert to list once (High Speedup)
    # using .item() inside a loop is very slow.
    data_list = lst.detach().cpu().tolist()
    
    # Calculate max on the list or tensor (Tensor max is fast enough)
    max_val = int(torch.max(lst).item())
    
    # Pre-allocate buckets
    ranges: list[list[int]] = [[] for _ in range((max_val // interval) + 1)]
    
    for num in data_list:
        index = int(num // interval)
        ranges[index].append(num)
        
    return ranges
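To make the bucketing behavior concrete, here is a plain-list sketch of the approach suggested above. The function name and the explicit `max_val` parameter are illustrative, not the PR's actual code; the point is that each value lands in the bucket for its half-open interval `[k*interval, (k+1)*interval)`.

```python
def split_into_ranges(values: list[int], interval: int, max_val: int) -> list[list[int]]:
    # Pre-allocate one bucket per interval so appends are O(1).
    ranges: list[list[int]] = [[] for _ in range((max_val // interval) + 1)]
    for num in values:
        ranges[num // interval].append(num)
    return ranges

# 0 and 3 fall in [0, 5), 5 in [5, 10), 11 in [10, 15)
buckets = split_into_ranges([0, 3, 5, 11], interval=5, max_val=11)
```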

Collaborator Author

Fixed. Thanks for the advice.

return OmniAsyncMultiModalContentParser(self)


class OmniAsyncMultiModalContentParser(AsyncMultiModalContentParser):
Collaborator

comments from gemini:

There is a critical performance issue here: the _extract_audio_from_video_async method is defined as async, but it performs blocking synchronous I/O (file downloads, librosa.load, file writes).

In an async framework like vLLM, this will freeze the entire inference engine (heartbeats, other requests, token generation) while librosa processes the video.
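One common way to address this kind of issue (a sketch of the pattern, not the PR's actual fix; function names are illustrative) is to push the blocking work onto a worker thread with `asyncio.to_thread`, so the event loop stays responsive:

```python
import asyncio
import time


def _blocking_audio_extract(video_path: str) -> str:
    # Stand-in for the blocking work (file download, librosa.load, file writes).
    time.sleep(0.01)
    return f"audio-from:{video_path}"


async def extract_audio_from_video_async(video_path: str) -> str:
    # asyncio.to_thread runs the blocking call in a thread-pool worker,
    # so heartbeats, other requests, and token generation keep running.
    return await asyncio.to_thread(_blocking_audio_extract, video_path)


result = asyncio.run(extract_audio_from_video_async("clip.mp4"))
```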

Collaborator Author

fixed.

Signed-off-by: Gaohan123 <[email protected]>
@tzhouam (Collaborator)

tzhouam commented Nov 24, 2025

lgtm, merging

@tzhouam tzhouam merged commit 598b4dd into vllm-project:main Nov 24, 2025
2 checks passed
ZJY0516 pushed a commit that referenced this pull request Nov 25, 2025
princepride pushed a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026