[Feature] support multimodal inputs with multiple requests #76
tzhouam merged 6 commits into vllm-project:main from
Conversation
@@ -7,20 +7,15 @@
import numpy as np
Please refactor the models folder so each model has its own subfolder; all qwen2_5_omni-related files should be placed in an independent folder.
@@ -0,0 +1,23 @@
import torch
Could this vision.py be merged into utils.py?
The file aligns with the corresponding path in vllm main.
return {maybe_prefix(prefix, name) for name in weights}

def split_list_into_ranges(lst: torch.Tensor, interval: int) -> list[list[int]]:
def split_list_into_ranges_fast(lst: torch.Tensor, interval: int) -> list[list[int]]:
    if lst.numel() == 0:
        return []
    # Move to CPU and convert to a Python list once (large speedup):
    # calling .item() inside a loop is very slow.
    data_list = lst.detach().cpu().tolist()
    # Compute the max on the tensor (tensor max is fast enough).
    max_val = int(torch.max(lst).item())
    # Pre-allocate buckets.
    ranges: list[list[int]] = [[] for _ in range((max_val // interval) + 1)]
    for num in data_list:
        index = int(num // interval)
        ranges[index].append(num)
    return ranges
Fixed. Thanks for the advice.
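For reference, a quick standalone sanity check of the suggested bucketing helper (the function body is copied from the suggestion above; the sample tensor is illustrative):

```python
import torch


def split_list_into_ranges_fast(lst: torch.Tensor, interval: int) -> list[list[int]]:
    if lst.numel() == 0:
        return []
    # One bulk CPU transfer instead of per-element .item() calls.
    data_list = lst.detach().cpu().tolist()
    max_val = int(torch.max(lst).item())
    # Pre-allocate one bucket per interval up to the max value.
    ranges: list[list[int]] = [[] for _ in range((max_val // interval) + 1)]
    for num in data_list:
        ranges[int(num // interval)].append(num)
    return ranges


# Values 0-4 land in bucket 0, 5-9 in bucket 1, 10-14 in bucket 2.
print(split_list_into_ranges_fast(torch.tensor([0, 3, 5, 9, 12]), 5))
# → [[0, 3], [5, 9], [12]]
```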
return OmniAsyncMultiModalContentParser(self)

class OmniAsyncMultiModalContentParser(AsyncMultiModalContentParser):
Comments from Gemini:
There is a critical performance issue here: your _extract_audio_from_video_async method is defined as async, but it performs blocking synchronous I/O (file downloads, librosa.load, file writes).
In an async framework like vLLM, this will freeze the entire inference engine (heartbeats, other requests, token generation) while librosa processes the video.
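One common fix is to offload the blocking work to a worker thread with asyncio.to_thread so the event loop stays responsive. A minimal sketch (not the PR's actual code; load_audio_blocking is a hypothetical stand-in for the download/librosa work the reviewer describes):

```python
import asyncio


def load_audio_blocking(video_path: str) -> bytes:
    # Hypothetical stand-in for the blocking steps the reviewer lists:
    # downloading the file, librosa.load, writing temp files, etc.
    return b"audio-bytes-for:" + video_path.encode()


async def extract_audio_from_video_async(video_path: str) -> bytes:
    # Run the synchronous I/O in a thread so other coroutines
    # (heartbeats, other requests, token generation) keep running.
    return await asyncio.to_thread(load_audio_blocking, video_path)


print(asyncio.run(extract_audio_from_video_async("clip.mp4")))
# → b'audio-bytes-for:clip.mp4'
```

The same pattern applies via loop.run_in_executor on older Python versions; either way, the async method never calls librosa directly on the event loop.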
Signed-off-by: Gaohan123 <[email protected]>
Force-pushed from 04898c5 to fb992d3
lgtm, merging
Signed-off-by: zjy0516 <[email protected]>
Purpose
This PR adds multimodal input support for Qwen2.5-Omni. Supported modalities include text, image, audio, video, and their combinations. The implementation partly follows PR #57 in vllm-omni and PR #26634 in the vllm main branch.
Test Plan
Both offline and online inference were tested. First, follow the README.md in the repo root to complete the installation.
Online inference
Follow README.md in examples/online_serving
Offline inference
Follow README.md in examples/offline_inference/qwen2_5_omni
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.