 This document outlines the architectural design for vLLM-Omni.

 <p align="center">
-<img src="../source/architecture/omni-modality-model-architecture.png" alt="Omni-Modality Model Architecture" width="80%">
+<picture>
+<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/omni-modality-model-architecture.png">
+<img alt="Omni-Modality Model Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/omni-modality-model-architecture.png" width=55%>
+</picture>
 </p>

 # Goals
@@ -22,26 +25,41 @@ According to analysis for current popular open-source models, most of them have

 **DiT as a main structure, with AR as text encoder (e.g.: Qwen-Image)**
 A powerful image generation foundation model capable of complex text rendering and precise image editing.
+
 <p align="center">
-<img src="../source/architecture/ar-main-architecture.png" alt="Qwen-Image" width="30%">
+<picture>
+<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/ar-main-architecture.png">
+<img alt="Qwen-Image" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/ar-main-architecture.png" width=30%>
+</picture>
 </p>

 **AR as a main structure, with DiT as multi-modal generator (e.g. BAGEL)**
 A unified multimodal comprehension and generation model, with cot text output and visual generation.
+
 <p align="center">
-<img src="../source/architecture/dit-main-architecture.png" alt="Bagel" width="30%">
+<picture>
+<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/dit-main-architecture.png">
+<img alt="Bagel" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/dit-main-architecture.png" width=30%>
+</picture>
 </p>

 **AR+DiT (e.g. Qwen-Omni)**
 A natively end-to-end omni-modal LLM for multimodal inputs (text/image/audio/video...) and outputs (text/audio...).
+
 <p align="center">
-<img src="../source/architecture/ar-dit-main-architecture.png" alt="Qwen-Omni" width="30%">
+<picture>
+<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/ar-dit-main-architecture.png">
+<img alt="Qwen-Omni" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/ar-dit-main-architecture.png" width=30%>
+</picture>
 </p>

 # vLLM-Omni main architecture

 <p align="center">
-<img src="../source/architecture/vllm-omni-main-architecture.png" alt="vLLM-Omni Main Architecture" width="80%">
+<picture>
+<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/vllm-omni-main-architecture.png">
+<img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/vllm-omni-main-architecture.png" width=55%>
+</picture>
 </p>

 ## Key Components
@@ -89,7 +107,12 @@ vLLM-Omni is designed to be flexible and straightforward for users:

 If you use vLLM, then you know how to use vLLM-Omni from Day 0:

-![vLLM-Omni interface design](../source/architecture/vllm-omni-user-interface.png)
+<p align="center">
+<picture>
+<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/vllm-omni-user-interface.png">
+<img alt="vLLM-Omni interface design" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/vllm-omni-user-interface.png" width=55%>
+</picture>
+</p>

 Taking **Qwen3-Omni** as an example:
