Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c26b10bb49
Force-pushed 596fce9 to 1b69840 (Compare)
```yaml
@@ -0,0 +1,102 @@
# Stage config for running Bagel (AR only)
```
Regarding the YAML design:
- What changes do you think we need to make to the YAML in order to support DP for omni stages? Currently, I think this is related to a DP coordinator and device mesh.
- What do we need to keep in this hidden YAML, and what should we move to entrypoints and expose to the CLI?
Please attach your design doc using this template in your related RFC :)
```python
client = OmniDiffusion(model=model_name)
...
generate_kwargs = {
    "prompt": args.prompts,
```
Later we should unify these kwargs into `SamplingParams`-like classes.
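Unifying the loose `generate_kwargs` dict into a params object might look like the following sketch. The class name `OmniSamplingParams` and all of its fields are assumptions for illustration, not the repo's actual API:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional

@dataclass
class OmniSamplingParams:
    # Hypothetical container replacing the ad-hoc generate_kwargs dict.
    prompt: str = ""
    num_inference_steps: int = 50
    guidance_scale: float = 7.5
    seed: Optional[int] = None
    # Escape hatch for model-specific options not yet promoted to fields.
    extra: Dict[str, Any] = field(default_factory=dict)

    def as_kwargs(self) -> Dict[str, Any]:
        # Flatten back to the kwargs shape the current call sites expect.
        kwargs = asdict(self)
        kwargs.update(kwargs.pop("extra"))
        return kwargs
```

Call sites could then pass `OmniSamplingParams(prompt=args.prompts).as_kwargs()` while the object itself stays validatable and versionable.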
```python
@dataclass
class KVCacheTransferData:
```
What is the boundary between KV transfer here and the current hidden-states transfer done by the stage worker?
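The quoted diff shows only the class header. As a rough illustration of what such a transfer payload could carry, here is a sketch; every field below is a guess, not the PR's actual schema:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class KVCacheTransferDataSketch:
    # Hypothetical payload for moving KV blocks between stages.
    request_id: str
    layer_names: List[str]
    block_ids: List[int]
    # Serialized tensors (e.g. host buffers) keyed by layer name.
    kv_buffers: Dict[str, Any] = field(default_factory=dict)

    def num_blocks(self) -> int:
        return len(self.block_ids)
```

Keeping the payload a plain dataclass (rather than raw tensors in a dict) would make the AR/DiT boundary explicit and testable.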
```python
    # 3. Return False means "Do NOT stop the request" -> Continue Decoding
    return False
...
elif criteria_type == "special_token":
```
How generalizable is `criteria_type` to other models in the future?
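One way to keep `criteria_type` extensible beyond Bagel is a small registry keyed by criteria name, so new models register predicates instead of growing the `elif` chain. A hypothetical sketch (names and config keys are invented):

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping criteria_type -> stop predicate.
STOP_CRITERIA: Dict[str, Callable[[List[int], dict], bool]] = {}

def register_criteria(name: str):
    def deco(fn):
        STOP_CRITERIA[name] = fn
        return fn
    return deco

@register_criteria("special_token")
def stop_on_special_token(token_ids: List[int], cfg: dict) -> bool:
    # Stop once any model-specific special token has been generated.
    return bool(set(token_ids) & set(cfg.get("special_token_ids", [])))

def should_stop(criteria_type: str, token_ids: List[int], cfg: dict) -> bool:
    # Unknown criteria default to "do not stop" (False continues decoding),
    # matching the quoted code's convention.
    fn = STOP_CRITERIA.get(criteria_type)
    return fn(token_ids, cfg) if fn else False
```

A new model would then ship its own `@register_criteria("...")` predicate without touching the scheduler.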
```python
# Ensure scheduled_new_reqs carry omni-specific payloads
# (e.g., additional_information)
def schedule(self) -> SchedulerOutput:  # type: ignore[override]
```
Does it support continuous, chunked KV transfer? Can chunked communication and computation across stages be unified in this implementation?

Because the KV cache transmission is currently implemented through intrusive changes in the GPU AR runner and the GPU diffusion worker, the relevant code needs to be abstracted (the abstracted runner should support KV cache extraction and reception; maybe we need to add a runner in diffusion?). Right now the extracted KV cache data format must fully conform to Bagel's format, which also needs to be decoupled, and the related unit tests need modification accordingly.
Shall we wait for #800 to be merged and then refactor this PR?

Yes, we discussed this with jiangyun yesterday and decided to merge the diffusion model runner first.
Force-pushed bf7cfe2 to 1aee1d5 (Compare)
@hsliuustc0106 @Gaohan123 @ZJY0516 I rebased it and added an online inference example. Please review it. 😊
ZJY0516 left a comment

This PR is quite large now.
```python
        gen_input_vae[k] = v.to(self.device)
...
# VAE needs bfloat16 to match model strings usually, specifically encode
with torch.autocast(device_type="cuda", enabled=self.device.type == "cuda", dtype=torch.bfloat16):
```
Why do we need to hardcode "cuda" and bf16 here? I remember we load the model in bf16 by default.
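If the intent is simply "match whatever the model was loaded as," the autocast context could derive the device type from the module instead of hardcoding it. A sketch (the helper name is invented; it assumes the module's parameters reflect the load device):

```python
import torch

def autocast_like(module: torch.nn.Module, dtype: torch.dtype = torch.bfloat16):
    # Derive the device type from the module itself rather than
    # hardcoding "cuda"; autocast stays disabled off-GPU, mirroring
    # the enabled=... guard in the quoted code.
    device_type = next(module.parameters()).device.type
    return torch.autocast(
        device_type=device_type,
        dtype=dtype,
        enabled=device_type == "cuda",
    )
```

The VAE encode would then read `with autocast_like(self.vae): ...` and keep working unchanged on CPU or other backends.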
```python
if torch.is_tensor(v):
    gen_input_img[k] = v.to(self.device)
...
with torch.autocast(device_type="cuda", enabled=self.device.type == "cuda", dtype=torch.bfloat16):
```
```python
    logger.error(f"SharedMemoryConnector get failed for req {request_id}: {e}")
    return None
...
if "shm" in metadata:
```
Am I missing something? Why is this logic placed after the `return None`? It looks like dead code.
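If the `"shm"` branch really is unreachable, the likely fix is to check the metadata before any early return. The shape below is purely illustrative; the function name, keys, and fallback path are all guesses, not the connector's real code:

```python
def get_payload(metadata: dict, request_id: str):
    # Check the shared-memory path BEFORE any early return so it is reachable.
    if "shm" in metadata:
        return metadata["shm"]
    try:
        return metadata["inline"]
    except KeyError as e:
        # Mirrors the logged failure path in the quoted diff.
        print(f"SharedMemoryConnector get failed for req {request_id}: {e}")
        return None
```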
Co-authored-by: wzliu <[email protected]> Signed-off-by: princepride <[email protected]>
Force-pushed f0382e6 to 2655414 (Compare)
@hsliuustc0106 @ZJY0516 @natureofnature Can we merge now?

Let me run the CI now.
Purpose

This PR enables stage-based deployment for the Bagel model, aligning it with the vllm-omni architecture. Specific changes include:
- Added `vllm_omni/model_executor/stage_configs/bagel.yaml` to define the multi-stage pipeline (Thinker/AR stage + Diffusion/DiT stage).
- Refactored the `bagel` model implementation by removing monolithic files and enabling specialized components for each stage:
  - `BagelForConditionalGeneration` (AR mode) for multimodal understanding and text generation.
  - `BagelForConditionalGeneration` (likely wrapping the diffusion pipeline) for image generation.

KV Cache Transfer Design
```mermaid
sequenceDiagram
    participant Sched as Stage 0 AR Scheduler
    participant AR_Runner as Stage 0 AR GPU Runner
    participant Conn as OmniConnector
    participant DiT_Runner as Stage 1 DiT GPU Runner
    Note over Sched, AR_Runner: Stage 0 (AR / LLM Phase)
    Sched->>Sched: 1. Trigger Transfer
    Note right of Sched: e.g. prefill done
    Sched->>AR_Runner: 2. Signal: Send Block IDs
    AR_Runner->>AR_Runner: 3. Extract KV (GPU -> Host)
    AR_Runner->>Conn: 4. Put KV Cache (IPC/Network)
    Note over DiT_Runner: Stage 1 (Diffusion Phase)
    DiT_Runner->>Conn: 5. Waiting / Get KV Data
    Conn-->>DiT_Runner: Return KV Data
    DiT_Runner->>DiT_Runner: 6. Load KV to GPU
    DiT_Runner->>DiT_Runner: 7. Run Diffusion
```

Online Inference
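The put/get handshake in steps 4-5 of the KV transfer diagram can be mimicked with a toy in-process connector. This is illustrative only; `OmniConnector`'s real API is not shown in this thread, so the class and method names below are invented:

```python
import queue

class ToyConnector:
    # In-process stand-in for step 4 (put) and step 5 (blocking get).
    def __init__(self) -> None:
        self._q: "queue.Queue[dict]" = queue.Queue()

    def put_kv(self, payload: dict) -> None:
        # AR runner publishes its extracted KV payload.
        self._q.put(payload)

    def get_kv(self, timeout: float = 5.0) -> dict:
        # DiT runner blocks here until the AR runner has published.
        return self._q.get(timeout=timeout)

# AR side publishes; DiT side consumes.
conn = ToyConnector()
conn.put_kv({"request_id": "req-0", "block_ids": [3, 7]})
payload = conn.get_kv()
```

A real connector would cross process or host boundaries (shared memory, Mooncake, etc.), but the blocking-consumer shape is the same.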
Test Plan

```shell
FLASHINFER_DISABLE_VERSION_CHECK=1 vllm serve "../models/BAGEL-7B-MoT" --omni --port 8091
FLASHINFER_DISABLE_VERSION_CHECK=1 python examples/online_serving/bagel/openai_chat_client.py --prompt "A cute cat" --modality text2img
```

Result
Details
Text2Image (Stage 0, Stage 1)
Test Plan
SharedMemory:
Mooncake:
Result
Details
Image2Text (Only Stage 0)
Test Plan
Image used for test:
Details
Result
Text2Text (Only Stage 0)
Test Plan
Result
Image2Image (Directly using OmniDiffusion, because we currently can't skip Stage 0)

Test Plan
Image used for test:
Details
Result
Details
Limitation