
Support Bagel Model #726

Merged
hsliuustc0106 merged 29 commits into vllm-project:main from
princepride:new-bagel-model-stage
Jan 22, 2026

Conversation


@princepride princepride commented Jan 10, 2026

Purpose

This PR enables stage-based deployment for the Bagel model, aligning it with the vllm-omni architecture. Specific changes include:

  1. Added Stage Configuration: Introduced vllm_omni/model_executor/stage_configs/bagel.yaml to define the multi-stage pipeline (Thinker/AR stage + Diffusion/DiT stage).
  2. Refactored Model Structure: Cleaned up the bagel model implementation by removing monolithic files and enabling specialized components for each stage:
    • Stage 0 (Thinker): Uses BagelForConditionalGeneration (AR mode) for multimodal understanding and text generation.
    • Stage 1 (Diffusion): Uses BagelForConditionalGeneration (likely wrapping the diffusion pipeline) for image generation.
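The two-stage split described above can be pictured with a hypothetical config sketch in the spirit of `vllm_omni/model_executor/stage_configs/bagel.yaml`; the field names below are illustrative, not the file's actual schema.

```yaml
# Hypothetical sketch of a two-stage Bagel pipeline config (illustrative schema)
stages:
  - name: thinker              # Stage 0: AR multimodal understanding / text generation
    model_arch: BagelForConditionalGeneration
    mode: ar
  - name: diffusion            # Stage 1: DiT image generation
    model_arch: BagelForConditionalGeneration
    mode: diffusion
connector:
  type: shared_memory          # or mooncake, per the test plans in this PR
```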

KV Cache Transfer Design

sequenceDiagram
    participant Sched as Stage 0 AR Scheduler
    participant AR_Runner as Stage 0 AR GPU Runner
    participant Conn as OmniConnector
    participant DiT_Runner as Stage 1 DiT GPU Runner

    Note over Sched, AR_Runner: Stage 0 (AR / LLM Phase)
    
    Sched->>Sched: 1. Trigger Transfer
    Note right of Sched: e.g. prefill done
    
    Sched->>AR_Runner: 2. Signal: Send Block IDs
    
    AR_Runner->>AR_Runner: 3. Extract KV (GPU -> Host)
    AR_Runner->>Conn: 4. Put KV Cache (IPC/Network)
    
    Note over DiT_Runner: Stage 1 (Diffusion Phase)
    
    DiT_Runner->>Conn: 5. Waiting / Get KV Data
    Conn-->>DiT_Runner: Return KV Data
    DiT_Runner->>DiT_Runner: 6. Load KV to GPU
    DiT_Runner->>DiT_Runner: 7. Run Diffusion

Online Inference

Test Plan

FLASHINFER_DISABLE_VERSION_CHECK=1 vllm serve "../models/BAGEL-7B-MoT" --omni --port 8091
FLASHINFER_DISABLE_VERSION_CHECK=1 python examples/online_serving/bagel/openai_chat_client.py --prompt "A cute cat" --modality text2img

Result

(result image)

Text2Image (Stage 0, Stage 1)

Test Plan

SharedMemory:

FLASHINFER_DISABLE_VERSION_CHECK=1 python3 examples/offline_inference/bagel/end2end.py --prompts "A cute cat" --modality text2img

Mooncake:

# primary node

# if you use mooncake SSD storage
mkdir -p ./mc_storage #optional 

mooncake_master \
  --rpc_port=50051 \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080 \
  --metrics_port=9003 \
  --root_fs_dir=./mc_storage/ \
  --cluster_id=mc-local-1 &

# vllm-omni server

FLASHINFER_DISABLE_VERSION_CHECK=1 python3 examples/offline_inference/bagel/end2end.py --prompts "A cute cat" --modality text2img --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml

Result

(result image)

Image2Text (Only Stage 0)

Test Plan

Image used for test:

(test image)
FLASHINFER_DISABLE_VERSION_CHECK=1 python3 examples/offline_inference/bagel/end2end.py --prompts "Please describe this image." --modality img2text --image-path women.jpg

Result

The image depicts a person with short, dark hair styled in a neat manner. The individual is wearing a bright red garment that appears to be a jacket or coat, which stands out against the background. The background consists of small, irregularly shaped stones or pebbles, creating a textured surface. The person's hand is visible, resting on their face, and they are wearing a white buttoned cuff on their sleeve. The overall composition suggests a casual yet stylish appearance.

Text2Text (Only Stage 0)

Test Plan

FLASHINFER_DISABLE_VERSION_CHECK=1 python3 examples/offline_inference/bagel/end2end.py --prompts "What is the capital of France?" --modality text2text

Result

The capital of France is Paris.

Image2Image (directly using OmniDiffusion, because Stage 0 cannot currently be skipped)

Test Plan

Image used for test:

(test image)
FLASHINFER_DISABLE_VERSION_CHECK=1 python3 examples/offline_inference/bagel/end2end.py --prompts "Let the woman wear a blue dress" --modality img2img --image-path women.jpg

Result

(result image)

Limitation

  • Multi-turn dialogue is not supported.
  • Batch inference is not supported.
  • Thinking mode is not supported.
  • RDMA-based KV cache transfer still needs to be supported in the future.
  • Cache-Dit is not supported.
  • The AR model does not support the VAE module.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c26b10bb49


@@ -0,0 +1,102 @@
# Stage config for running Bagel (AR only)
Collaborator

regarding the yaml design:

  1. What changes do you think we need to make to the yaml in order to support DP for omni stages? Currently, I think this is related to a DP coordinator and device mesh.
  2. What do we need to keep in this hidden yaml, and what do we need to move to entrypoints and expose via the CLI?

@hsliuustc0106

Please attach your design doc using this template in your related RFC :)


@Gaohan123 Gaohan123 left a comment


Please see the comments for your reference. BTW, why is the default usage the mooncake interface? Can we just use vllm serve as the default init frontend?

client = OmniDiffusion(model=model_name)

generate_kwargs = {
"prompt": args.prompts,
Collaborator

Later we should unify these kwargs into SamplingParams like classes
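One way the unification suggested above could look, sketched here as an assumption (the field names and `OmniGenerationParams` class are illustrative, not the actual vllm-omni API): fold the ad-hoc `generate_kwargs` dict into a `SamplingParams`-like dataclass.

```python
# Hypothetical SamplingParams-like container for the kwargs used in the example
# client; names and defaults are illustrative.
from dataclasses import dataclass, asdict


@dataclass
class OmniGenerationParams:
    prompt: str
    modality: str = "text2img"
    num_inference_steps: int = 50
    guidance_scale: float = 4.0

    def as_generate_kwargs(self) -> dict:
        # Bridge back to the current kwargs-based interface.
        return asdict(self)
```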



@dataclass
class KVCacheTransferData:
Collaborator

What is the boundary between the KV transfer here and the current hidden-states transfer done by the stage worker?

# 3. Return False means "Do NOT stop the request" -> Continue Decoding
return False

elif criteria_type == "special_token":
Collaborator

For criteria_type, how generalizable is this to other models in the future?
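One possible answer to the generalizability question, sketched as an assumption rather than the PR's actual design: replace the `if/elif` chain on `criteria_type` with a registry of named predicates, so future models register their own stop criteria.

```python
# Hypothetical registry-based dispatch for stop criteria; all names here are
# illustrative, not the PR's code.
from typing import Callable

StopCriterion = Callable[[list[int]], bool]  # token ids -> should stop?
_CRITERIA: dict[str, StopCriterion] = {}


def register_criterion(name: str):
    def deco(fn: StopCriterion) -> StopCriterion:
        _CRITERIA[name] = fn
        return fn
    return deco


@register_criterion("special_token")
def stop_on_special_token(token_ids: list[int], special_id: int = 2) -> bool:
    # Stop once the most recent token is the (assumed) special stop id.
    return bool(token_ids) and token_ids[-1] == special_id


def should_stop(criteria_type: str, token_ids: list[int]) -> bool:
    crit = _CRITERIA.get(criteria_type)
    # Unknown criteria fall through to "do NOT stop", matching the diff above.
    return crit(token_ids) if crit else False
```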


# Ensure scheduled_new_reqs carry omni-specific payloads
# (e.g., additional_information)
def schedule(self) -> SchedulerOutput: # type: ignore[override]
Collaborator

Does it support continuous chunked KV transfer? For chunked communication and computation across stages, can that be unified into this implementation?

@princepride
Collaborator Author

Because the KV cache transmission is currently implemented through intrusive changes in the GPU AR runner and the GPU diffusion worker, the relevant code needs to be abstracted (an abstracted runner should support KV extraction and reception; maybe we need to add a runner in diffusion?). Right now the extracted KV cache data format must conform exactly to Bagel's format, which also needs to be decoupled, and the related unit tests need updating as well.
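The abstraction described above might look like a small runner interface that both the AR GPU runner and a future diffusion runner implement; the class and method names below are hypothetical.

```python
# Hypothetical runner interface for model-agnostic KV extraction/reception.
from abc import ABC, abstractmethod
from typing import Any


class KVTransferRunner(ABC):
    """Contract both the AR runner and a diffusion runner could satisfy."""

    @abstractmethod
    def extract_kv(self, block_ids: list[int]) -> Any:
        """Copy the KV cache for the given blocks off the GPU."""

    @abstractmethod
    def receive_kv(self, kv_data: Any) -> None:
        """Load externally produced KV data onto this runner's device."""
```

With such an interface, the Bagel-specific KV format would live behind concrete subclasses instead of being baked into the workers.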

@hsliuustc0106
Collaborator

shall we wait for #800 to be merged and then refactor this PR?

@princepride
Collaborator Author

Yes, we discussed with jiangyun yesterday and decided to merge the diffusion model runner first.

@princepride
Collaborator Author

@hsliuustc0106 @Gaohan123 @ZJY0516 I rebased it and added an online inference example. Please review it.😊


@ZJY0516 ZJY0516 left a comment


This PR is quite large now

gen_input_vae[k] = v.to(self.device)

# VAE usually needs bfloat16 to match the model dtype, specifically for encode
with torch.autocast(device_type="cuda", enabled=self.device.type == "cuda", dtype=torch.bfloat16):
Collaborator

Why do we need to hardcode "cuda" and bf16 here? I remember we load the model in bf16 by default.
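One device-agnostic alternative to the hardcoding questioned above, sketched under assumptions (the helper name is hypothetical, and a string stands in for the real torch dtype to keep the sketch torch-free): derive the autocast arguments from the runner's device and the model's loaded dtype.

```python
# Hypothetical helper: build torch.autocast(...) kwargs from the runner's
# device instead of hardcoding "cuda"/bfloat16.
def autocast_kwargs(device_type: str, dtype="bfloat16") -> dict:
    # In real code `dtype` would be the model's loaded torch dtype
    # (e.g. torch.bfloat16); a string stands in here for the sketch.
    supported = {"cuda", "cpu"}  # device types assumed to support autocast
    return {
        "device_type": device_type,
        "enabled": device_type in supported,
        "dtype": dtype,
    }
```

Usage would then be something like `with torch.autocast(**autocast_kwargs(self.device.type, self.model_dtype)):`, keeping the worker free of device-specific literals.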

if torch.is_tensor(v):
gen_input_img[k] = v.to(self.device)

with torch.autocast(device_type="cuda", enabled=self.device.type == "cuda", dtype=torch.bfloat16):
Collaborator

same

logger.error(f"SharedMemoryConnector get failed for req {request_id}: {e}")
return None

if "shm" in metadata:
Collaborator

Am I missing something? Why is this logic placed after the return None? It looks like dead code.
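The fix the reviewer is pointing at can be sketched as follows (function and parameter names are illustrative, not the connector's actual code): the `"shm" in metadata` branch must run before any early return, e.g. inside the try block rather than after the except handler's `return None`.

```python
# Illustrative control flow: check metadata BEFORE any early return, so the
# shm branch is reachable.
def get_from_shm(metadata: dict, shm_store: dict):
    try:
        if "shm" in metadata:                 # reachable: checked first
            return shm_store.get(metadata["shm"])
        return None
    except Exception:
        # log and bail out; code placed after this return would be dead
        return None
```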

princepride and others added 6 commits January 21, 2026 02:05
Co-authored-by: wzliu <[email protected]>
Signed-off-by: princepride <[email protected]>
Signed-off-by: princepride <[email protected]>
Signed-off-by: princepride <[email protected]>
Signed-off-by: princepride <[email protected]>
@princepride princepride force-pushed the new-bagel-model-stage branch from f0382e6 to 2655414 Compare January 21, 2026 02:14
Signed-off-by: princepride <[email protected]>
@princepride
Collaborator Author

@hsliuustc0106 @ZJY0516 @natureofnature Can we merge now?

@hsliuustc0106
Collaborator

@hsliuustc0106 @ZJY0516 @natureofnature Can we merge now?

let me run the ci now

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Jan 21, 2026
Signed-off-by: princepride <[email protected]>
@david6666666 david6666666 added this to the v0.14.0rc1 milestone Jan 22, 2026
Signed-off-by: princepride <[email protected]>
@hsliuustc0106 hsliuustc0106 merged commit 7f821be into vllm-project:main Jan 22, 2026
6 of 7 checks passed
@princepride princepride mentioned this pull request Feb 21, 2026