Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c26b10bb49
Force-pushed 596fce9 to 1b69840 (Compare)
```yaml
@@ -0,0 +1,102 @@
# Stage config for running Bagel (AR only)
```
Regarding the YAML design:
- What changes do you think we need to make to the YAML in order to support DP for omni stages? Currently, I think this is related to a DP coordinator and device mesh.
- What do we need to keep in this hidden YAML, and what should we move to entrypoints and expose to the CLI?
Please attach your design doc using this template in your related RFC :)
```python
client = OmniDiffusion(model=model_name)
...
generate_kwargs = {
    "prompt": args.prompts,
```
Later we should unify these kwargs into `SamplingParams`-like classes.
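Unifying the loose `generate_kwargs` dict into a params object might look like the following sketch. The class name `OmniSamplingParams` and all of its fields are assumptions for illustration, not the repo's actual API:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional

@dataclass
class OmniSamplingParams:
    # Hypothetical container replacing the ad-hoc generate_kwargs dict.
    prompt: str = ""
    num_inference_steps: int = 50
    guidance_scale: float = 7.5
    seed: Optional[int] = None
    # Escape hatch for model-specific options not yet promoted to fields.
    extra: Dict[str, Any] = field(default_factory=dict)

    def as_kwargs(self) -> Dict[str, Any]:
        # Flatten back to the kwargs shape the current call sites expect.
        kwargs = asdict(self)
        kwargs.update(kwargs.pop("extra"))
        return kwargs
```

Call sites could then pass `OmniSamplingParams(prompt=args.prompts).as_kwargs()` while the object itself stays validatable and versionable.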
```python
@dataclass
class KVCacheTransferData:
```
What is the boundary between KV transfer here and the current hidden-states transfer done by the stage worker?
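The quoted diff shows only the class header. As a rough illustration of what such a transfer payload could carry, here is a sketch; every field below is a guess, not the PR's actual schema:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class KVCacheTransferDataSketch:
    # Hypothetical payload for moving KV blocks between stages.
    request_id: str
    layer_names: List[str]
    block_ids: List[int]
    # Serialized tensors (e.g. host buffers) keyed by layer name.
    kv_buffers: Dict[str, Any] = field(default_factory=dict)

    def num_blocks(self) -> int:
        return len(self.block_ids)
```

Keeping the payload a plain dataclass (rather than raw tensors in a dict) would make the AR/DiT boundary explicit and testable.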
```python
    # 3. Return False means "Do NOT stop the request" -> Continue Decoding
    return False
...
elif criteria_type == "special_token":
```
How generalizable is `criteria_type` to other models in the future?
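One way to keep `criteria_type` extensible beyond Bagel is a small registry keyed by criteria name, so new models register predicates instead of growing the `elif` chain. A hypothetical sketch (names and config keys are invented):

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping criteria_type -> stop predicate.
STOP_CRITERIA: Dict[str, Callable[[List[int], dict], bool]] = {}

def register_criteria(name: str):
    def deco(fn):
        STOP_CRITERIA[name] = fn
        return fn
    return deco

@register_criteria("special_token")
def stop_on_special_token(token_ids: List[int], cfg: dict) -> bool:
    # Stop once any model-specific special token has been generated.
    return bool(set(token_ids) & set(cfg.get("special_token_ids", [])))

def should_stop(criteria_type: str, token_ids: List[int], cfg: dict) -> bool:
    # Unknown criteria default to "do not stop" (False continues decoding),
    # matching the quoted code's convention.
    fn = STOP_CRITERIA.get(criteria_type)
    return fn(token_ids, cfg) if fn else False
```

A new model would then ship its own `@register_criteria("...")` predicate without touching the scheduler.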
```python
# Ensure scheduled_new_reqs carry omni-specific payloads
# (e.g., additional_information)
def schedule(self) -> SchedulerOutput:  # type: ignore[override]
```
Does it support continuous, chunked KV transfer? Can chunked communication and computation across stages be unified in this implementation?

Because the KV cache transmission is currently implemented through intrusive changes in the GPU AR runner and the GPU diffusion worker, the relevant code needs to be abstracted (the abstracted runner should support KV cache extraction and reception; maybe we need to add a runner in diffusion?). Right now the extracted KV cache data format must fully conform to Bagel's format, which also needs to be decoupled, and the related unit tests need modification accordingly.
Shall we wait for #800 to be merged and then refactor this PR?

Yes, we discussed this with jiangyun yesterday and decided to merge the diffusion model runner first.
Force-pushed bf7cfe2 to 1aee1d5 (Compare)
@hsliuustc0106 @Gaohan123 @ZJY0516 I rebased it and added an online inference example. Please review it. 😊
ZJY0516 left a comment

This PR is quite large now.
```python
        gen_input_vae[k] = v.to(self.device)
...
# VAE needs bfloat16 to match model strings usually, specifically encode
with torch.autocast(device_type="cuda", enabled=self.device.type == "cuda", dtype=torch.bfloat16):
```
Why do we need to hardcode "cuda" and bf16 here? I remember we load the model in bf16 by default.
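If the intent is simply "match whatever the model was loaded as," the autocast context could derive the device type from the module instead of hardcoding it. A sketch (the helper name is invented; it assumes the module's parameters reflect the load device):

```python
import torch

def autocast_like(module: torch.nn.Module, dtype: torch.dtype = torch.bfloat16):
    # Derive the device type from the module itself rather than
    # hardcoding "cuda"; autocast stays disabled off-GPU, mirroring
    # the enabled=... guard in the quoted code.
    device_type = next(module.parameters()).device.type
    return torch.autocast(
        device_type=device_type,
        dtype=dtype,
        enabled=device_type == "cuda",
    )
```

The VAE encode would then read `with autocast_like(self.vae): ...` and keep working unchanged on CPU or other backends.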
```python
if torch.is_tensor(v):
    gen_input_img[k] = v.to(self.device)
...
with torch.autocast(device_type="cuda", enabled=self.device.type == "cuda", dtype=torch.bfloat16):
```
```python
    logger.error(f"SharedMemoryConnector get failed for req {request_id}: {e}")
    return None
...
if "shm" in metadata:
```
Am I missing something? Why is this logic placed after the `return None`? It looks like dead code.
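If the `"shm"` branch really is unreachable, the likely fix is to check the metadata before any early return. The shape below is purely illustrative; the function name, keys, and fallback path are all guesses, not the connector's real code:

```python
def get_payload(metadata: dict, request_id: str):
    # Check the shared-memory path BEFORE any early return so it is reachable.
    if "shm" in metadata:
        return metadata["shm"]
    try:
        return metadata["inline"]
    except KeyError as e:
        # Mirrors the logged failure path in the quoted diff.
        print(f"SharedMemoryConnector get failed for req {request_id}: {e}")
        return None
```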
Co-authored-by: wzliu <[email protected]> Signed-off-by: princepride <[email protected]>
Force-pushed f0382e6 to 2655414 (Compare)
@hsliuustc0106 @ZJY0516 @natureofnature Can we merge now?

Let me run the CI now.
Purpose

This PR enables stage-based deployment for the Bagel model, aligning it with the vllm-omni architecture. Specific changes include:
- Added `vllm_omni/model_executor/stage_configs/bagel.yaml` to define the multi-stage pipeline (Thinker/AR stage + Diffusion/DiT stage).
- Refactored the `bagel` model implementation by removing monolithic files and enabling specialized components for each stage:
  - `BagelForConditionalGeneration` (AR mode) for multimodal understanding and text generation.
  - `BagelForConditionalGeneration` (likely wrapping the diffusion pipeline) for image generation.

KV Cache Transfer Design
```mermaid
sequenceDiagram
    participant Sched as Stage 0 AR Scheduler
    participant AR_Runner as Stage 0 AR GPU Runner
    participant Conn as OmniConnector
    participant DiT_Runner as Stage 1 DiT GPU Runner
    Note over Sched, AR_Runner: Stage 0 (AR / LLM Phase)
    Sched->>Sched: 1. Trigger Transfer
    Note right of Sched: e.g. prefill done
    Sched->>AR_Runner: 2. Signal: Send Block IDs
    AR_Runner->>AR_Runner: 3. Extract KV (GPU -> Host)
    AR_Runner->>Conn: 4. Put KV Cache (IPC/Network)
    Note over DiT_Runner: Stage 1 (Diffusion Phase)
    DiT_Runner->>Conn: 5. Waiting / Get KV Data
    Conn-->>DiT_Runner: Return KV Data
    DiT_Runner->>DiT_Runner: 6. Load KV to GPU
    DiT_Runner->>DiT_Runner: 7. Run Diffusion
```

Online Inference
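The put/get handshake in steps 4-5 of the KV transfer diagram can be mimicked with a toy in-process connector. This is illustrative only; `OmniConnector`'s real API is not shown in this thread, so the class and method names below are invented:

```python
import queue

class ToyConnector:
    # In-process stand-in for step 4 (put) and step 5 (blocking get).
    def __init__(self) -> None:
        self._q: "queue.Queue[dict]" = queue.Queue()

    def put_kv(self, payload: dict) -> None:
        # AR runner publishes its extracted KV payload.
        self._q.put(payload)

    def get_kv(self, timeout: float = 5.0) -> dict:
        # DiT runner blocks here until the AR runner has published.
        return self._q.get(timeout=timeout)

# AR side publishes; DiT side consumes.
conn = ToyConnector()
conn.put_kv({"request_id": "req-0", "block_ids": [3, 7]})
payload = conn.get_kv()
```

A real connector would cross process or host boundaries (shared memory, Mooncake, etc.), but the blocking-consumer shape is the same.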
Test Plan

```shell
FLASHINFER_DISABLE_VERSION_CHECK=1 vllm serve "../models/BAGEL-7B-MoT" --omni --port 8091
FLASHINFER_DISABLE_VERSION_CHECK=1 python examples/online_serving/bagel/openai_chat_client.py --prompt "A cute cat" --modality text2img
```

Result
Details
Text2Image (Stage 0, Stage 1)
Test Plan
SharedMemory:
Mooncake:
Result
Details
Image2Text (Only Stage 0)
Test Plan
Image used for test:
Details
Result
Text2Text (Only Stage 0)
Test Plan
Result
Image2Image (Directly using OmniDiffusion, because we currently can't skip Stage 0)

Test Plan
Image used for test:
Details
Result
Details
Limitation