Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 950ce0cc81
@hsliuustc0106 PTAL
update examples as well
Pull request overview
Adds tensor-parallel (TP) compatibility for the BAGEL diffusion pipeline by replacing non-TP-aware HF components with vLLM TP layers and updating weight-loading / vocab checks accordingly (addresses #1253).
Changes:
- Switch BAGEL’s Qwen2 MoT MLP, embedding, norms, and RoPE to vLLM TP-aware implementations and add a TP-aware `load_weights` on the BAGEL LM module.
- Update BAGEL pipeline vocab mismatch checks to use the global `vocab_size` (instead of the local embedding shard size under TP).
- Make BAGEL pipeline weight filtering TP-aware by allowing shape mismatches for parameters that have a vLLM `weight_loader` (see the sketch after this list).
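To make the last point concrete, here is a minimal sketch of TP-aware weight filtering. It assumes vLLM's convention of attaching a `weight_loader` attribute to sharded parameters; the function and variable names are illustrative, not the actual `pipeline_bagel.py` code:

```python
import torch

def filter_weights(model: torch.nn.Module,
                   state_dict: dict[str, torch.Tensor]):
    """Yield (name, tensor) pairs that are safe to load.

    Under TP, a parameter's local shape is only a shard of the checkpoint
    tensor, so a plain shape comparison would wrongly drop it. Any
    parameter carrying a vLLM ``weight_loader`` knows how to shard the
    full tensor itself, so shape mismatches are allowed for it.
    """
    params = dict(model.named_parameters())
    for name, tensor in state_dict.items():
        param = params.get(name)
        if param is None:
            continue  # not a parameter of this module
        has_loader = getattr(param, "weight_loader", None) is not None
        if param.shape == tensor.shape or has_loader:
            yield name, tensor
```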
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_omni/diffusion/models/bagel/pipeline_bagel.py | Uses the global vocab size for safety checks and relaxes shape checks for TP-sharded parameters during weight loading. |
| vllm_omni/diffusion/models/bagel/bagel_transformer.py | Introduces TP-aware rotary embedding + MLP and swaps core layers to vLLM TP primitives; adds TP-aware LM weight loading. |
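On the first file's change, a hedged sketch of why the check must use the global vocab size, assuming the embedding is vLLM's `VocabParallelEmbedding` (which keeps the unsharded size in `org_vocab_size`; the helper name is illustrative):

```python
def check_vocab(embed, tokenizer_vocab_size: int) -> None:
    # Under TP the local embedding weight holds roughly
    # vocab_size / tp_size rows, so comparing the tokenizer length
    # against the local shard shape would fail even when the tokenizer
    # and model actually agree.
    global_vocab = embed.org_vocab_size  # unsharded size, same on every rank
    if tokenizer_vocab_size > global_vocab:
        raise ValueError(
            f"tokenizer vocab {tokenizer_vocab_size} exceeds "
            f"model vocab {global_vocab}")
```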
@hsliuustc0106 Ready to merge.
The GPU memory utilization indicates that some linear layers are not split.
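One quick way to confirm that, shown as a hedged sketch (the helper name is made up): walk the module tree and list anything still instantiated as a plain `torch.nn.Linear`, since vLLM's TP layers derive from its own `LinearBase` rather than `nn.Linear`.

```python
import torch.nn as nn

def find_unsharded_linears(model: nn.Module) -> list[str]:
    # Plain nn.Linear modules are replicated on every TP rank, which is
    # what shows up as higher-than-expected per-GPU memory.
    return [name for name, mod in model.named_modules()
            if type(mod) is nn.Linear]
```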
@ZJY0516 Mainly copied from `Qwen2Attention` in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen2.py. BAGEL uses a different RoPE and adds a lot of qkv_moe modules, so I didn't inherit from it directly. I also updated the memory usage figures for the new version of this model.
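For readers following along, a condensed sketch of the layout being borrowed from `Qwen2Attention` (projections only; it assumes vLLM's TP process groups are already initialized, and omits BAGEL's custom RoPE and the extra qkv_moe projections, so it is a skeleton rather than the actual BAGEL class):

```python
import torch.nn as nn
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               RowParallelLinear)

class AttentionProjections(nn.Module):
    """Illustrative skeleton of the Qwen2-style TP attention split."""

    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        head_dim = hidden_size // num_heads
        # Fused Q/K/V, column-sharded: each rank owns num_heads / tp_size
        # query heads plus its share of the KV heads.
        self.qkv_proj = QKVParallelLinear(
            hidden_size, head_dim, num_heads, num_kv_heads, bias=True)
        # Output projection, row-sharded: its forward all-reduces across
        # ranks, which is why splitting these layers cuts per-GPU memory.
        self.o_proj = RowParallelLinear(
            num_heads * head_dim, hidden_size, bias=False)
```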
2. **Launch Server**:
```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/your/custom_bagel.yaml
```
Is TP online serving supported via a CLI argument like `--tp 2`?
I'm afraid not 😂. The `tp` CLI argument can't override the YAML config:
```
(APIServer pid=1640082) INFO 02-10 01:16:51 [utils.py:261] non-default args: {'model_tag': 'ByteDance-Seed/BAGEL-7B-MoT', 'port': 8091, 'model': 'ByteDance-Seed/BAGEL-7B-MoT', 'tensor_parallel_size': 2}
(APIServer pid=1640082) INFO 02-10 01:16:51 [omni.py:117] Initializing stages for model: ByteDance-Seed/BAGEL-7B-MoT
(APIServer pid=1640082) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1640082) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1640082) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1640082) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1640082) INFO 02-10 01:16:51 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
(APIServer pid=1640082) INFO 02-10 01:16:51 [initialization.py:234] Loaded OmniTransferConfig with 1 connector configurations
(APIServer pid=1640082) INFO 02-10 01:16:51 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=1640082) INFO 02-10 01:16:51 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
(APIServer pid=1640082) INFO 02-10 01:16:51 [omni_stage.py:239] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'llm', 'runtime': {'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'thinker', 'model_arch': 'BagelForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.35, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'text', 'distributed_executor_backend': 'mp', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'tensor_parallel_size': 1, 'omni_kv_config': {'need_send_cache': True, 'kv_transfer_criteria': {'type': 'prefill_finished'}}, 'max_num_seqs': 1, 'async_chunk': False}, 'final_output': True, 'final_output_type': 'text', 'is_comprehension': True, 'default_sampling_params': {'temperature': 0.4, 'top_p': 0.9, 'top_k': 1, 'max_tokens': 2048, 'seed': 52, 'detokenize': True, 'repetition_penalty': 1.05}}
(APIServer pid=1640082) INFO 02-10 01:16:51 [omni_stage.py:239] [OmniStage] stage_config: {'stage_id': 1, 'stage_type': 'diffusion', 'runtime': {'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'dit', 'gpu_memory_utilization': 0.55, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'image', 'distributed_executor_backend': 'mp', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'tensor_parallel_size': 1, 'omni_kv_config': {'need_recv_cache': True}}, 'engine_input_source': [0], 'final_output': True, 'final_output_type': 'image', 'is_comprehension': False, 'default_sampling_params': {'seed': 52}}
```
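So for now, TP has to be set per stage in the stage-config YAML. A hypothetical snippet, with field names inferred from the `stage_config` dicts in the log above (the actual `custom_bagel.yaml` schema may differ):

```yaml
# Hypothetical per-stage override; keys mirrored from the logged
# stage_config dicts, not from the real schema.
- stage_id: 0
  stage_type: llm
  engine_args:
    model_arch: BagelForConditionalGeneration
    tensor_parallel_size: 2   # set here, since --tp on the CLI is ignored
- stage_id: 1
  stage_type: diffusion
  engine_args:
    tensor_parallel_size: 1
```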
@lishunyang12 Can we override it in the future?
Purpose
#1253: Let BAGEL support TP.
Test Plan
Test Result
Model output: (details collapsed)
TP = 1 memory usage: (details collapsed)
TP = 2 memory usage: (details collapsed)