Merged
18 changes: 18 additions & 0 deletions examples/offline_inference/bagel/README.md
@@ -151,6 +151,24 @@ The default yaml configuration deploys Thinker and DiT on the same GPU. You can

------

#### Tensor Parallelism (TP)

For larger models or multi-GPU environments, you can enable Tensor Parallelism (TP) by modifying the stage configuration (e.g., [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)).

1. **Set `tensor_parallel_size`**: Increase this value (e.g., to `2` or `4`).
2. **Set `devices`**: Specify the comma-separated GPU IDs to be used for the stage (e.g., `"0,1"`).

Example configuration for TP=2 on GPUs 0 and 1:
```yaml
engine_args:
tensor_parallel_size: 2
...
runtime:
devices: "0,1"
```
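Applied together, the two steps amount to setting one field under `engine_args` and one under `runtime` for a stage. A minimal sketch of that edit in Python (the helper name and the exact nesting are assumptions based on the yaml fragment above):

```python
def enable_tp(stage_cfg: dict, tp_size: int, devices: str) -> dict:
    """Set tensor_parallel_size and devices for one stage config.

    Assumes the layout shown above: engine_args.tensor_parallel_size
    and runtime.devices (comma-separated GPU ids).
    """
    assert tp_size == len(devices.split(",")), "one GPU id per TP rank"
    cfg = dict(stage_cfg)  # shallow copy is enough for this sketch
    cfg.setdefault("engine_args", {})["tensor_parallel_size"] = tp_size
    cfg.setdefault("runtime", {})["devices"] = devices
    return cfg

# Example: upgrade a single-GPU stage to TP=2 on GPUs 0 and 1.
cfg = enable_tp({"engine_args": {"gpu_memory_utilization": 0.35},
                 "runtime": {"max_batch_size": 1}}, 2, "0,1")
```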

------

#### 🔗 Runtime Configuration

| Parameter | Value | Description |
21 changes: 19 additions & 2 deletions examples/online_serving/bagel/README.md
@@ -22,12 +22,29 @@ cd /workspace/vllm-omni/examples/online_serving/bagel
bash run_server.sh
```

If you have a custom stage-configs file, launch the server with the command below:

```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```

#### 🚀 Tensor Parallelism (TP)

For larger models or multi-GPU environments, you can enable Tensor Parallelism (TP) for the server.

1. **Modify Stage Config**: Create or modify a stage configuration yaml (e.g., [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)). Set `tensor_parallel_size` to `2` (or more) and update `devices` to include multiple GPU IDs (e.g., `"0,1"`).

```yaml
engine_args:
tensor_parallel_size: 2
...
runtime:
devices: "0,1"
```

2. **Launch Server**:
```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/your/custom_bagel.yaml
```
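Note that the launch command itself only changes through the `--stage-configs-path` argument; TP is configured entirely in the yaml. A small sketch that assembles the command (flag names copied from the invocation above; the helper function is hypothetical):

```python
def serve_command(model: str, port: int, stage_configs_path: str) -> str:
    # Flags mirror the CLI invocation shown above; tensor parallelism
    # itself is set in the stage-configs yaml, not on the command line.
    return (f"vllm serve {model} --omni --port {port} "
            f"--stage-configs-path {stage_configs_path}")

cmd = serve_command("ByteDance-Seed/BAGEL-7B-MoT", 8091,
                    "/path/to/your/custom_bagel.yaml")
```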
Collaborator:
Is TP online serving supported by CLI argument like --tp 2?

Collaborator (Author):
I'm afraid not 😂; the CLI argument `--tp` can't overwrite the yaml config:

(APIServer pid=1640082) INFO 02-10 01:16:51 [utils.py:261] non-default args: {'model_tag': 'ByteDance-Seed/BAGEL-7B-MoT', 'port': 8091, 'model': 'ByteDance-Seed/BAGEL-7B-MoT', 'tensor_parallel_size': 2}
(APIServer pid=1640082) INFO 02-10 01:16:51 [omni.py:117] Initializing stages for model: ByteDance-Seed/BAGEL-7B-MoT
(APIServer pid=1640082) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1640082) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1640082) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1640082) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1640082) INFO 02-10 01:16:51 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
(APIServer pid=1640082) INFO 02-10 01:16:51 [initialization.py:234] Loaded OmniTransferConfig with 1 connector configurations
(APIServer pid=1640082) INFO 02-10 01:16:51 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=1640082) INFO 02-10 01:16:51 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
(APIServer pid=1640082) INFO 02-10 01:16:51 [omni_stage.py:239] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'llm', 'runtime': {'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'thinker', 'model_arch': 'BagelForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.35, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'text', 'distributed_executor_backend': 'mp', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'tensor_parallel_size': 1, 'omni_kv_config': {'need_send_cache': True, 'kv_transfer_criteria': {'type': 'prefill_finished'}}, 'max_num_seqs': 1, 'async_chunk': False}, 'final_output': True, 'final_output_type': 'text', 'is_comprehension': True, 'default_sampling_params': {'temperature': 0.4, 'top_p': 0.9, 'top_k': 1, 'max_tokens': 2048, 'seed': 52, 'detokenize': True, 'repetition_penalty': 1.05}}
(APIServer pid=1640082) INFO 02-10 01:16:51 [omni_stage.py:239] [OmniStage] stage_config: {'stage_id': 1, 'stage_type': 'diffusion', 'runtime': {'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'dit', 'gpu_memory_utilization': 0.55, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'image', 'distributed_executor_backend': 'mp', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'tensor_parallel_size': 1, 'omni_kv_config': {'need_recv_cache': True}}, 'engine_input_source': [0], 'final_output': True, 'final_output_type': 'image', 'is_comprehension': False, 'default_sampling_params': {'seed': 52}}
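The log is consistent with the per-stage yaml taking precedence over CLI arguments: `tensor_parallel_size: 2` shows up in the non-default args, yet both stages still report `tensor_parallel_size: 1`. An illustrative sketch of that precedence (not the actual vllm-omni merge code):

```python
def resolve_engine_args(cli_args: dict, stage_yaml: dict) -> dict:
    # Yaml wins: CLI values only fill keys the stage config omits,
    # so `--tp 2` is dropped when the yaml pins tensor_parallel_size.
    merged = dict(cli_args)
    merged.update(stage_yaml)
    return merged

merged = resolve_engine_args({"tensor_parallel_size": 2},
                             {"tensor_parallel_size": 1, "devices": "0"})
```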

Collaborator (Author):
@lishunyang12 Can we overwrite it in the future?


### Send Multi-modal Request

Get into the bagel folder: