Merged
1 change: 0 additions & 1 deletion docs/.nav.yml
Original file line number Diff line number Diff line change
@@ -49,7 +49,6 @@ nav:
- design/architecture_overview.md
- Feature Design:
- design/feature/disaggregated_inference.md
- design/feature/multi_request_streaming.md
- design/feature/ray_based_execution.md
- Module Design:
- design/module/ar_module.md
47 changes: 0 additions & 47 deletions docs/design/feature/multi_request_streaming.md

This file was deleted.

22 changes: 10 additions & 12 deletions docs/design/feature/ray_based_execution.md
@@ -1,14 +1,17 @@
# Distributed utils

This directory (vllm_omni/distributed/ray_utils) contains utilities for distributed execution in vllm-omni, supporting both **Ray** and **Multiprocessing** backends.

## 1. Ray Utils
## 1. Installation
```bash
pip install "ray[default]"
```
## 2. Ray Utils

The `ray_utils` module provides helper functions for managing Ray clusters and actors, which is essential for:
* **Multi-node deployment**: Running pipeline stages across different physical machines.
* **Resource management**: Efficient GPU/CPU allocation.

### 1.1 Basic Usage
### 2.1 Basic Usage

To use the Ray backend, specify `worker_backend="ray"` when initializing the engine.

@@ -21,7 +24,7 @@ vllm serve Qwen/Qwen2.5-Omni-7B \
--ray-address auto
```

### 1.2 Cluster Setup
### 2.2 Cluster Setup

**Step 1: Start Head Node**
Run this on your primary machine:
@@ -38,24 +41,19 @@ ray start --address=<HEAD_NODE_IP>:6399
> **Tip**: For a complete cluster setup script, refer to the vLLM example:
> [run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh)

### 1.3 Distributed Connector Support
### 2.3 Distributed Connector Support

When running on Ray, the system automatically adapts its communication strategy:

* **Cross-Node**: `MooncakeConnector` is recommended (requires separate configuration).
* **Same-Node**: `SharedMemoryConnector` can still be used for efficiency, as can Ray's native object store (Plasma).
* **SHM threshold default differs**: when `worker_backend="ray"`, the SharedMemoryConnector default threshold is set to `sys.maxsize`, which forces payloads to go inline (no SHM). Override `shm_threshold_bytes` in the connector config if you want SHM for Ray runs.

### 1.4 Internal Helpers
### 2.4 Internal Helpers

* **`initialize_ray_cluster`**: Connects to an existing Ray cluster or starts a local one.

## 2. Troubleshooting
## 3. Troubleshooting

* **Connection Issues**: Ensure the Ray head node is reachable and the required ports (6399 in this example) are open.
* **Version Mismatch**: Ensure all nodes run the same version of Ray and Python.
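The version-mismatch check above can be automated once you have collected `(ray_version, python_version)` pairs from each node (how you gather them is out of scope here). The helper below is a sketch with invented names, not part of vllm-omni:

```python
def check_versions(node_versions: dict[str, tuple[str, str]]) -> list[str]:
    """Return the names of nodes whose (ray, python) versions differ
    from those of the first node in the mapping."""
    baseline = next(iter(node_versions.values()))
    return [name for name, versions in node_versions.items()
            if versions != baseline]
```

An empty result means all nodes agree; any names returned identify the nodes to upgrade or downgrade.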

### Installation
```bash
pip install "ray[default]"
```
1 change: 0 additions & 1 deletion docs/design/index.md
@@ -9,7 +9,6 @@ This section contains design documents and architecture specifications for vLLM-
## Feature Design Documents

- [Disaggregated Inference](feature/disaggregated_inference.md)
- [Multi-Request Streaming](feature/multi_request_streaming.md)
- [Ray-based Execution](feature/ray_based_execution.md)

## Module Design Documents