diff --git a/docs/.nav.yml b/docs/.nav.yml
index 47adb87999..c85a01116d 100644
--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@@ -49,7 +49,6 @@ nav:
     - design/architecture_overview.md
     - Feature Design:
         - design/feature/disaggregated_inference.md
-        - design/feature/multi_request_streaming.md
         - design/feature/ray_based_execution.md
     - Module Design:
         - design/module/ar_module.md
diff --git a/docs/design/feature/multi_request_streaming.md b/docs/design/feature/multi_request_streaming.md
deleted file mode 100644
index e3aeb5b121..0000000000
--- a/docs/design/feature/multi_request_streaming.md
+++ /dev/null
@@ -1,47 +0,0 @@
-## Multi-Request Streaming (MRS) on a Single Machine
-
-### 1. Background & Scope
-- All processing runs on a single physical machine with multi-process, per-stage workers. No proxy or network transport involved.
-- Current alignment with vllm-omni: `OmniLLM` supports multiple stages (`OmniStage`). GPU runners already expose streamable steps (prefill/decoding/diffusion), but the entry layer still collects lists and lacks intra-stage streaming and window scheduling.
-- Goal: implement multi-stage, multi-request streaming (MRS) locally. Each stage outputs segments; downstream stages stitch and trigger compute based on configured windows. Shared memory and zero-copy strategies reduce data movement overhead.
-
-### 2. Key Constraints
-- Multi-process per stage: each stage is an independent process with a while loop; device visibility can be configured (`CUDA_VISIBLE_DEVICES`/`torch.cuda.set_device`).
-- Simple IPC (copy-based): use `multiprocessing.Queue`/Pipe for inter-process communication with CPU copies/serialization; do not rely on CUDA IPC/SHM zero-copy in this version.
-- Cross-stage pipeline: different stages can process different requests concurrently (e.g., stage A handles request 1 while stage B handles request 0).
-
-### 3. Architecture Overview
-- Processes & IPC queues
-  - Each "sub-stage" is an OS process (worker). The loop: take from input_queue → compute → put to output_queue.
-  - Inter-stage connection via IPC: copy-based `multiprocessing.Queue` passing dict payloads; use shared memory for large objects.
-  - Each link is SPSC (single-producer/single-consumer): the upstream is the orchestrator and the downstream is a single stage process; queues are unbounded (maxsize=0) on the orchestrator side.
-- Device visibility
-  - Each stage sets `CUDA_VISIBLE_DEVICES` or calls `torch.cuda.set_device` to bind to GPU sets.
-  - A stage may use multiple GPUs internally (TP/PP/DP) but presents as a single stage unit.
-- Simplified IPC: copy-based queues/pipes for data transfer; zero-copy is future work.
-- Pipeline progression: when a stage finishes a request, it enqueues outputs to the downstream stage; if downstream is idle, it starts immediately.
-- Scheduling
-  - A downstream stage triggers only after the upstream completes the request.
-  - Windowed segmentation/stitched triggering is not implemented; intra-stage streaming is not provided.
-
-### 4. IPC Implementation (simplified: copy-based)
-- Use `multiprocessing.Queue`/Pipe for inter-process communication (control + data).
-- Data is serialized/copied via CPU; no CUDA IPC/SHM zero-copy in this version.
-- Backpressure: queues are unbounded; pressure manifests as compute-rate differences. Optional SHM reduces large-object transfer cost; RX/decoding overhead is recorded for observability.
-
-### 5. Scheduling & Cancellation (simplified)
-- Pipeline: when a stage finishes a request, it enqueues to the next stage; that stage immediately pulls the next request from its input queue, enabling cross-stage concurrency.
-- Cancellation/timeout: explicit cancellation/timeouts are not provided; graceful shutdown uses a `None` sentinel sent to each stage input queue.
-
-#### Short sequence example (req0/req1, stage A→B)
-1) t0: stage A handles req0
-2) t1: req0 completes on A → enters B; A immediately starts req1
-3) t2: B handles req0 while A handles req1 (parallel across stages)
-
-### 6. Integration Points (by file)
-- `vllm_omni/entrypoints/omni.py` (Orchestrator)
-  - Class `Omni` orchestrates multi-process stages; constructs `OmniStage` instances in parallel and spawns per-stage workers.
-  - Spawns stage processes per config (set `CUDA_VISIBLE_DEVICES`/`torch.cuda.set_device`), creates control/data channels, builds simple full-trigger flow.
-  - Stats/logging are disabled by default; per-stage and orchestrator stats are only written when explicitly enabled.
-  - Manages process lifecycle: start/wait for readiness, graceful shutdown; forwards results between stages using copy-based IPC and optional SHM.
-  - Stage readiness: each stage emits `{"type": "stage_ready"}` after initialization; the orchestrator waits for all stages or times out and logs diagnostic suggestions.
diff --git a/docs/design/feature/ray_based_execution.md b/docs/design/feature/ray_based_execution.md
index fa793aca55..f69649d227 100644
--- a/docs/design/feature/ray_based_execution.md
+++ b/docs/design/feature/ray_based_execution.md
@@ -1,14 +1,17 @@
 # Distributed utils
 
 This directory (vllm_omni/distributed/ray_utils) contains utilities for distributed execution in vllm-omni, supporting both **Ray** and **Multiprocessing** backends.
-
-## 1. Ray Utils
+## 1. Installation
+```bash
+pip install "ray[default]"
+```
+## 2. Ray Utils
 
 The `ray_utils` module provides helper functions for managing Ray clusters and actors, which is essential for:
 
 * **Multi-node deployment**: Running pipeline stages across different physical machines.
 * **Resource management**: Efficient GPU/CPU allocation.
 
-### 1.1 Basic Usage
+### 2.1 Basic Usage
 
 To use the Ray backend, specify `worker_backend="ray"` when initializing the engine.
@@ -21,7 +24,7 @@ vllm serve Qwen/Qwen2.5-Omni-7B \
   --ray-address auto
 ```
 
-### 1.2 Cluster Setup
+### 2.2 Cluster Setup
 
 **Step 1: Start Head Node**
 Run this on your primary machine:
@@ -38,7 +41,7 @@ ray start --address=:6399
 ```
 > **Tip**: For a complete cluster setup script, refer to the vLLM example:
 > [run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh)
-### 1.3 Distributed Connector Support
+### 2.3 Distributed Connector Support
 
 When running on Ray, the system automatically adapts its communication strategy:
 
@@ -46,16 +49,11 @@ When running on Ray, the system automatically adapts its communication strategy:
 * **Same-Node**: Can still use `SharedMemoryConnector` for efficiency, or Ray's native object store (plasma).
 * **SHM threshold default differs**: when `worker_backend="ray"`, the SharedMemoryConnector default threshold is set to `sys.maxsize`, which forces payloads to go inline (no SHM). Override `shm_threshold_bytes` in the connector config if you want SHM for Ray runs.
 
-### 1.4 Internal Helpers
+### 2.4 Internal Helpers
 
 * **`initialize_ray_cluster`**: Connects to an existing Ray cluster or starts a local one.
 
-## 2. Troubleshooting
+## 3. Troubleshooting
 
 * **Connection Issues**: Ensure the Ray head node is accessible and ports (default 6399 in this example) are open.
 * **Version Mismatch**: Ensure all nodes run the same version of Ray and Python.
-
-### Installation
-```bash
-pip install "ray[default]"
-```
diff --git a/docs/design/index.md b/docs/design/index.md
index c5bf7af476..31420550fb 100644
--- a/docs/design/index.md
+++ b/docs/design/index.md
@@ -9,7 +9,6 @@ This section contains design documents and architecture specifications for vLLM-
 ## Feature Design Documents
 
 - [Disaggregated Inference](feature/disaggregated_inference.md)
-- [Multi-Request Streaming](feature/multi_request_streaming.md)
 - [Ray-based Execution](feature/ray_based_execution.md)
 
 ## Module Design Documents