Description
Motivation.
This RFC proposes a design for supporting asynchronous, chunk-level computation and communication across stages.
Why?
- Purpose: Process large prefill requests in smaller chunks, batching them with decode requests
- Benefits:
- Better GPU utilization by balancing compute-bound (prefill) and memory-bound (decode) operations
- Improved inter-token latency (ITL) by prioritizing decode requests
- Reduced time-to-first-token (TTFT) for long prompts
- Reduced first-packet-latency for audio generation
- Current vLLM-Omni Support: Enabled by default in vLLM-Omni when `enable_chunked_prefill=True`; the chunk size is set with `max_num_batched_tokens=2048`.
- Problem:
  - Chunked prefill is naturally supported in the text generation stage (e.g. the qwen3-omni thinker stage), but not yet in the modality generation stage (e.g. the qwen3-omni talker stage); this needs support.
  - Chunks cannot be sent asynchronously to the next stage, so a streaming pipeline is not achieved.
The current workflow for a multi-stage model is that each stage runs full prefill+decode on the entire request before forwarding it:

Request → Stage-0 (Thinker) → Stage-1 (Talker) → ... → Final Output
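As a toy illustration of how chunked prefill splits each step's token budget between decode requests and a prefill chunk (a hypothetical `plan_step` helper with illustrative numbers, not vLLM-Omni's actual scheduler):

```python
def plan_step(prefill_remaining: int, num_decodes: int,
              max_num_batched_tokens: int = 2048) -> tuple[int, int]:
    """Split one scheduling step's token budget.

    Decode requests (1 token each) are admitted first to keep ITL low;
    the leftover budget goes to the next prefill chunk.
    """
    decode_tokens = min(num_decodes, max_num_batched_tokens)
    prefill_chunk = min(prefill_remaining, max_num_batched_tokens - decode_tokens)
    return decode_tokens, prefill_chunk

# A 6000-token prompt alongside 100 running decode requests: each step
# batches 100 decode tokens with a 1948-token prefill chunk.
print(plan_step(6000, 100))  # -> (100, 1948)
```

This is why long prompts no longer starve decode requests: the prefill is spread over several steps instead of monopolizing one giant batch.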
Proposed Change.
Using the Qwen-series omni models as an example:

Current:

After the change:
This RFC will be implemented in different phases.
1. support streaming chunk output
In the current inference flow, the stages run strictly sequentially: the output of stage-0 is converted into the input of the next stage. To enable pipeline parallelism between stages, we need to submit requests to the different stages simultaneously.
AS IS:
TO BE:
At this phase, we use the OmniConnector to pass each stage's output. Supporting async chunks requires using the model's intermediate output as the input to the next stage.
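The intended overlap between two stages can be sketched with asyncio tasks connected by a queue standing in for the connector (a toy model; `stage0`/`stage1` are hypothetical names, not the real OmniConnector API):

```python
import asyncio

async def stage0(queue: asyncio.Queue, num_chunks: int):
    # Thinker: emit each chunk's intermediate output as soon as it is ready.
    for chunk in range(num_chunks):
        await asyncio.sleep(0)  # stand-in for one chunk of prefill compute
        await queue.put(chunk)
    await queue.put(None)  # end-of-request marker

async def stage1(queue: asyncio.Queue) -> list[int]:
    # Talker: consume chunks as they arrive instead of waiting for the
    # whole request to finish in stage-0.
    consumed = []
    while (chunk := await queue.get()) is not None:
        consumed.append(chunk)
    return consumed

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    _, consumed = await asyncio.gather(stage0(queue, 4), stage1(queue))
    return consumed

print(asyncio.run(main()))  # -> [0, 1, 2, 3]
```

Both tasks run on the same event loop, so stage-1 starts working on chunk 0 while stage-0 is still producing chunk 1.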
Implementing the Design:
**Add the asynchronous switch.**
```python
vllm_group.add_argument("--async-prefilling", **vllm_kwargs["async_prefilling"])
```
```python
class AsyncOmni(OmniBase):
    async def generate(self):
        # Submit the request to both stages so they can run as a pipeline.
        self.stage_list[0].submit(task)
        self.stage_list[1].submit(task_1)
```
```python
class OmniConnectorBase(ABC):
    def __init__(self, stage_id: int):
        self.stage_id = stage_id
        self.request: dict[str, int] = {}  # maintain per-request chunk info

    def put_async(self, request_id: str, chunk: int):
        # Key identifies (request, producing stage, chunk index).
        key = f"{request_id}_{self.stage_id}_{chunk}"
        pass

    def get_async(self, request_id: str, chunk: int):
        # Read the chunk produced by the previous stage.
        target_stage_id = self.stage_id - 1
        key = f"{request_id}_{target_stage_id}_{chunk}"
        pass


def maybe_send_chunk_via_connector():
    connector.put_async()


def maybe_recv_chunk_via_connector():
    connector.get_async()
```
```python
class OmniGPUModelRunner(GPUModelRunner):
    def _model_forward(self, input_ids, positions, intermediate_tensors,
                       inputs_embeds, model_kwargs_extra, **model_kwargs):
        # Receive the previous stage's chunk before this stage's forward pass.
        maybe_recv_chunk_via_connector()
        model_output = super()._model_forward(
            input_ids=input_ids,
            positions=positions,
            intermediate_tensors=intermediate_tensors,
            inputs_embeds=inputs_embeds,
            **model_kwargs,
            **model_kwargs_extra,
        )
        # Send this stage's chunk onward as soon as it is ready.
        maybe_send_chunk_via_connector()
        return model_output
```
Subtasks:
- thinker->talker pipeline: when the Thinker finishes prefilling the current chunk, its output high-level representations are immediately used to prefill the Talker's current chunk asynchronously, while the Thinker continues prefilling its next chunk. This significantly reduces the Time-To-First-Token (TTFT) for both the Thinker and the Talker.
- talker->code2wav pipeline: once the Talker generates the first token, the MTP module predicts the remaining tokens for the current frame. These tokens are then decoded into waveforms by a streaming multi-codebook codec decoder that only attends to the left context.
- code2wav chunked decode: to minimize the user's wait for the first generated packet, we propose a left-context-only multi-codebook generation mechanism.
- audio streaming output: Qwen3-Omni can output the waveform immediately after the Talker generates each token, significantly reducing the first-packet latency.
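A back-of-the-envelope model shows why overlapping the two prefills reduces TTFT (the per-chunk latencies below are illustrative, not measured):

```python
def ttft_sequential(num_chunks: int, t_thinker_ms: float, t_talker_ms: float) -> float:
    # Talker prefills only after the Thinker has prefilled the full prompt.
    return num_chunks * t_thinker_ms + num_chunks * t_talker_ms

def ttft_pipelined(num_chunks: int, t_thinker_ms: float, t_talker_ms: float) -> float:
    # Talker prefills chunk k while the Thinker prefills chunk k+1; assuming
    # t_talker_ms <= t_thinker_ms, only the last Talker chunk adds latency.
    return num_chunks * t_thinker_ms + t_talker_ms

# 4 chunks, 50 ms per Thinker chunk, 20 ms per Talker chunk:
print(ttft_sequential(4, 50, 20))  # -> 280
print(ttft_pipelined(4, 50, 20))   # -> 220
```

The gap grows with the number of chunks, i.e. with prompt length, which is exactly the long-prompt case chunked prefill targets.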
Remaining bugfixes:
- the metrics info for stage-1 & stage-2 is empty
- the output audio is truncated and incomplete with multimodal input
- the chunk_size and context_size of qwen3-omni currently must be the same; the algorithm needs to be fixed
- the request_id is inconsistent across stages for the same request; a hard-coded workaround is in place and needs a proper fix
2. support chunked prefill across stages
- support talker chunked prefill
- support chunked pipeline in prefill phase between thinker and talker
3. support async put & get
Run two background loops for put & get requests.
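A minimal sketch of the two loops, assuming asyncio queues stand in for the stage's outbox, the transport, and the stage's inbox (hypothetical names, not the real connector API):

```python
import asyncio

async def put_loop(outbox: asyncio.Queue, wire: asyncio.Queue):
    # Background sender: drain locally produced chunks onto the transport.
    while (item := await outbox.get()) is not None:
        await wire.put(item)
    await wire.put(None)  # propagate shutdown to the receiver

async def get_loop(wire: asyncio.Queue, inbox: asyncio.Queue):
    # Background receiver: surface chunks from the transport to the stage.
    while (item := await wire.get()) is not None:
        await inbox.put(item)

async def demo():
    outbox, wire, inbox = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    loops = asyncio.gather(put_loop(outbox, wire), get_loop(wire, inbox))
    for chunk in ("c0", "c1"):
        await outbox.put(chunk)
    await outbox.put(None)  # shut both loops down
    await loops
    return [inbox.get_nowait() for _ in range(inbox.qsize())]

print(asyncio.run(demo()))  # -> ['c0', 'c1']
```

Keeping both loops off the model-runner's critical path means the forward pass never blocks on transport latency.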
4. Add test cases (UT & CI)
- UT
- CI
5. support MooncakeConnector
Pending
6. Add design documents for user & developer
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.