
[RFC]: Support async computation and communication across stages by chunks #268

@R2-Y

Description


Motivation.

This RFC proposes a design for supporting async computation and communication across stages by chunks.

Why?

  • Purpose: Process large prefill requests in smaller chunks, batching them with decode requests
  • Benefits:
    • Better GPU utilization by balancing compute-bound (prefill) and memory-bound (decode) operations
    • Improved inter-token latency (ITL) by prioritizing decode requests
    • Reduced time-to-first-token (TTFT) for long prompts
    • Reduced first-packet-latency for audio generation
  • Current vLLM-Omni Support: Enabled by default in vLLM-Omni when enable_chunked_prefill=True; the chunk size is set via max_num_batched_tokens=2048.
  • Problems:
  1. Chunked prefill naturally supports the text generation stage (e.g. the qwen3-omni thinker stage),
     but not the modality generation stage (e.g. the qwen3-omni talker stage). This needs to be supported.
  2. Chunks cannot be sent asynchronously to the next stage, so there is no streaming pipeline.
     In the current multi-stage workflow, each stage processes the entire request before forwarding it:

     Request → Stage-0 (Thinker) → Stage-1 (Talker) → ... → Final Output
              [Full Prefill+decode]      [Full Prefill+decode]     [Full Prefill+decode]
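The latency cost of the full-request workflow above can be illustrated with a toy timing model (the per-chunk cost and chunk counts are hypothetical, purely for illustration): with full-request processing each stage waits for the whole previous stage, while a chunked pipeline only pays the pipeline fill time once.

```python
# Toy timing model (hypothetical numbers): n_chunks chunks per request,
# each chunk costing t time units per stage, n_stages stages in the pipeline.

def full_request_latency(n_chunks: int, t: float, n_stages: int) -> float:
    # Each stage waits for the entire previous stage to finish.
    return n_stages * n_chunks * t

def chunked_pipeline_latency(n_chunks: int, t: float, n_stages: int) -> float:
    # Stage i starts chunk 0 as soon as stage i-1 emits it: the pipeline
    # fills in (n_stages - 1) * t, then drains in n_chunks * t.
    return (n_stages - 1) * t + n_chunks * t

print(full_request_latency(8, 1.0, 3))      # 24.0
print(chunked_pipeline_latency(8, 1.0, 3))  # 10.0
```

With 8 chunks and 3 stages the end-to-end latency drops from 24 to 10 time units in this idealized model; the gain grows with chunk count.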
    

Proposed Change.

Using the Qwen-series omni models as an example:

current: (figure)

after change: (figure)

This RFC will be implemented in different phases.

1. support streaming chunk output

In the current inference process, the stages run sequentially: the output of stage-0 is converted into the input of the next stage. To enable pipelined parallelism across stages, we need to send requests to the different stages concurrently.
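A minimal sketch of the intended pipelined behavior, using an asyncio queue as a stand-in for the inter-stage channel (the stage coroutines and chunk payloads here are illustrative, not the real vLLM-Omni API):

```python
import asyncio

async def stage0(queue: asyncio.Queue, n_chunks: int) -> None:
    # Thinker: emit each chunk's output as soon as it is computed.
    for chunk in range(n_chunks):
        await asyncio.sleep(0)           # stand-in for per-chunk compute
        await queue.put(f"hidden_{chunk}")
    await queue.put(None)                # end-of-request sentinel

async def stage1(queue: asyncio.Queue, results: list) -> None:
    # Talker: start consuming chunks before stage-0 has finished.
    while (item := await queue.get()) is not None:
        results.append(f"audio_from_{item}")

async def main(n_chunks: int = 3) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    # Both stages run concurrently instead of stage-1 waiting for stage-0.
    await asyncio.gather(stage0(queue, n_chunks), stage1(queue, results))
    return results

outputs = asyncio.run(main())
print(outputs)  # ['audio_from_hidden_0', 'audio_from_hidden_1', 'audio_from_hidden_2']
```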
AS IS: (figure)

TO BE: (figure)

At this stage, we use the OmniConnector to pass each stage's output. Supporting async chunks requires using the model's intermediate output as the input to the next stage.


Implementing the Design:

**Add the asynchronous switch.**

vllm_group.add_argument("--async-prefilling", **vllm_kwargs["async_prefilling"])

class AsyncOmni(OmniBase):
    async def generate(self, task, task_1):
        # Submit to both stages concurrently instead of waiting for
        # stage-0 to finish the entire request.
        self.stage_list[0].submit(task)
        self.stage_list[1].submit(task_1)

class OmniConnectorBase(ABC):
    def __init__(self, stage_id: int):
        self.stage_id = stage_id
        self.request: dict[str, int] = {}  # maintain per-request chunk info

    def put_async(self, request_id: str, chunk: int, data):
        # Publish this stage's output for the given chunk.
        key = f"{request_id}_{self.stage_id}_{chunk}"
        ...

    def get_async(self, request_id: str, chunk: int):
        # Fetch the previous stage's output for the given chunk.
        target_stage_id = self.stage_id - 1
        key = f"{request_id}_{target_stage_id}_{chunk}"
        ...

def maybe_send_chunk_via_connector(connector, request_id, chunk, data):
    # Forward this stage's chunk output if a connector is configured.
    connector.put_async(request_id, chunk, data)

def maybe_recv_chunk_via_connector(connector, request_id, chunk):
    # Pull the previous stage's chunk output if a connector is configured.
    return connector.get_async(request_id, chunk)

class OmniGPUModelRunner(GPUModelRunner):
    def _model_forward(self, input_ids, positions, intermediate_tensors,
                       inputs_embeds, model_kwargs, model_kwargs_extra):
        # Receive the previous stage's chunk output before running forward.
        maybe_recv_chunk_via_connector()
        model_output = super()._model_forward(
            input_ids=input_ids,
            positions=positions,
            intermediate_tensors=intermediate_tensors,
            inputs_embeds=inputs_embeds,
            **model_kwargs,
            **model_kwargs_extra,
        )
        # Send this stage's chunk output to the next stage.
        maybe_send_chunk_via_connector()
        return model_output
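The connector key scheme above can be exercised with a toy in-memory connector (the shared dict stands in for whatever transport a real connector uses; the class and method names mirror the sketch but are not the actual implementation):

```python
class InMemoryConnector:
    """Toy connector: stage i reads keys written by stage i - 1."""

    _store: dict = {}  # shared across stages in this toy example

    def __init__(self, stage_id: int):
        self.stage_id = stage_id
        self.request: dict[str, int] = {}  # request_id -> next chunk to send

    def put_async(self, request_id: str, data) -> str:
        # Publish this stage's output under "<request>_<stage>_<chunk>".
        chunk = self.request.get(request_id, 0)
        key = f"{request_id}_{self.stage_id}_{chunk}"
        InMemoryConnector._store[key] = data
        self.request[request_id] = chunk + 1
        return key

    def get_async(self, request_id: str, chunk: int):
        # Read the previous stage's output for the given chunk.
        target_stage_id = self.stage_id - 1
        key = f"{request_id}_{target_stage_id}_{chunk}"
        return InMemoryConnector._store.get(key)

thinker = InMemoryConnector(stage_id=0)
talker = InMemoryConnector(stage_id=1)
thinker.put_async("req-1", "hidden_chunk_0")
print(talker.get_async("req-1", chunk=0))  # hidden_chunk_0
```

The per-request chunk counter is what lets the receiving stage poll for chunk i+1 as soon as chunk i has been consumed.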


subtasks:

  • thinker->talker pipeline: When the Thinker finishes prefilling the current chunk, its output
    high-level representations are immediately used to prefill the Talker's current chunk
    asynchronously, while the Thinker continues to prefill its next chunk. This significantly
    reduces the Time-To-First-Token (TTFT) for both the Thinker and the Talker.
  • talker->code2wav pipeline: Once the Talker generates the first token, the MTP module predicts
    the remaining tokens for the current frame. These tokens are then decoded into waveforms by a
    streaming multi-codebook codec decoder that only considers the left context.
  • code2wav chunked decode: To minimize the user's waiting time for the first generated packet,
    we propose a left-context-only multi-codebook generation mechanism.
  • audio streaming output: Qwen3-Omni can output the waveform immediately after the Talker
    generates each token, significantly reducing the first-packet latency.
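The left-context-only decode can be sketched as follows: each frame is decoded using only itself and frames to its left, so a waveform chunk can be emitted the moment its codec tokens arrive (a simplified stand-in for the streaming multi-codebook codec decoder; frame values and the context size are illustrative):

```python
def streaming_decode(frames: list, context_size: int = 2) -> list:
    """Decode each frame using only up to `context_size` frames of left
    context -- never right context, so output can stream immediately."""
    waveform_chunks = []
    for i, frame in enumerate(frames):
        left = frames[max(0, i - context_size):i]
        # A real codec would synthesize audio here; we record what was used.
        waveform_chunks.append(f"wav({frame}|ctx={left})")
    return waveform_chunks

chunks = streaming_decode([10, 11, 12, 13], context_size=2)
print(chunks[0])  # wav(10|ctx=[]) -- the first packet needs no future frames
```

Because no frame ever waits on right context, first-packet latency is bounded by the cost of decoding a single frame rather than the whole utterance.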

remaining bugfixes:

  • stage-1 & stage-2 metrics info is empty
  • the output audio is truncated and incomplete with multimodal input
  • the chunk_size and context_size of qwen3-omni must currently be the same; the algorithm needs to be fixed
  • the request_id is inconsistent across stages for the same request; we currently hard-code around this and need a proper fix

2. support chunked prefill across stages

  • support talker chunked prefill
  • support chunked pipeline in prefill phase between thinker and talker
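The chunked prefill pipeline between the two stages can be pictured as a schedule in which the talker prefills chunk i while the thinker prefills chunk i+1 (a hypothetical schedule sketch, not the scheduler's real API):

```python
def pipeline_schedule(n_chunks: int) -> list:
    """Per step, which chunk each stage is prefilling; '-' means idle.
    The talker trails the thinker by exactly one chunk."""
    steps = []
    for step in range(n_chunks + 1):
        thinker = f"chunk_{step}" if step < n_chunks else "-"
        talker = f"chunk_{step - 1}" if step >= 1 else "-"
        steps.append((thinker, talker))
    return steps

for thinker, talker in pipeline_schedule(3):
    print(f"thinker={thinker:8} talker={talker}")
```

For 3 chunks this yields four steps, with both stages busy on every step except the pipeline fill and drain, which is where the TTFT reduction comes from.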

3. support async put & get

Run two background loops, one for put and one for get requests.
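A minimal sketch of the two background loops, assuming asyncio tasks that drain an outgoing queue and poll for incoming chunks (the queue names and stand-in payloads are illustrative, not the real transport):

```python
import asyncio

async def put_loop(send_queue: asyncio.Queue, sent: list) -> None:
    # Background loop: drain this stage's outgoing chunks so the model
    # forward pass never blocks on the transport.
    while (item := await send_queue.get()) is not None:
        sent.append(item)        # stand-in for connector.put_async(...)

async def get_loop(received: list, n_expected: int) -> None:
    # Background loop: poll the transport for the previous stage's chunks.
    for chunk in range(n_expected):
        await asyncio.sleep(0)   # stand-in for an awaitable transport read
        received.append(f"chunk_{chunk}")  # stand-in for connector.get_async

async def main():
    send_queue: asyncio.Queue = asyncio.Queue()
    sent, received = [], []
    tasks = [asyncio.create_task(put_loop(send_queue, sent)),
             asyncio.create_task(get_loop(received, 2))]
    await send_queue.put("hidden_0")
    await send_queue.put(None)   # sentinel stops the put loop
    await asyncio.gather(*tasks)
    return sent, received

sent, received = asyncio.run(main())
print(sent, received)  # ['hidden_0'] ['chunk_0', 'chunk_1']
```

Decoupling put and get into their own loops means neither direction of chunk traffic can stall the other, or the model runner itself.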

4. Add test cases (UT & CI)

  • UT
  • CI

5. support MooncakeConnector

Pending

6. Add design documents for user & developer

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
