Description
Motivation.
This RFC proposes a design for supporting asynchronous, chunk-level computation and communication across stages.
Why?
- Purpose: Process large prefill requests in smaller chunks, batching them with decode requests
- Benefits:
- Better GPU utilization by balancing compute-bound (prefill) and memory-bound (decode) operations
- Improved inter-token latency (ITL) by prioritizing decode requests
- Reduced time-to-first-token (TTFT) for long prompts
- Reduced first-packet-latency for audio generation
- Current vLLM-Omni Support: Enabled by default in vLLM-Omni when `enable_chunked_prefill=True`; the chunk size is set with `max_num_batched_tokens=2048`.
- Problem:
  - Chunked prefill is naturally supported in the text generation stage (e.g. the qwen3-omni thinker stage), but not yet in the modality generation stage (e.g. the qwen3-omni talker stage); this needs support.
  - Chunks cannot be sent asynchronously to the next stage, so a streaming pipeline is not achieved.
The current workflow for a multi-stage model is that each stage runs full prefill+decode on the entire request before forwarding it:

Request → Stage-0 (Thinker) → Stage-1 (Talker) → ... → Final Output
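As a toy illustration of how chunked prefill splits each step's token budget between decode requests and a prefill chunk (a hypothetical `plan_step` helper with illustrative numbers, not vLLM-Omni's actual scheduler):

```python
def plan_step(prefill_remaining: int, num_decodes: int,
              max_num_batched_tokens: int = 2048) -> tuple[int, int]:
    """Split one scheduling step's token budget.

    Decode requests (1 token each) are admitted first to keep ITL low;
    the leftover budget goes to the next prefill chunk.
    """
    decode_tokens = min(num_decodes, max_num_batched_tokens)
    prefill_chunk = min(prefill_remaining, max_num_batched_tokens - decode_tokens)
    return decode_tokens, prefill_chunk

# A 6000-token prompt alongside 100 running decode requests: each step
# batches 100 decode tokens with a 1948-token prefill chunk.
print(plan_step(6000, 100))  # -> (100, 1948)
```

This is why long prompts no longer starve decode requests: the prefill is spread over several steps instead of monopolizing one giant batch.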
Proposed Change.
Using the Qwen-series omni models as an example:

Current:

After the change:
This RFC will be implemented in different phases.
1. support streaming chunk output
In the current inference flow, the stages run strictly sequentially: the output of stage-0 is converted into the input of the next stage. To enable pipeline parallelism between stages, we need to submit requests to the different stages simultaneously.
AS IS:
TO BE:
At this phase, we use the OmniConnector to pass each stage's output. Supporting async chunks requires using the model's intermediate output as the input to the next stage.
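The intended overlap between two stages can be sketched with asyncio tasks connected by a queue standing in for the connector (a toy model; `stage0`/`stage1` are hypothetical names, not the real OmniConnector API):

```python
import asyncio

async def stage0(queue: asyncio.Queue, num_chunks: int):
    # Thinker: emit each chunk's intermediate output as soon as it is ready.
    for chunk in range(num_chunks):
        await asyncio.sleep(0)  # stand-in for one chunk of prefill compute
        await queue.put(chunk)
    await queue.put(None)  # end-of-request marker

async def stage1(queue: asyncio.Queue) -> list[int]:
    # Talker: consume chunks as they arrive instead of waiting for the
    # whole request to finish in stage-0.
    consumed = []
    while (chunk := await queue.get()) is not None:
        consumed.append(chunk)
    return consumed

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    _, consumed = await asyncio.gather(stage0(queue, 4), stage1(queue))
    return consumed

print(asyncio.run(main()))  # -> [0, 1, 2, 3]
```

Both tasks run on the same event loop, so stage-1 starts working on chunk 0 while stage-0 is still producing chunk 1.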
Implementing the Design:
**Add the asynchronous switch.**
```python
vllm_group.add_argument("--async-prefilling", **vllm_kwargs["async_prefilling"])
```
```python
class AsyncOmni(OmniBase):
    async def generate(self):
        # Submit the request to both stages so they can run as a pipeline.
        self.stage_list[0].submit(task)
        self.stage_list[1].submit(task_1)
```
```python
class OmniConnectorBase(ABC):
    def __init__(self, stage_id: int):
        self.stage_id = stage_id
        self.request: dict[str, int] = {}  # maintain per-request chunk info

    def put_async(self, request_id: str, chunk: int):
        # Key identifies (request, producing stage, chunk index).
        key = f"{request_id}_{self.stage_id}_{chunk}"
        pass

    def get_async(self, request_id: str, chunk: int):
        # Read the chunk produced by the previous stage.
        target_stage_id = self.stage_id - 1
        key = f"{request_id}_{target_stage_id}_{chunk}"
        pass


def maybe_send_chunk_via_connector():
    connector.put_async()


def maybe_recv_chunk_via_connector():
    connector.get_async()
```
```python
class OmniGPUModelRunner(GPUModelRunner):
    def _model_forward(self, input_ids, positions, intermediate_tensors,
                       inputs_embeds, model_kwargs_extra, **model_kwargs):
        # Receive the previous stage's chunk before this stage's forward pass.
        maybe_recv_chunk_via_connector()
        model_output = super()._model_forward(
            input_ids=input_ids,
            positions=positions,
            intermediate_tensors=intermediate_tensors,
            inputs_embeds=inputs_embeds,
            **model_kwargs,
            **model_kwargs_extra,
        )
        # Send this stage's chunk onward as soon as it is ready.
        maybe_send_chunk_via_connector()
        return model_output
```
Subtasks:
- thinker->talker pipeline: when the Thinker finishes prefilling the current chunk, its output high-level representations are immediately used to prefill the Talker's current chunk asynchronously, while the Thinker continues prefilling its next chunk. This significantly reduces the Time-To-First-Token (TTFT) for both the Thinker and the Talker.
- talker->code2wav pipeline: once the Talker generates the first token, the MTP module predicts the remaining tokens for the current frame. These tokens are then decoded into waveforms by a streaming multi-codebook codec decoder that only attends to the left context.
- code2wav chunked decode: to minimize the user's wait for the first generated packet, we propose a left-context-only multi-codebook generation mechanism.
- audio streaming output: Qwen3-Omni can output the waveform immediately after the Talker generates each token, significantly reducing the first-packet latency.
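A back-of-the-envelope model shows why overlapping the two prefills reduces TTFT (the per-chunk latencies below are illustrative, not measured):

```python
def ttft_sequential(num_chunks: int, t_thinker_ms: float, t_talker_ms: float) -> float:
    # Talker prefills only after the Thinker has prefilled the full prompt.
    return num_chunks * t_thinker_ms + num_chunks * t_talker_ms

def ttft_pipelined(num_chunks: int, t_thinker_ms: float, t_talker_ms: float) -> float:
    # Talker prefills chunk k while the Thinker prefills chunk k+1; assuming
    # t_talker_ms <= t_thinker_ms, only the last Talker chunk adds latency.
    return num_chunks * t_thinker_ms + t_talker_ms

# 4 chunks, 50 ms per Thinker chunk, 20 ms per Talker chunk:
print(ttft_sequential(4, 50, 20))  # -> 280
print(ttft_pipelined(4, 50, 20))   # -> 220
```

The gap grows with the number of chunks, i.e. with prompt length, which is exactly the long-prompt case chunked prefill targets.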
Remaining bugfixes:
- the metrics info for stage-1 & stage-2 is empty
- the output audio is truncated and incomplete with multimodal input
- the chunk_size and context_size of qwen3-omni currently must be the same; the algorithm needs to be fixed
- the request_id is inconsistent across stages for the same request; a hard-coded workaround is in place and needs a proper fix
2. support chunked prefill across stages
- support talker chunked prefill
- support chunked pipeline in prefill phase between thinker and talker
3. support async put & get
Run two background loops for put & get requests.
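A minimal sketch of the two loops, assuming asyncio queues stand in for the stage's outbox, the transport, and the stage's inbox (hypothetical names, not the real connector API):

```python
import asyncio

async def put_loop(outbox: asyncio.Queue, wire: asyncio.Queue):
    # Background sender: drain locally produced chunks onto the transport.
    while (item := await outbox.get()) is not None:
        await wire.put(item)
    await wire.put(None)  # propagate shutdown to the receiver

async def get_loop(wire: asyncio.Queue, inbox: asyncio.Queue):
    # Background receiver: surface chunks from the transport to the stage.
    while (item := await wire.get()) is not None:
        await inbox.put(item)

async def demo():
    outbox, wire, inbox = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    loops = asyncio.gather(put_loop(outbox, wire), get_loop(wire, inbox))
    for chunk in ("c0", "c1"):
        await outbox.put(chunk)
    await outbox.put(None)  # shut both loops down
    await loops
    return [inbox.get_nowait() for _ in range(inbox.qsize())]

print(asyncio.run(demo()))  # -> ['c0', 'c1']
```

Keeping both loops off the model-runner's critical path means the forward pass never blocks on transport latency.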
4. Add test cases (UT & CI)
- UT
- CI
5. support MooncakeConnector
Pending
6. Add design documents for user & developer
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.