
Conversation

@joshuadeng
Contributor

@joshuadeng joshuadeng commented Nov 19, 2025

Purpose

A proposal for session-based streaming with sequenced updates, requiring minimal changes to the core scheduler and engine interfaces.

Latest design

AsyncLLM.generate can now optionally take an async generator that produces a stream of StreamingInput objects, instead of a single prompt:

@dataclass
class StreamingInput:
    prompt: PromptType
    sampling_params: SamplingParams | None = None

The SamplingParams can differ for each input; if omitted, the parameters provided in the original generate call are used.

The StreamingInput chunks are handled internally as separate requests, where the prompt of each request is the cumulative concatenation of all input prompts so far + their corresponding output tokens - excluding the final sampled token from each request. All generated tokens are returned to the caller in the session's output stream.

So for streaming inputs [A1, B1, C1], [A2, B2], [A3, B3]:

  1. First prompt [A1, B1, C1], generates [D1]
  2. Second prompt is [A1, B1, C1, A2, B2], generates [C2, D2, E2] (D1 discarded)
  3. Third prompt is [A1, B1, C1, A2, B2, C2, D2, A3, B3], generates [C3, D3] (E2 discarded)

Streamed output tokens would be D1, C2, D2, E2, C3, D3. Note that we expect to generalize/parameterize the behaviour w.r.t. which output tokens to retain in successive prompts as a follow-on based on use case requirements.
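
To make the retention rule concrete, below is a minimal sketch of how the cumulative prompt for each successive sub-request could be computed; the helper name and token lists are purely illustrative and not part of the actual implementation:

def next_session_prompt(
    prev_prompt: list[str], prev_outputs: list[str], new_chunk: list[str]
) -> list[str]:
    # Retain all previous outputs except the final sampled token, which is
    # returned to the caller but dropped from the next prompt.
    return prev_prompt + prev_outputs[:-1] + new_chunk

p1 = ["A1", "B1", "C1"]                               # generates ["D1"]
p2 = next_session_prompt(p1, ["D1"], ["A2", "B2"])    # == [A1, B1, C1, A2, B2]
p3 = next_session_prompt(p2, ["C2", "D2", "E2"], ["A3", "B3"])
# p3 == [A1, B1, C1, A2, B2, C2, D2, A3, B3]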

Inputs are considered complete for a particular session when the provided async generator exits or is closed/garbage-collected.

It is not necessary (though it is allowed) to wait for all of the outputs corresponding to a particular input chunk before sending the next one; chunks are queued internally.
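
For concreteness, a minimal caller-side sketch is shown below. It assumes AsyncLLM.generate accepts the async generator in place of a single prompt as described above; the engine setup, prompts, request id, and the StreamingInput import path are illustrative assumptions rather than the exact API surface:

from vllm import SamplingParams
from vllm.v1.engine.async_llm import AsyncLLM
# Where StreamingInput lives is an assumption here; it is introduced by this PR.
from vllm.v1.engine.async_llm import StreamingInput


async def stream_session(engine: AsyncLLM) -> None:
    async def inputs():
        # Each yielded chunk extends the session's cumulative prompt.
        yield StreamingInput(prompt="The quick brown")
        # A later chunk may override sampling params for its own sub-request.
        yield StreamingInput(
            prompt=" fox jumps over",
            sampling_params=SamplingParams(max_tokens=8),
        )
        # Returning from this generator marks the session's input as complete.

    async for output in engine.generate(
        inputs(),                       # async generator instead of a single prompt
        SamplingParams(max_tokens=16),  # defaults for chunks that omit params
        request_id="session-0",
    ):
        print(output.outputs[0].text)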

Certain input types/parameters are not yet supported with streaming input:

  • pooling, prompt_embeds, n > 1, output_kind == FINAL_ONLY, and stop strings.

Original PR summary (superseded)

Design doc (please request access):
https://docs.google.com/document/d/16iE0pUsjdlfEdcghiCSLlicKFdvVtxn2aiic3kL9RVY/edit?usp=sharing

  • Enable low‑latency, interactive workloads (e.g., ASR) where inputs arrive incrementally.
  • Reuse vLLM’s request‑centric architecture and KV blocks, minimizing scheduling changes and avoiding re‑prefill.

High‑level design

  • A session is opened by the first request with streaming_sequence_id = 0 for a given request_id.
  • While the session is RUNNING, new updates are queued to the session's queue (streaming_queue).
  • When a chunk finishes, the session transitions to WAITING_FOR_STREAMING_REQ:
    • Queued updates are applied from streaming_queue in place and the session goes back to WAITING to decode the next chunk.
    • If the update sets close_session = True, we mark the session finished with stop_reason = "close_session" and free resources.
  • To keep KV and multimodal alignment correct, each updated session request sends the full prompt history. The unscheduled output token from the previous step is pruned before appending new prompt tokens (this preserves num_new_tokens correctness and KV alignment).
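
A pseudocode sketch of these transitions in the original design is shown below. It is illustrative only: streaming_queue is assumed to behave like a deque, the in-place prompt update is elided to comments, and the real scheduler handles considerably more state.

from vllm.v1.request import Request, RequestStatus


def on_chunk_finished(session: Request) -> None:
    # Illustrative only: the session's current chunk has stopped decoding.
    session.status = RequestStatus.WAITING_FOR_STREAMING_REQ
    if not session.streaming_queue:
        # No update queued yet; wait for the next streaming request.
        return
    update = session.streaming_queue.popleft()
    if update.close_session:
        # The client closed the session: finish with stop_reason "close_session"
        # and let the scheduler free its resources.
        session.status = RequestStatus.FINISHED_STOPPED
        session.stop_reason = "close_session"
    else:
        # Apply the queued update in place: the request resends the full prompt
        # history with the previous unscheduled output token pruned, then goes
        # back to WAITING to decode the next chunk.
        session.status = RequestStatus.WAITING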

Test Plan

pytest tests/v1/streaming/test_streaming_scheduler.py
pytest tests/v1/streaming/test_streaming_async_llm.py
pytest tests/v1/streaming/test_streaming_gpu_model_runner.py

Test Result

All tests pass.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.



Note

Enables multi-turn, session-based streaming with resumable updates and correct KV/MM alignment across the stack.

  • Add StreamingInput and new APIs AsyncLLM.generate_streaming and AsyncLLM.generate_from_stream; update generate/add_request to support resumable and reuse request queues
  • Extend Request with resumable and streaming_queue plus StreamingUpdate; add RequestStatus.WAITING_FOR_STREAMING_REQ
  • Scheduler: queue streaming updates, _update_request_as_session (prunes last unscheduled output, merges mm offsets, updates tokens/params), propagate resumable, exclude waiting-for-streaming from unfinished counts, and build NewRequestData via _make_new_request_data with copied prompt_token_ids to avoid aliasing
  • GPUModelRunner: handle updated streaming sessions in scheduled_new_reqs via _update_streaming_request (remove from InputBatch, refresh fields, clear output_token_ids)
  • OutputProcessor: allow in-place streaming state updates, only free request when finished and not resumable
  • Minor: compute prompt/encoder properties via accessors; plumb resumable into EngineCoreOutput
  • Add targeted unit tests for AsyncLLM, Scheduler (including aliasing bug prevention), GPUModelRunner, and RequestStatus

Written by Cursor Bugbot for commit e38b36d. This will update automatically on new commits. Configure here.


Note

Introduces resumable, session-based streaming that queues incremental inputs and preserves KV/MM alignment end-to-end.

  • Add StreamingInput and extend AsyncLLM.generate to accept async generators; plumb resumable through add_request and reuse per-request RequestOutputCollector
  • Extend Request with resumable, streaming_queue, and StreamingUpdate; add RequestStatus.WAITING_FOR_STREAMING_REQ and expose it in __str__
  • Scheduler: queue and apply streaming updates via _update_request_as_session (prunes last unscheduled output, merges MM offsets, updates tokens/params), propagate resumable, exclude waiting-for-streaming from unfinished counts, and build NewRequestData via _make_new_request_data using a copied prompt_token_ids to avoid aliasing
  • GPUModelRunner: handle updated streaming sessions in scheduled_new_reqs via _update_streaming_request (remove from InputBatch, refresh fields, clear output_token_ids)
  • OutputProcessor: support in-place streaming state updates and only free requests when finished and not resumable; propagate resumable in EngineCoreOutput
  • Minor: compute prompt/encoder properties via accessors; add unit tests covering AsyncLLM streaming, scheduler lifecycle (incl. aliasing bug prevention), GPUModelRunner updates, and new status

Written by Cursor Bugbot for commit eba8018. This will update automatically on new commits. Configure here.

@mergify mergify bot added v1 tpu Related to Google TPUs labels Nov 19, 2025
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@ywang96 ywang96 self-assigned this Nov 19, 2025
@njhill njhill changed the title [Feature] add session based streaming support to v1 [Feature] add session based streaming input support to v1 Nov 19, 2025
@mergify

mergify bot commented Nov 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joshuadeng.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

RequestStatus.FINISHED_LENGTH_CAPPED: FinishReason.LENGTH,
RequestStatus.FINISHED_ABORTED: FinishReason.ABORT,
RequestStatus.FINISHED_IGNORED: FinishReason.LENGTH,
RequestStatus.WAITING_FOR_STREAMING_REQ: FinishReason.STOP,
Member

Why do we count RequestStatus.WAITING_FOR_STREAMING_REQ as a finished state?

Contributor Author

We need to provide a finish reason when creating EngineCoreOutput, otherwise it would be None. The client expects a finish reason to know that generation has stopped (temporarily, for streaming), so it can consume the output chunk and won't hang.

Comment on lines 442 to 453
def _update_streaming_request_state(
self, request: EngineCoreRequest, prompt: str | None
) -> None:
req_state = self.request_states[request.request_id]
if req_state.prompt and prompt:
req_state.prompt += prompt
if req_state.prompt_token_ids is not None and request.prompt_token_ids:
req_state.prompt_token_ids.extend(request.prompt_token_ids)
req_state.prompt_embeds = request.prompt_embeds
if req_state.stats is not None:
req_state.stats.arrival_time = request.arrival_time
req_state.is_prefilling = True
Member

We need to raise an error if we are updating a non-streaming or close_session=True request.

data_parallel_rank=data_parallel_rank,
close_streaming_session=True,
)

Member

For non-streaming requests, do we default close_streaming_session to None or True? I personally prefer True and not distinguishing between streaming and non-streaming requests. We can view a non-streaming request as one where the user finished their input in the first round.

Contributor Author

Right now it's defaulted to None, per Patrick's comment, but I'll change it to True.

Collaborator

The reason I think None is better is that it cleanly declares close_streaming_session as irrelevant for the non-streaming use case (which I think it is). We can see non-streaming requests as "single-round" streaming requests, but I think we don't really do this at the moment (for example, because we're abstracting the scheduler into a specialized streaming_scheduler design that should never be used for non-streaming requests).

So streaming requests have to go through the new streaming_scheduler and are not allowed to call certain functionality from the non-streaming scheduler. By declaring the variable close_streaming_session "not used" with None, I think we have an easy mechanism to verify that nothing accidentally ends up in the wrong path.

For example, we can nicely assert everywhere in streaming_scheduler that close_streaming_session is not None; similarly, for functions used by both streaming and non-streaming paths, asserting close_streaming_session is None can be useful in statements/functions that are not relevant for streaming.

But obviously no strong opinion, and I'm not super familiar with the general design logic of vLLM.

Collaborator

Ah, I see now that you also proposed to merge the streaming_scheduler here: https://github.com/vllm-project/vllm/pull/28973/files#r2579527152 => if we're able to merge everything into one, then yeah, I agree.

Comment on lines 473 to 474
# For streaming sessions, generator completion is normal
return
Member

Can you elaborate a bit on the behavior here?

@ErickLuo90 ErickLuo90 Dec 2, 2025

This is needed because, on the caller side, we may have code like this:

output_generator = generate(...)
async for request_output in output_generator:
    client_handle_output(request_output)

# <-- exiting the loop here (before the generator is exhausted) triggers
#     GeneratorExit in the engine-side generator
do_more_things()

For non-streaming requests, it was previously fine to just call abort(request_id) on GeneratorExit, which frees everything. For a streaming session, we shouldn't abort(request_id), because the client hasn't closed the session yet.

Member

This change wouldn't be needed with the alternative proposed API

Comment on lines 1265 to 1280
def _handle_non_stopped(
self,
request: Request,
status_before_stop: RequestStatus,
mark_running_stopped: Callable[[Request], None],
model_runner_output: ModelRunnerOutput,
) -> None:
pass

def _handle_finished(
self,
finished_req_ids: set[str],
outputs: dict[int, list[EngineCoreOutput]],
) -> None:
pass

Member

What are these functions for?

Contributor Author

For our internal use case of streaming, we want to be able to mutate requests for sessions that are still running; this is handled by overriding _handle_non_stopped.


I guess we can remove this for now to avoid confusion?

Member

Yeah, also I think this seems to be a bit model-specific, and it's weird to have that kind of logic in the shared scheduler.

priority: int = 0

trace_headers: Mapping[str, str] | None = None
close_streaming_session: bool | None = None
Member

Nit: We could maybe call this field something like more_tokens_coming (probably also not the best name, but you get the point)? Then we can default this field to False for all requests and only set it to True for requests that have more streaming inputs coming.

Collaborator

Should we maybe re-use resumable, which was used in another PR (#25463) and which I think is quite intuitive?

from vllm.v1.request import Request, RequestStatus


class StreamingScheduler(Scheduler):
Member

What's the difficulty of merging StreamingScheduler and normal scheduler?

Contributor Author

  • For stopped requests (which we would normally finish and free), streaming instead sets the status to WAITING_FOR_STREAMING_REQ.
  • For streaming requests, we create NewRequestData with _all_token_ids (not just the prompt), so that the prompt field includes all past inputs (including decoded outputs). Updated streaming requests create new entries in InputBatch, so we need the full input history to ensure alignment.

These are the main two issues; however, I think if we leverage the more_tokens_coming field we can gate this logic from the non-streaming behavior and merge StreamingScheduler into the normal scheduler (a rough sketch of the gating follows below).
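
For illustration, the gating might look roughly like this; it is purely hypothetical, and the more_tokens_coming flag and _free_request name are placeholders standing in for the scheduler's normal finish-and-free path, not the merged implementation:

from vllm.v1.request import Request, RequestStatus


def handle_stopped_request(scheduler, request: Request) -> None:
    # Hypothetical single code path for streaming and non-streaming requests.
    if getattr(request, "more_tokens_coming", False):
        # Streaming session: park it until the next input chunk arrives,
        # instead of finishing and freeing the request.
        request.status = RequestStatus.WAITING_FOR_STREAMING_REQ
    else:
        # Non-streaming (or closed) session: finish and free as usual.
        scheduler._free_request(request)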

@mergify

mergify bot commented Dec 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joshuadeng.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 2, 2025
assert (
    session_request._all_token_ids[-1]
    == session_request._output_token_ids[-1]
)
del session_request._all_token_ids[-1]
Contributor Author

I also want to highlight this behavior. It removes the last output token (which hasn't been scheduled yet) from _all_token_ids, as the new request's prompt tokens will replace it. The last (unscheduled) output token thus never goes into the KV cache, since it is replaced by the streaming update request's prompt, but it is still returned to the client.

We needed this behavior for our internal use case, but we can handle it another way for OSS, i.e. make the new scheduled tokens the last output token + the new prompt tokens.

@joshuadeng joshuadeng requested a review from njhill January 13, 2026 19:54
@simon-mo simon-mo mentioned this pull request Jan 16, 2026
64 tasks
@njhill
Member

njhill commented Jan 16, 2026

Thanks @joshuadeng for all of the updates. I've some more in another branch based on top of this one, which should address the remaining concerns I had. Here is a summary:

  • Pushed down the streaming input logic into the AsyncLLM.add_request method, so that generate itself remains largely unchanged.
  • Removed the resumable flag completely from the external API; input stream completion is now communicated by just finishing/closing the input generator, including if it's garbage collected.
  • To handle outputs having inconsistent metadata, I've introduced a queue in the output processor too, so that the updates happen when each sub-request completes.
  • To address the issue related to stop checking happening in the output processor, I've blocked the use of stop strings with streaming input requests for now.
  • Also blocked for n > 1 case, pooling models and output_kind == FINAL_ONLY since they will require some additional consideration I think.
  • Removed the num_prompt_tokens @property methods (update the field when num tokens change rather than having the logic in the getter).

Other notes:

  • I haven't tested this yet (!) nor updated your tests.
  • I haven't looked closely at all the metrics/tracing aspects. This isn't a blocker from my pov but we may want to make sure various request-level metrics make sense and are recorded in a coherent way for streaming input reqs.
  • I think the current logic is sound w.r.t. preemption, but not optimal in that we won't preempt streaming-input requests in WAITING_FOR_STREAMING_REQ state. These should probably be high on the preemption list given they are (temporarily) idle. We may also want different resumption logic so that they can be re-prefilled if kvcache space frees up, even if the next input chunk still hasn't arrived. In any case I don't consider that a blocker either.

Please take a look and let me know what you think, in the meantime I'll try to test too.

@njhill
Member

njhill commented Jan 20, 2026

@joshuadeng I have now tested/fixed the changes, and added a bunch of e2e tests (in that same branch).

@joshuadeng
Contributor Author

joshuadeng commented Jan 20, 2026

@joshuadeng I have now tested/fixed the changes, and added a bunch of e2e tests (in that same branch).

Thanks, this is great! Will take a closer look and merge your changes if it all looks good.

@joshuadeng
Contributor Author

@njhill looks good overall, left a comment in joshuadeng#1

@mergify

mergify bot commented Jan 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joshuadeng.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 22, 2026
@mergify mergify bot removed the needs-rebase label Jan 23, 2026
njhill and others added 10 commits January 23, 2026 14:58
and don't support prompt_embeds with input streaming for now

Member

@njhill njhill left a comment

@joshuadeng @patrickvonplaten @zhuohan123 @ErickLuo90 @ywang96 thanks for all of the work / input / reviews of this! And thanks for the patience while iterating.

I've updated the PR summary above with the final design (though we expect there will be follow-on changes, especially w.r.t. handling of generated tokens).

@njhill njhill merged commit 91601ff into vllm-project:main Jan 24, 2026
51 checks passed
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
…ct#28973)

Signed-off-by: Joshua Deng <[email protected]>
Signed-off-by: Patrick von Platen <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: 陈建华 <[email protected]>
Josephasafg pushed a commit to Josephasafg/vllm that referenced this pull request Jan 27, 2026
…ct#28973)

Signed-off-by: Joshua Deng <[email protected]>
Signed-off-by: Patrick von Platen <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Josephasafg <[email protected]>
rayleeku pushed a commit to rayleeku/vllm_sparse_video that referenced this pull request Feb 2, 2026
…ct#28973)

Signed-off-by: Joshua Deng <[email protected]>
Signed-off-by: Patrick von Platen <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: rayleeku <[email protected]>
qianlihuang pushed a commit to qianlihuang/vllm that referenced this pull request Feb 3, 2026
…ct#28973)

Signed-off-by: Joshua Deng <[email protected]>
Signed-off-by: Patrick von Platen <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Nick Hill <[email protected]>