[Bugfix] Reuse metrics to fix the API Server token statistics in Stream Response (#1301)
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39ed743b36
vllm_omni/entrypoints/async_omni.py (outdated)

```python
self._assign_output_metrics(
    output_to_yield=output_to_yield,
    metrics=metrics,
    request_id=request_id,
    stage_id=stage_id,
```
Assign output metrics in sequential processing path
This metrics assignment is only wired into _process_async_results; when async_chunk is disabled (the default in vllm_omni/config/model.py and common stage configs), _process_sequential_results yields OmniRequestOutput objects without calling _assign_output_metrics, so omni_res.metrics stays empty and the new metrics field in chat completion responses remains unset. In practice, the token statistics fix in this commit is skipped for the standard non-async execution path.
Force-pushed 39ed743 to 6e81d19 (compare)
Pull request overview
This PR adds support for propagating per-stage output metrics (e.g., token counts and stage metadata) from the backend orchestrator through to the OpenAI-compatible chat completion responses (streaming and non-streaming), enabling downstream consumers to access richer observability data.
Changes:
- Attach per-stage metrics to `OmniRequestOutput` when a stage finishes in `AsyncOmni`.
- Extend OpenAI-protocol response models to include an optional `metrics` field.
- Propagate metrics through streaming and full chat completion responses, and update benchmark parsing to read token counts from `metrics`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_omni/entrypoints/openai/serving_chat.py | Captures metrics from omni outputs and includes them in streaming/full chat completion responses via Omni response models. |
| vllm_omni/entrypoints/openai/protocol/chat_completion.py | Adds metrics to Omni chat completion response models. |
| vllm_omni/entrypoints/async_omni.py | Adds _assign_output_metrics to extract stage metrics and attach them to yielded outputs. |
| vllm_omni/benchmarks/patch/patch.py | Updates benchmark stream parsing to read output tokens from metrics. |
```python
choices=[choice_data],
model=model_name,
modality=final_output_type,
metrics=final_metrics,
)
```
metrics=final_metrics is always passed into streamed chunks, and model_dump_json(exclude_unset=True) will then serialize it as "metrics": null for every chunk until metrics become available. To avoid noisy/possibly breaking payload changes, only set the metrics field when final_metrics is not None (or switch to exclude_none=True for these chunk dumps if that won’t affect other fields).
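One way to implement this suggestion (a hypothetical sketch, not the PR's actual code): assemble the keyword arguments first and only add `metrics` when it has a value, so `model_dump_json(exclude_unset=True)` omits the field entirely on intermediate chunks.

```python
def build_chunk_kwargs(choices, model_name, modality, final_metrics=None):
    """Build the kwargs for a streamed chunk, including 'metrics' only when
    it is available, so the field stays unset (not null) until then.
    All names here are illustrative stand-ins for the serving_chat code."""
    kwargs = {"choices": choices, "model": model_name, "modality": modality}
    if final_metrics is not None:
        kwargs["metrics"] = final_metrics
    return kwargs
```

Constructing the response model via `**build_chunk_kwargs(...)` then leaves `metrics` genuinely unset for `exclude_unset=True` serialization.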
```python
prompt_logprobs=prompt_logprobs,
prompt_token_ids=prompt_token_ids,
kv_transfer_params=kv_transfer_params,
metrics=response_metrics,
)
```
The non-streaming response always sets metrics=response_metrics even when response_metrics is None. The API server serializes responses with model_dump(... ) (without exclude_none=True), so this will add a persistent "metrics": null field to all non-stream chat completions. Consider omitting the metrics field entirely when it’s not available to keep the response schema stable.
vllm_omni/benchmarks/patch/patch.py (outdated)

```diff
-elif usage := data.get("usage"):
-    output.output_tokens = usage.get("completion_tokens")
+if current_metrics := data.get("metrics"):
+    output.output_tokens = current_metrics.get("num_tokens_out")
```
This switches benchmark token counting from usage.completion_tokens to metrics.num_tokens_out only. For compatibility with servers/requests that don’t emit metrics (or for non-text modalities), it would be safer to keep a fallback to usage.completion_tokens when metrics is missing/empty so benchmark output token accounting remains correct.
```diff
-output.output_tokens = current_metrics.get("num_tokens_out")
+num_tokens_out = current_metrics.get("num_tokens_out")
+if num_tokens_out is not None:
+    output.output_tokens = num_tokens_out
+elif usage := data.get("usage"):
+    completion_tokens = usage.get("completion_tokens")
+    if completion_tokens is not None:
+        output.output_tokens = completion_tokens
```
```python
if omni_res.metrics:
    final_metrics = omni_res.metrics
```
The new metrics propagation path (final_metrics capture and inclusion in streamed responses) should have a unit/integration test to ensure (1) metrics appear when a stage finishes, and (2) metrics are not emitted as null/empty on intermediate chunks. There are already unit tests for OmniOpenAIServingChat in this repo, so adding coverage here would help prevent regressions in the OpenAI-compatible response schema.
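The streaming behavior under test can be reduced to a small helper. `collect_final_metrics` below is a hypothetical stand-in that uses plain dicts instead of the real response models, but it captures the two cases the comment asks to cover: metrics appear once a stage finishes, and empty/None metrics on intermediate chunks are never propagated.

```python
def collect_final_metrics(chunks):
    """Mimic the streaming loop: remember the last non-empty metrics seen.
    'chunks' is a list of dicts standing in for OmniRequestOutput objects."""
    final_metrics = None
    for omni_res in chunks:
        if omni_res.get("metrics"):
            final_metrics = omni_res["metrics"]
    return final_metrics
```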
Please provide the benchmark running results.
vllm_omni/entrypoints/async_omni.py (outdated)

```python
all_stages_finished[stage_id] = finished

if output_to_yield:
    self._assign_output_metrics(
```
Is it possible to move this function to the _process_single_result function?
Test result: executed the command and obtained the result below; everything runs smoothly.

The method for calculating TPOT is incorrect.

How can we obtain the talker's TPOT?
Force-pushed 9e467fb to 76e72cf (compare)
Propagate per-request output metrics through the Omni pipeline and include them in OpenAI-compatible responses. Added AsyncOmni._assign_output_metrics to attach stage metrics (num_tokens_in/out, stage_id, final_output_type) to OmniRequestOutput when a stage finishes and the final output is text. Extended protocol types with metrics fields (OmniChatCompletionStreamResponse, OmniChatCompletionResponse) and updated serving_chat to collect final_metrics from generator outputs and include them in both streaming/usage chunks and the final chat response. Also adjusted imports and types to use the new Omni response classes. Signed-off-by: John Liu BUAA <[email protected]>
Force-pushed 76e72cf to f5312af (compare)
Signed-off-by: John Liu BUAA <[email protected]>
Delete the _assign_output_metrics method from AsyncOmni. The removed code previously inspected OrchestratorAggregator stage events to populate output_to_yield.metrics for finished requests; metrics are no longer assigned here, simplifying the class and leaving metrics handling to other parts of the codebase. Signed-off-by: John Liu BUAA <[email protected]>
The latest version was benchmarked with both async_chunk and non-async-chunk; the statistics were approved by @amy-why-3459 and @yenuo26. Test results: Non-async-chunk:
Gaohan123 left a comment:
Please supplement a single UT
```python
elif usage := data.get("usage"):
    output.output_tokens = usage.get("completion_tokens")
if metrics := data.get("metrics"):
```
Set default values to avoid possible error
This pattern is valid: the walrus operator `:=` first evaluates `data.get("metrics")` and then binds the result to `metrics`. If the key is absent, the expression yields `None`, which is falsy, so the `if` body is simply skipped.
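A minimal illustration of this pattern (hypothetical helper, not the benchmark code itself):

```python
def read_tokens_out(data: dict):
    """Return num_tokens_out when a metrics dict is present, else None.
    The walrus operator evaluates data.get('metrics') once and binds it;
    a missing key yields None (falsy), so the if-body never runs."""
    if metrics := data.get("metrics"):
        return metrics.get("num_tokens_out", 0)
    return None
```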
```diff
 tpot = 0
 if output_len > 1:
-    latency_minus_ttft = outputs[i].latency - outputs[i].ttft
+    latency_minus_ttft = outputs[i].text_latency - outputs[i].ttft
```
This attribute is pre-defined in the struct with a default value of 0.
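For reference, the TPOT (time per output token) computation implied by the diff above can be sketched as a standalone function. The field names `text_latency` and `ttft` follow the PR; `compute_tpot` itself is a hypothetical helper.

```python
def compute_tpot(text_latency: float, ttft: float, output_len: int) -> float:
    """TPOT excludes the first token: decode time divided by the number of
    tokens generated after the first one. Returns 0.0 for single-token outputs,
    matching the tpot = 0 default in the benchmark code above."""
    if output_len > 1:
        return (text_latency - ttft) / (output_len - 1)
    return 0.0
```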
Avoid returning None when the metrics key is missing by defaulting num_tokens_out to 0. This ensures downstream code that expects a numeric value (e.g., for aggregation or arithmetic) won't error when the metric is absent. Signed-off-by: John Liu BUAA <[email protected]>
Force-pushed 8cdb6c7 to 68435f9 (compare)
Introduce comprehensive unit tests for async_request_openai_chat_omni_completions and MixRequestFuncOutput. The new tests cover output_tokens handling (including missing and multiple metric updates, mixed modalities), text_latency behavior and consistency (initialization, updates across chunks, audio-only and mixed modalities), and basic initialization of MixRequestFuncOutput. Includes MockResponse and create_sse_chunk helpers to simulate SSE streaming responses. Signed-off-by: John Liu BUAA <[email protected]>
Force-pushed 68435f9 to 671c7f1 (compare)
How long does this test take?

0.08s
…Stream Response (vllm-project#1301) Signed-off-by: John Liu BUAA <[email protected]>
This pull request introduces comprehensive unit tests for the `output_tokens` and `text_latency` attributes in the patching logic for OpenAI chat completions, and makes several improvements to the metric calculation and patch implementation. The changes ensure that these attributes are correctly initialized, assigned, and tested for various response scenarios, including mixed text/audio modalities and the presence or absence of metrics data. Additionally, the calculation of time per output token (TPOT) is updated to use the new `text_latency` attribute for more accurate benchmarking.

Testing improvements:
- Added `test_patch_output_tokens.py` with unit tests to verify correct assignment and initialization of the `output_tokens` attribute in various scenarios, including when metrics are present, absent, or incomplete, and with mixed audio/text responses.
- Added `test_text_latency.py` with unit tests to ensure the `text_latency` attribute is present, correctly initialized, and properly updated for text and mixed modality responses, and to check its relationship with `ttft` and `latency`.

Patch implementation enhancements:
- Added `text_latency: float = 0.0` to the `MixRequestFuncOutput` class, ensuring the attribute is always present and initialized.
- Set `output.text_latency` to the time elapsed since the start when a text chunk is received, providing a consistent measure of text response latency.
- Updated `output.output_tokens` to use the `metrics["num_tokens_out"]` value, defaulting to 0 if missing, instead of relying on the `usage` field.

Metrics calculation update:
- Updated the `calculate_metrics` function to use `outputs[i].text_latency` instead of `outputs[i].latency` when computing time per output token, aligning the metric with the new attribute and improving accuracy.