[Bugfix] Reuse metrics to fix the API Server token statistics in Stream Response (#1301)
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39ed743b36
vllm_omni/entrypoints/async_omni.py (outdated)

```python
self._assign_output_metrics(
    output_to_yield=output_to_yield,
    metrics=metrics,
    request_id=request_id,
    stage_id=stage_id,
```
Assign output metrics in sequential processing path
This metrics assignment is only wired into _process_async_results; when async_chunk is disabled (the default in vllm_omni/config/model.py and common stage configs), _process_sequential_results yields OmniRequestOutput objects without calling _assign_output_metrics, so omni_res.metrics stays empty and the new metrics field in chat completion responses remains unset. In practice, the token statistics fix in this commit is skipped for the standard non-async execution path.
Force-pushed 39ed743 to 6e81d19 (compare)
Pull request overview
This PR adds support for propagating per-stage output metrics (e.g., token counts and stage metadata) from the backend orchestrator through to the OpenAI-compatible chat completion responses (streaming and non-streaming), enabling downstream consumers to access richer observability data.
Changes:
- Attach per-stage metrics to `OmniRequestOutput` when a stage finishes in `AsyncOmni`.
- Extend OpenAI-protocol response models to include an optional `metrics` field.
- Propagate metrics through streaming and full chat completion responses, and update benchmark parsing to read token counts from `metrics`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_omni/entrypoints/openai/serving_chat.py | Captures metrics from omni outputs and includes them in streaming/full chat completion responses via Omni response models. |
| vllm_omni/entrypoints/openai/protocol/chat_completion.py | Adds metrics to Omni chat completion response models. |
| vllm_omni/entrypoints/async_omni.py | Adds _assign_output_metrics to extract stage metrics and attach them to yielded outputs. |
| vllm_omni/benchmarks/patch/patch.py | Updates benchmark stream parsing to read output tokens from metrics. |
```python
choices=[choice_data],
model=model_name,
modality=final_output_type,
metrics=final_metrics,
)
```
metrics=final_metrics is always passed into streamed chunks, and model_dump_json(exclude_unset=True) will then serialize it as "metrics": null for every chunk until metrics become available. To avoid noisy/possibly breaking payload changes, only set the metrics field when final_metrics is not None (or switch to exclude_none=True for these chunk dumps if that won’t affect other fields).
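One way to implement this suggestion (a hypothetical sketch, not the PR's actual code): assemble the keyword arguments first and only add `metrics` when it has a value, so `model_dump_json(exclude_unset=True)` omits the field entirely on intermediate chunks.

```python
def build_chunk_kwargs(choices, model_name, modality, final_metrics=None):
    """Build the kwargs for a streamed chunk, including 'metrics' only when
    it is available, so the field stays unset (not null) until then.
    All names here are illustrative stand-ins for the serving_chat code."""
    kwargs = {"choices": choices, "model": model_name, "modality": modality}
    if final_metrics is not None:
        kwargs["metrics"] = final_metrics
    return kwargs
```

Constructing the response model via `**build_chunk_kwargs(...)` then leaves `metrics` genuinely unset for `exclude_unset=True` serialization.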
```python
prompt_logprobs=prompt_logprobs,
prompt_token_ids=prompt_token_ids,
kv_transfer_params=kv_transfer_params,
metrics=response_metrics,
)
```
The non-streaming response always sets metrics=response_metrics even when response_metrics is None. The API server serializes responses with model_dump(... ) (without exclude_none=True), so this will add a persistent "metrics": null field to all non-stream chat completions. Consider omitting the metrics field entirely when it’s not available to keep the response schema stable.
vllm_omni/benchmarks/patch/patch.py (outdated)

```diff
-elif usage := data.get("usage"):
-    output.output_tokens = usage.get("completion_tokens")
+if current_metrics := data.get("metrics"):
+    output.output_tokens = current_metrics.get("num_tokens_out")
```
This switches benchmark token counting from usage.completion_tokens to metrics.num_tokens_out only. For compatibility with servers/requests that don’t emit metrics (or for non-text modalities), it would be safer to keep a fallback to usage.completion_tokens when metrics is missing/empty so benchmark output token accounting remains correct.
```diff
-output.output_tokens = current_metrics.get("num_tokens_out")
+num_tokens_out = current_metrics.get("num_tokens_out")
+if num_tokens_out is not None:
+    output.output_tokens = num_tokens_out
+elif usage := data.get("usage"):
+    completion_tokens = usage.get("completion_tokens")
+    if completion_tokens is not None:
+        output.output_tokens = completion_tokens
```
```python
if omni_res.metrics:
    final_metrics = omni_res.metrics
```
The new metrics propagation path (final_metrics capture and inclusion in streamed responses) should have a unit/integration test to ensure (1) metrics appear when a stage finishes, and (2) metrics are not emitted as null/empty on intermediate chunks. There are already unit tests for OmniOpenAIServingChat in this repo, so adding coverage here would help prevent regressions in the OpenAI-compatible response schema.
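The streaming behavior under test can be reduced to a small helper. `collect_final_metrics` below is a hypothetical stand-in that uses plain dicts instead of the real response models, but it captures the two cases the comment asks to cover: metrics appear once a stage finishes, and empty/None metrics on intermediate chunks are never propagated.

```python
def collect_final_metrics(chunks):
    """Mimic the streaming loop: remember the last non-empty metrics seen.
    'chunks' is a list of dicts standing in for OmniRequestOutput objects."""
    final_metrics = None
    for omni_res in chunks:
        if omni_res.get("metrics"):
            final_metrics = omni_res["metrics"]
    return final_metrics
```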
Please provide the benchmark running results.
vllm_omni/entrypoints/async_omni.py (outdated)

```python
all_stages_finished[stage_id] = finished

if output_to_yield:
    self._assign_output_metrics(
```
Is it possible to move this function to the _process_single_result function?
Test result: executed the command and obtained the result below; everything runs smoothly.

The method for calculating TPOT is incorrect.

How can we obtain the talker's TPOT?
Force-pushed 9e467fb to 76e72cf (compare)
Propagate per-request output metrics through the Omni pipeline and include them in OpenAI-compatible responses. Added AsyncOmni._assign_output_metrics to attach stage metrics (num_tokens_in/out, stage_id, final_output_type) to OmniRequestOutput when a stage finishes and the final output is text. Extended protocol types with metrics fields (OmniChatCompletionStreamResponse, OmniChatCompletionResponse) and updated serving_chat to collect final_metrics from generator outputs and include them in both streaming/usage chunks and the final chat response. Also adjusted imports and types to use the new Omni response classes. Signed-off-by: John Liu BUAA <[email protected]>
Force-pushed 76e72cf to f5312af (compare)
Signed-off-by: John Liu BUAA <[email protected]>
Delete the _assign_output_metrics method from AsyncOmni. The removed code previously inspected OrchestratorAggregator stage events to populate output_to_yield.metrics for finished requests; metrics are no longer assigned here, simplifying the class and leaving metrics handling to other parts of the codebase. Signed-off-by: John Liu BUAA <[email protected]>
The latest version was benchmarked with both async_chunk and non-async-chunk; the statistics were approved by @amy-why-3459 and @yenuo26. Test results: Non-async-chunk:
Gaohan123 left a comment:
Please supplement a single UT
```python
elif usage := data.get("usage"):
    output.output_tokens = usage.get("completion_tokens")
if metrics := data.get("metrics"):
```
Set default values to avoid possible error
This pattern is valid: the walrus operator `:=` first evaluates `data.get("metrics")` and then binds the result to `metrics`. If the key is absent, the expression yields `None`, which is falsy, so the `if` body is simply skipped.
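A minimal illustration of this pattern (hypothetical helper, not the benchmark code itself):

```python
def read_tokens_out(data: dict):
    """Return num_tokens_out when a metrics dict is present, else None.
    The walrus operator evaluates data.get('metrics') once and binds it;
    a missing key yields None (falsy), so the if-body never runs."""
    if metrics := data.get("metrics"):
        return metrics.get("num_tokens_out", 0)
    return None
```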
```diff
 tpot = 0
 if output_len > 1:
-    latency_minus_ttft = outputs[i].latency - outputs[i].ttft
+    latency_minus_ttft = outputs[i].text_latency - outputs[i].ttft
```
This attribute is pre-defined in the struct with a default value of 0.
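For reference, the TPOT (time per output token) computation implied by the diff above can be sketched as a standalone function. The field names `text_latency` and `ttft` follow the PR; `compute_tpot` itself is a hypothetical helper.

```python
def compute_tpot(text_latency: float, ttft: float, output_len: int) -> float:
    """TPOT excludes the first token: decode time divided by the number of
    tokens generated after the first one. Returns 0.0 for single-token outputs,
    matching the tpot = 0 default in the benchmark code above."""
    if output_len > 1:
        return (text_latency - ttft) / (output_len - 1)
    return 0.0
```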
Avoid returning None when the metrics key is missing by defaulting num_tokens_out to 0. This ensures downstream code that expects a numeric value (e.g., for aggregation or arithmetic) won't error when the metric is absent. Signed-off-by: John Liu BUAA <[email protected]>
Force-pushed 8cdb6c7 to 68435f9 (compare)
Introduce comprehensive unit tests for async_request_openai_chat_omni_completions and MixRequestFuncOutput. The new tests cover output_tokens handling (including missing and multiple metric updates, mixed modalities), text_latency behavior and consistency (initialization, updates across chunks, audio-only and mixed modalities), and basic initialization of MixRequestFuncOutput. Includes MockResponse and create_sse_chunk helpers to simulate SSE streaming responses. Signed-off-by: John Liu BUAA <[email protected]>
Force-pushed 68435f9 to 671c7f1 (compare)
How long does this test take?

0.08s
…Stream Response (vllm-project#1301) Signed-off-by: John Liu BUAA <[email protected]>
This pull request introduces comprehensive unit tests for the `output_tokens` and `text_latency` attributes in the patching logic for OpenAI chat completions, and makes several improvements to the metric calculation and patch implementation. The changes ensure that these attributes are correctly initialized, assigned, and tested for various response scenarios, including mixed text/audio modalities and the presence or absence of metrics data. Additionally, the calculation of time per output token (TPOT) is updated to use the new `text_latency` attribute for more accurate benchmarking.

Testing improvements:
- Added `test_patch_output_tokens.py` with unit tests to verify correct assignment and initialization of the `output_tokens` attribute in various scenarios, including when metrics are present, absent, or incomplete, and with mixed audio/text responses.
- Added `test_text_latency.py` with unit tests to ensure the `text_latency` attribute is present, correctly initialized, and properly updated for text and mixed modality responses, and to check its relationship with `ttft` and `latency`.

Patch implementation enhancements:
- Added `text_latency: float = 0.0` to the `MixRequestFuncOutput` class, ensuring the attribute is always present and initialized.
- Set `output.text_latency` to the time elapsed since the start when a text chunk is received, providing a consistent measure of text response latency.
- Updated `output.output_tokens` to use the `metrics["num_tokens_out"]` value, defaulting to 0 if missing, instead of relying on the `usage` field.

Metrics calculation update:
- Updated the `calculate_metrics` function to use `outputs[i].text_latency` instead of `outputs[i].latency` when computing time per output token, aligning the metric with the new attribute and improving accuracy.