[Bugfix] reused metrics to modify the API Server token statistics in Stream Response#1301

Merged
Gaohan123 merged 12 commits into vllm-project:main from kechengliu97:lkc-usage-bugfix
Feb 12, 2026

Conversation

@kechengliu97
Contributor

@kechengliu97 kechengliu97 commented Feb 10, 2026

This pull request adds comprehensive unit tests for the output_tokens and text_latency attributes in the patching logic for OpenAI chat completions, and improves the metric calculation and patch implementation. The changes ensure these attributes are correctly initialized, assigned, and tested across response scenarios, including mixed text/audio modalities and the presence or absence of metrics data. Additionally, the time per output token (TPOT) calculation now uses the new text_latency attribute for more accurate benchmarking.

Testing improvements:

  • Added test_patch_output_tokens.py with unit tests to verify correct assignment and initialization of the output_tokens attribute in various scenarios, including when metrics are present, absent, or incomplete, and with mixed audio/text responses.
  • Added test_text_latency.py with unit tests to ensure the text_latency attribute is present, correctly initialized, and properly updated for text and mixed modality responses, and to check its relationship with ttft and latency.

Patch implementation enhancements:

  • Added text_latency: float = 0.0 to the MixRequestFuncOutput class, ensuring the attribute is always present and initialized.
  • Updated the async request function to assign output.text_latency as the time elapsed since the start when a text chunk is received, providing a consistent measure of text response latency.
  • Changed the assignment of output.output_tokens to use the metrics["num_tokens_out"] value, defaulting to 0 if missing, instead of relying on the usage field.
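The three patch changes above can be sketched together as a small runnable example (a simplification of the PR's logic; the SSE parsing and async plumbing are omitted, and handle_stream_chunk is a hypothetical helper name):

```python
import time
from dataclasses import dataclass


@dataclass
class MixRequestFuncOutput:
    """Simplified stand-in for the benchmark output record."""
    generated_text: str = ""
    output_tokens: int = 0
    ttft: float = 0.0
    latency: float = 0.0
    text_latency: float = 0.0  # new in this PR: always present, initialized to 0.0


def handle_stream_chunk(output: MixRequestFuncOutput, data: dict, start_time: float) -> None:
    """Apply the patched per-chunk logic (hypothetical helper)."""
    if text := data.get("text"):
        output.generated_text += text
        # text_latency: elapsed time since request start at the latest text chunk
        output.text_latency = time.perf_counter() - start_time
    if metrics := data.get("metrics"):
        # token count now comes from metrics["num_tokens_out"], defaulting to 0,
        # instead of the usage field
        output.output_tokens = metrics.get("num_tokens_out", 0)
```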

Metrics calculation update:

  • Modified the calculate_metrics function to use outputs[i].text_latency instead of outputs[i].latency when computing time per output token (TPOT), aligning the metric with the new attribute and improving accuracy.
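Numerically, the updated TPOT computation amounts to the following (a sketch based on the PR description; the actual benchmark code is structured differently):

```python
def tpot_ms(text_latency_s: float, ttft_s: float, output_len: int) -> float:
    """Time per output token, excluding the first token, in milliseconds.

    Uses text_latency (elapsed time at the last text chunk) instead of the
    total request latency, so time spent generating audio after the text
    finishes no longer inflates TPOT for mixed text/audio responses.
    """
    if output_len <= 1:
        return 0.0
    return (text_latency_s - ttft_s) / (output_len - 1) * 1000.0
```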

@kechengliu97 kechengliu97 changed the title [Bugfix] reused metrics to modify the API Server token statistics [Bugfix] reused metrics to modify the API Server token statistics in Stream Response Feb 10, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39ed743b36


Comment on lines 405 to 409
self._assign_output_metrics(
    output_to_yield=output_to_yield,
    metrics=metrics,
    request_id=request_id,
    stage_id=stage_id,


P1: Assign output metrics in sequential processing path

This metrics assignment is only wired into _process_async_results; when async_chunk is disabled (the default in vllm_omni/config/model.py and common stage configs), _process_sequential_results yields OmniRequestOutput objects without calling _assign_output_metrics, so omni_res.metrics stays empty and the new metrics field in chat completion responses remains unset. In practice, the token statistics fix in this commit is skipped for the standard non-async execution path.


Copilot AI left a comment

Pull request overview

This PR adds support for propagating per-stage output metrics (e.g., token counts and stage metadata) from the backend orchestrator through to the OpenAI-compatible chat completion responses (streaming and non-streaming), enabling downstream consumers to access richer observability data.

Changes:

  • Attach per-stage metrics to OmniRequestOutput when a stage finishes in AsyncOmni.
  • Extend OpenAI-protocol response models to include an optional metrics field.
  • Propagate metrics through streaming and full chat completion responses, and update benchmark parsing to read token counts from metrics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
  • vllm_omni/entrypoints/openai/serving_chat.py: Captures metrics from omni outputs and includes them in streaming/full chat completion responses via Omni response models.
  • vllm_omni/entrypoints/openai/protocol/chat_completion.py: Adds metrics to Omni chat completion response models.
  • vllm_omni/entrypoints/async_omni.py: Adds _assign_output_metrics to extract stage metrics and attach them to yielded outputs.
  • vllm_omni/benchmarks/patch/patch.py: Updates benchmark stream parsing to read output tokens from metrics.


Comment on lines 1227 to 1231
    choices=[choice_data],
    model=model_name,
    modality=final_output_type,
    metrics=final_metrics,
)
Copilot AI Feb 10, 2026

metrics=final_metrics is always passed into streamed chunks, and model_dump_json(exclude_unset=True) will then serialize it as "metrics": null for every chunk until metrics become available. To avoid noisy/possibly breaking payload changes, only set the metrics field when final_metrics is not None (or switch to exclude_none=True for these chunk dumps if that won’t affect other fields).
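The conditional-field pattern this review suggests can be shown with a plain dict (a Pydantic-free sketch; the field names follow the PR):

```python
from typing import Any, Optional


def chunk_payload(model_name: str, final_metrics: Optional[dict]) -> dict[str, Any]:
    """Build a streamed chunk payload, setting 'metrics' only when it is
    available, so serialization never emits "metrics": null on every chunk."""
    payload: dict[str, Any] = {"model": model_name}
    if final_metrics is not None:
        payload["metrics"] = final_metrics
    return payload
```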

Comment on lines 1398 to 1402
    prompt_logprobs=prompt_logprobs,
    prompt_token_ids=prompt_token_ids,
    kv_transfer_params=kv_transfer_params,
    metrics=response_metrics,
)
Copilot AI Feb 10, 2026

The non-streaming response always sets metrics=response_metrics even when response_metrics is None. The API server serializes responses with model_dump(... ) (without exclude_none=True), so this will add a persistent "metrics": null field to all non-stream chat completions. Consider omitting the metrics field entirely when it’s not available to keep the response schema stable.

elif usage := data.get("usage"):
    output.output_tokens = usage.get("completion_tokens")
if current_metrics := data.get("metrics"):
    output.output_tokens = current_metrics.get("num_tokens_out")
Copilot AI Feb 10, 2026

This switches benchmark token counting from usage.completion_tokens to metrics.num_tokens_out only. For compatibility with servers/requests that don’t emit metrics (or for non-text modalities), it would be safer to keep a fallback to usage.completion_tokens when metrics is missing/empty so benchmark output token accounting remains correct.

Suggested change
-    output.output_tokens = current_metrics.get("num_tokens_out")
+    num_tokens_out = current_metrics.get("num_tokens_out")
+    if num_tokens_out is not None:
+        output.output_tokens = num_tokens_out
+    elif usage := data.get("usage"):
+        completion_tokens = usage.get("completion_tokens")
+        if completion_tokens is not None:
+            output.output_tokens = completion_tokens
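Run standalone, the suggested fallback behaves like this (output is a stand-in object here; the names follow the suggestion above):

```python
from types import SimpleNamespace


def update_output_tokens(output, data: dict) -> None:
    """Prefer metrics.num_tokens_out, falling back to usage.completion_tokens
    when metrics are missing or incomplete (runnable sketch of the suggestion)."""
    if current_metrics := data.get("metrics"):
        num_tokens_out = current_metrics.get("num_tokens_out")
        if num_tokens_out is not None:
            output.output_tokens = num_tokens_out
            return
    if usage := data.get("usage"):
        completion_tokens = usage.get("completion_tokens")
        if completion_tokens is not None:
            output.output_tokens = completion_tokens
```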

Comment on lines 691 to 692
if omni_res.metrics:
    final_metrics = omni_res.metrics
Copilot AI Feb 10, 2026

The new metrics propagation path (final_metrics capture and inclusion in streamed responses) should have a unit/integration test to ensure (1) metrics appear when a stage finishes, and (2) metrics are not emitted as null/empty on intermediate chunks. There are already unit tests for OmniOpenAIServingChat in this repo, so adding coverage here would help prevent regressions in the OpenAI-compatible response schema.

@yenuo26
Contributor

yenuo26 commented Feb 10, 2026

Please provide the benchmark running results.

all_stages_finished[stage_id] = finished

if output_to_yield:
    self._assign_output_metrics(
Contributor

@amy-why-3459 amy-why-3459 Feb 10, 2026

Is it possible to move this function to the _process_single_result function?

@kechengliu97
Contributor Author

Test Result

Executing the command below produces the result shown; everything runs smoothly.

(l30053556) root@huawei:/nvme1n1p1/l30053556/vllm-omni# vllm bench serve   --omni   --port 45699 --endpoint /v1/chat/completions   --backend openai-chat-omni   --model /nvme1n1p1/models/Qwen3-Omni-30B-A3B-Instruct   --dataset-name random   --num-prompts 2   --random-prefix-len 5   --random-input-len 100   --random-output-len 100   --percentile-metrics ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration   --ignore-eos
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function OmniBenchmarkServingSubcommand.cmd at 0x7fa470d4d300>, omni=True, seed=0, num_prompts=2, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, random_input_len=100, random_output_len=100, random_range_ratio=0.0, random_prefix_len=5, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai-chat-omni', base_url=None, host='127.0.0.1', port=45699, endpoint='/v1/chat/completions', header=None, max_concurrency=None, model='/nvme1n1p1/models/Qwen3-Omni-30B-A3B-Instruct', input_len=None, output_len=None, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, trust_remote_code=False, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=True, percentile_metrics='ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration', metric_percentiles='99', goodput=None, request_id_prefix='bench-ea156d8d-', top_p=None, top_k=None, min_p=None, temperature=None, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None)
INFO 02-10 12:38:35 [datasets.py:612] Sampling input_len from [100, 100] and output_len from [100, 100]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:34<00:00, 17.03s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     2         
Failed requests:                         0         
Benchmark duration (s):                  34.06     
Request throughput (req/s):              0.06      
Peak concurrent requests:                2.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          32506.64  
Median E2EL (ms):                        32506.64  
P99 E2EL (ms):                           34024.48  
================== Text Result ===================
Total input tokens:                      210       
Total generated tokens:                  200       
Output token throughput (tok/s):         5.87      
Peak output token throughput (tok/s):    34.00     
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          12.04     
---------------Time to First Token----------------
Mean TTFT (ms):                          128.04    
Median TTFT (ms):                        128.04    
P99 TTFT (ms):                           161.98    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          327.06    
Median TPOT (ms):                        327.06    
P99 TPOT (ms):                           342.73    
---------------Inter-token Latency----------------
Mean ITL (ms):                           61.24     
Median ITL (ms):                         60.41     
P99 ITL (ms):                            69.58     
================== Audio Result ==================
Total audio duration generated(s):       44.51     
Total audio frames generated:            1068330   
Audio throughput(audio duration/s):      1.31      
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          1.46      
Median AUDIO_RTF:                        1.46      
P99 AUDIO_RTF:                           1.47      
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    32410.63  
Median AUDIO_TTFP (ms):                  32410.63  
P99 AUDIO_TTFP (ms):                     33947.04  
------------------Audio Duration------------------
Mean AUDIO_DURATION (s):                 22.26     
Median AUDIO_DURATION (s):               22.26     
P99 AUDIO_DURATION (s):                  23.55     
==================================================

@amy-why-3459
Contributor

The method for calculating TPOT is incorrect.

@amy-why-3459
Contributor

How can we obtain the talker's TPOT?

Propagate per-request output metrics through the Omni pipeline and include them in OpenAI-compatible responses. Added AsyncOmni._assign_output_metrics to attach stage metrics (num_tokens_in/out, stage_id, final_output_type) to OmniRequestOutput when a stage finishes and the final output is text. Extended protocol types with metrics fields (OmniChatCompletionStreamResponse, OmniChatCompletionResponse) and updated serving_chat to collect final_metrics from generator outputs and include them in both streaming/usage chunks and the final chat response. Also adjusted imports and types to use the new Omni response classes.

Signed-off-by: John Liu BUAA <[email protected]>
Delete the _assign_output_metrics method from AsyncOmni. The removed code previously inspected OrchestratorAggregator stage events to populate output_to_yield.metrics for finished requests; metrics are no longer assigned here, simplifying the class and leaving metrics handling to other parts of the codebase.

Signed-off-by: John Liu BUAA <[email protected]>
@kechengliu97
Contributor Author

kechengliu97 commented Feb 12, 2026

The latest version was benchmarked with both async_chunk and non-async_chunk; the statistics were approved by @amy-why-3459 and @yenuo26.

Test results:

============ Serving Benchmark Result ============
Successful requests:                     2         
Failed requests:                         0         
Benchmark duration (s):                  13.87     
Request throughput (req/s):              0.14      
Peak concurrent requests:                2.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          13196.72  
Median E2EL (ms):                        13196.72  
P99 E2EL (ms):                           13856.47  
================== Text Result ===================
Total input tokens:                      210       
Total generated tokens:                  200       
Output token throughput (tok/s):         14.42     
Peak output token throughput (tok/s):    124.00    
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          29.56     
---------------Time to First Token----------------
Mean TTFT (ms):                          1158.24   
Median TTFT (ms):                        1158.24   
P99 TTFT (ms):                           1583.22   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.37     
Median TPOT (ms):                        21.37     
P99 TPOT (ms):                           25.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.16     
Median ITL (ms):                         15.65     
P99 ITL (ms):                            143.38    
================== Audio Result ==================
Total audio duration generated(s):       60.86     
Total audio frames generated:            1460640   
Audio throughput(audio duration/s):      4.39      
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.43      
Median AUDIO_RTF:                        0.43      
P99 AUDIO_RTF:                           0.44      
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    7619.68   
Median AUDIO_TTFP (ms):                  7619.68   
P99 AUDIO_TTFP (ms):                     12425.96  
------------------Audio Duration------------------
Mean AUDIO_DURATION (s):                 30.43     
Median AUDIO_DURATION (s):               30.43     
P99 AUDIO_DURATION (s):                  32.52     
==================================================

Non-async-chunk:

============ Serving Benchmark Result ============
Successful requests:                     2         
Failed requests:                         0         
Benchmark duration (s):                  38.59     
Request throughput (req/s):              0.05      
Peak concurrent requests:                2.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          34155.57  
Median E2EL (ms):                        34155.57  
P99 E2EL (ms):                           38501.78  
================== Text Result ===================
Total input tokens:                      210       
Total generated tokens:                  200       
Output token throughput (tok/s):         5.18      
Peak output token throughput (tok/s):    34.00     
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          10.62     
---------------Time to First Token----------------
Mean TTFT (ms):                          823.15    
Median TTFT (ms):                        823.15    
P99 TTFT (ms):                           1107.05   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          90.69     
Median TPOT (ms):                        90.69     
P99 TPOT (ms):                           93.19     
---------------Inter-token Latency----------------
Mean ITL (ms):                           89.78     
Median ITL (ms):                         60.96     
P99 ITL (ms):                            601.10    
================== Audio Result ==================
Total audio duration generated(s):       36.43     
Total audio frames generated:            874410    
Audio throughput(audio duration/s):      0.94      
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          1.89      
Median AUDIO_RTF:                        1.89      
P99 AUDIO_RTF:                           2.00      
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    34070.41  
Median AUDIO_TTFP (ms):                  34070.41  
P99 AUDIO_TTFP (ms):                     38418.92  
------------------Audio Duration------------------
Mean AUDIO_DURATION (s):                 18.22     
Median AUDIO_DURATION (s):               18.22     
P99 AUDIO_DURATION (s):                  21.59     
==================================================

Collaborator

@Gaohan123 Gaohan123 left a comment

Please add a single unit test.


elif usage := data.get("usage"):
    output.output_tokens = usage.get("completion_tokens")
if metrics := data.get("metrics"):
Collaborator

Set default values to avoid a possible error.

Contributor Author

fixed

Contributor Author

This usage is valid: := first evaluates the right-hand expression and binds the result to metrics. If no value is found, data.get returns None, which is falsy, so the branch simply does not execute.
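The := behaviour described here is plain Python and easy to check in isolation:

```python
data = {"metrics": {"num_tokens_out": 7}}

# := evaluates the right-hand side, binds the result to the name,
# and the if then tests that bound value for truthiness
if metrics := data.get("metrics"):
    tokens = metrics.get("num_tokens_out")
else:
    tokens = None

# when the key is absent, dict.get returns None (falsy), so the body is skipped
empty: dict = {}
fallback = "skipped"
if missing := empty.get("metrics"):
    fallback = "entered"
```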

tpot = 0
if output_len > 1:
-    latency_minus_ttft = outputs[i].latency - outputs[i].ttft
+    latency_minus_ttft = outputs[i].text_latency - outputs[i].ttft
Collaborator

The same

Contributor Author

fixed

Contributor Author

@kechengliu97 kechengliu97 Feb 12, 2026

This attribute is pre-defined in the struct with a default value of 0.

Avoid returning None when the metrics key is missing by defaulting num_tokens_out to 0. This ensures downstream code that expects a numeric value (e.g., for aggregation or arithmetic) won't error when the metric is absent.

Signed-off-by: John Liu BUAA <[email protected]>
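The defaulting described in this commit reduces to a one-liner with dict.get (a sketch; the chained "or {}" also guards against a null metrics object):

```python
def read_output_tokens(data: dict) -> int:
    """Read num_tokens_out from a chunk's metrics, defaulting to 0 so
    downstream code that expects a number never receives None."""
    metrics = data.get("metrics") or {}
    return metrics.get("num_tokens_out", 0)
```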
@kechengliu97 kechengliu97 force-pushed the lkc-usage-bugfix branch 5 times, most recently from 8cdb6c7 to 68435f9 on February 12, 2026 06:33
Introduce comprehensive unit tests for async_request_openai_chat_omni_completions and MixRequestFuncOutput. The new tests cover output_tokens handling (including missing and multiple metric updates, mixed modalities), text_latency behavior and consistency (initialization, updates across chunks, audio-only and mixed modalities), and basic initialization of MixRequestFuncOutput. Includes MockResponse and create_sse_chunk helpers to simulate SSE streaming responses.

Signed-off-by: John Liu BUAA <[email protected]>
Introduce comprehensive unit tests for async_request_openai_chat_omni_completions and MixRequestFuncOutput. The new tests cover output_tokens handling (including missing and multiple metric updates, mixed modalities), text_latency behavior and consistency (initialization, updates across chunks, audio-only and mixed modalities), and basic initialization of MixRequestFuncOutput. Includes MockResponse and create_sse_chunk helpers to simulate SSE streaming responses.

Signed-off-by: John Liu BUAA <[email protected]>
@congw729
Contributor

congw729 commented Feb 12, 2026

How long does this test take?

@kechengliu97
Contributor Author

How long does this test take?

0.08s

@Gaohan123 Gaohan123 added the ready label to trigger buildkite CI label Feb 12, 2026
Collaborator

@Gaohan123 Gaohan123 left a comment

LGTM. Thanks!

@Gaohan123 Gaohan123 merged commit f117a07 into vllm-project:main Feb 12, 2026
6 of 7 checks passed
@kechengliu97 kechengliu97 deleted the lkc-usage-bugfix branch February 12, 2026 09:27
YanickSchraner pushed a commit to YanickSchraner/vllm-omni that referenced this pull request Feb 20, 2026