[bugfix] support text + audio mixed output#843
[bugfix] support text + audio mixed output#843david6666666 merged 6 commits intovllm-project:mainfrom
Conversation
Signed-off-by: GG-li <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 966ebdf103
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # Filter outputs based on requested modalites | ||
| if requested_modalities is not None and omni_outputs.final_output_type not in requested_modalities: | ||
| logger.warning(f"final output type: {omni_outputs.final_output_type} is not needed by the request") | ||
| continue |
There was a problem hiding this comment.
Preserve usage when filtering modalities
For non‑streaming requests that set modalities to exclude text (e.g., [“audio”]), this filter skips the text omni_outputs, which is the only path that populates usage/prompt_token_ids/kv_transfer_params. The response then returns zero usage even though tokens were consumed, while streaming still reports prompt tokens, so clients that depend on usage will see a regression. Consider deriving usage from omni_outputs.request_output even when text output is filtered.
Useful? React with 👍 / 👎.
|
@fake0fan PTAL |
|
Okay, I understand. However, if I really want to output both audio and text, how should I configure it? Should I set it to |
OK, I understand |
I find no modifications are required; simply set "modalities": ["text","audio"]. Just like: |
|
LGTM |
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
As mentioned in #540 ,if we set output modality is audio, the output will be text + audio.
Test Plan
Server:
Curl:
Test Result
Output before fix:
Output After Fix :
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)