[Bugfix] Fix precision issues of Qwen3-Omni when async_chunk is enabled without a system prompt (#1288)
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c77d471ad7
Force-pushed from c77d471 to 35feba7
Pull request overview
This PR aims to fix precision issues in Qwen3-Omni when async_chunk is enabled and requests do not include a system prompt (per vllm-omni issue #1278), by adjusting how decode-time state (e.g., num_processed_tokens) is initialized and tracked.
Changes:
- Introduces a `decode_flag` to make a decode-time `num_processed_tokens` adjustment happen only once.
- Threads an `update_dict` through the decode preprocess call path (but currently with an implementation issue that drops state).
```diff
  if not info_dict.get("decode_flag", False):
      info_dict["num_processed_tokens"] = len(info_dict.get("thinker_input_ids", [])) + 1
      update_dict["decode_flag"] = True

  last_talker_hidden, text_step, update_dict = self.talker_preprocess_decode(
-     input_ids, input_embeds, **info_dict
+     input_ids, input_embeds, update_dict, **info_dict
  )
```
In the decode path you set update_dict["decode_flag"] = True to ensure the num_processed_tokens adjustment happens only once, but talker_preprocess_decode immediately reinitializes update_dict and the returned dict overwrites the caller’s update_dict. This drops decode_flag, so the adjustment will repeat on every decode step. Consider having talker_preprocess_decode mutate/extend the passed-in update_dict (or merge the caller’s entries into the returned dict) so decode_flag persists across decode iterations.
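The state-dropping problem described above can be sketched in isolation. The function and dict-key names below mirror the PR but are simplified stand-ins, not the real implementation: the point is that `talker_preprocess_decode` merges into the caller's `update_dict` instead of reinitializing it, so `decode_flag` survives across decode iterations.

```python
# Hypothetical sketch of the suggested fix; names mirror the PR but
# the bodies are simplified placeholders, not the actual model code.

def talker_preprocess_decode(update_dict, **info_dict):
    # Merge into a copy of the caller's dict instead of reinitializing it,
    # so flags such as decode_flag persist across decode iterations.
    merged = dict(update_dict)
    merged["num_processed_tokens"] = info_dict.get("num_processed_tokens", 0)
    return merged

def decode_step(info_dict):
    update_dict = {}
    # The num_processed_tokens adjustment must run only on the first step.
    if not info_dict.get("decode_flag", False):
        info_dict["num_processed_tokens"] = (
            len(info_dict.get("thinker_input_ids", [])) + 1
        )
        update_dict["decode_flag"] = True
    update_dict = talker_preprocess_decode(update_dict, **info_dict)
    info_dict.update(update_dict)
    return info_dict

state = {"thinker_input_ids": [1, 2, 3]}
state = decode_step(state)  # first step: adjustment runs, flag is set
state = decode_step(state)  # later steps: flag persists, no re-adjustment
```

With the reinitializing version, `decode_flag` would be lost after the merge and the `+ 1` adjustment would re-run on every step, which is the precision bug being fixed.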
```python
if not info_dict.get("decode_flag", False):
    info_dict["num_processed_tokens"] = len(info_dict.get("thinker_input_ids", [])) + 1
    update_dict["decode_flag"] = True
```
This change introduces new per-request decode state (decode_flag) that is specifically meant to fix async_chunk behavior when there is no system prompt, but there doesn’t appear to be a regression test covering the “async_chunk + no system message” case. There are existing Qwen3-Omni E2E tests that always include a system prompt; adding a variant without the system message would help prevent future regressions.
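A minimal sketch of the missing regression case: building the same request payload both with and without a system message. The helper name and prompt text are assumptions, not the repo's actual test utilities.

```python
# Illustrative only: the real E2E tests use the repo's own request builders.

def build_messages(user_text, system_prompt=None):
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return messages

# Shape covered by the existing Qwen3-Omni E2E tests:
with_system = build_messages("hello", system_prompt="You are a helpful assistant.")
# Shape that triggers the async_chunk precision bug (no system message):
without_system = build_messages("hello")
```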
Force-pushed from 4771eb7 to 8f79a81
Could you also attach some test results?
Gaohan123 left a comment
I wonder what the scenario is for having no system prompt? It seems Qwen-Omni is unstable without one.

A user may forget to send the request with a system prompt.
Force-pushed from 6b93c07 to 9e3c791
```python
from vllm.assets.audio import AudioAsset
from vllm.utils.argparse_utils import FlexibleArgumentParser

# Modify OpenAI's API key and API base to use vLLM's API server.
```
Why do you need to modify this?
Because the port is hard-coded here. I think it's better to let users choose their own port, so I added an argument for it; the default is 8091, same as before. I always run into port conflict problems ...
Maybe it's better to update all of openai_chat_completion_client_for_multimodal_generation?
Maybe later, not in this PR.
vllm_omni/entrypoints/chat_utils.py (Outdated)
| "content": default_qwen_omni_system_prompt, | ||
| } | ||
|
|
||
| logger.info("injecting system prompt for Qwen-Omni model") |
I think you can show the full system prompt here for clarity.
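One way to do that is to interpolate the prompt into the log line instead of emitting a generic message. This is a sketch only; `default_qwen_omni_system_prompt` below is a placeholder string, and the real value and logger live in vllm_omni/entrypoints/chat_utils.py.

```python
# Sketch of the review suggestion: log the full injected prompt text.
import logging

logger = logging.getLogger(__name__)
default_qwen_omni_system_prompt = "You are Qwen-Omni."  # placeholder text

message = (
    "injecting system prompt for Qwen-Omni model: %r"
    % default_qwen_omni_system_prompt
)
logger.info(message)
```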
```python
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
port = getattr(args, "port", 8091)
openai_api_base = f"http://localhost:{port}/v1"
```
Can the --host parameter be added?
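A possible shape for that suggestion: expose both `--host` and `--port` instead of hard-coding them, with defaults keeping the previous behavior (localhost:8091). This is a sketch against plain argparse; argument names and defaults are assumptions, not the example script's final form.

```python
# Illustrative sketch of a configurable host/port for the example client.
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("--host", type=str, default="localhost",
                    help="host of the vLLM API server")
parser.add_argument("--port", type=int, default=8091,
                    help="port of the vLLM API server (8091, as before)")
args = parser.parse_args([])  # use the defaults for demonstration

openai_api_key = "EMPTY"
openai_api_base = f"http://{args.host}:{args.port}/v1"
```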
Force-pushed from 54c7a07 to ea74c67
…stem_prompt Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
Force-pushed from ea74c67 to 5e8b334
…ithout system prompt (vllm-project#1288) Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
fix precision issues of qwen3-omni when enable async_chunk without system prompt
Purpose
Fix precision issues of Qwen3-Omni when async_chunk is enabled without a system prompt.
Solves #1278.
Inject a system prompt for Qwen-Omni models if the user's request did not include one.
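The injection behavior described above can be sketched as: prepend a default system message only when the request has none. Names are illustrative placeholders; the actual change is in vllm_omni/entrypoints/chat_utils.py.

```python
# Hedged sketch of the system-prompt injection; not the repo's real code.

DEFAULT_QWEN_OMNI_SYSTEM_PROMPT = "You are Qwen-Omni."  # placeholder text

def inject_system_prompt(messages):
    if any(m.get("role") == "system" for m in messages):
        return messages  # the user supplied one; leave the request untouched
    return [{"role": "system",
             "content": DEFAULT_QWEN_OMNI_SYSTEM_PROMPT}] + messages

injected = inject_system_prompt([{"role": "user", "content": "hi"}])
untouched = inject_system_prompt([{"role": "system", "content": "custom"},
                                  {"role": "user", "content": "hi"}])
```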
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)