[Test] Add precision test cases for Qwen3-Omni-30B-A3B-Instruct in CI #828
hsliuustc0106 merged 36 commits into vllm-project:main from
Conversation
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 769df694b1
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
audio_content = convert_audio_to_text(audio_data)
print(f"text content is: {text_content}")
print(f"audio content is: {audio_content}")
assert cosine_similarity_text(audio_content.lower(), text_content.lower()) > 0.9, (
Text input scenario similarity: 1
Audio input scenario average similarity: 0.9425
Image input scenario average similarity: 0.9870
Video input scenario average similarity: 0.9655
Audio truncation error scenario average similarity: 0.6484
Considering factors such as recognition errors from the Whisper model, the threshold is set to 0.9.
If the audio truncation affects only a few tokens (for example, only the last character is cut off), the similarity score may still exceed 0.9. To catch such missed detections, it is recommended to add a check that compares the last few characters in a subsequent PR.
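A minimal sketch of that tail-comparison idea. `cosine_like_similarity` is a stand-in for the suite's `cosine_similarity_text` helper (its real implementation is not shown in this PR), so `difflib.SequenceMatcher` is used here only to keep the example self-contained:

```python
from difflib import SequenceMatcher


def cosine_like_similarity(a: str, b: str) -> float:
    # Stand-in for the suite's cosine_similarity_text helper (assumed here);
    # SequenceMatcher's ratio plays the role of the similarity score.
    return SequenceMatcher(None, a, b).ratio()


def tails_match(expected: str, actual: str, n: int = 5) -> bool:
    # Compare the last n characters so a truncated ending still fails
    # even when the overall similarity stays above the 0.9 threshold.
    return expected[-n:].lower() == actual[-n:].lower()


full = "the quick brown fox jumps over the lazy dog"
truncated = full[:-1]  # last character cut off

# Similarity alone stays above 0.9, but the tail check catches the truncation.
assert cosine_like_similarity(full, truncated) > 0.9
assert not tails_match(full, truncated)
```

With both checks combined, a transcript that loses only its final characters would be flagged even though its similarity score alone would pass.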
amd-ci failed

@tjtanaa Hi, TJian. Could you help review the failing case in the AMD Buildkite test? We are planning to add precision verification tests for Qwen3-Omni-30B-A3B-Instruct, but one case shows a strange error.

ok, let me check
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Currently, there is a garbled-output issue on AMD machines that causes AMD-CI failures. I have modified the test-amd configuration to temporarily skip this test case in AMD environments. The case will be re-enabled once the garbled-output issue is resolved.
.buildkite/test-amd.yaml
Outdated
- export MIOPEN_DEBUG_CONV_DIRECT=0
- export MIOPEN_DEBUG_CONV_GEMM=0
- pytest -s -v tests/e2e/offline_inference/test_qwen3_omni.py tests/e2e/online_serving/test_qwen3_omni.py
- pytest -s -v tests/e2e/offline_inference/test_qwen3_omni.py tests/e2e/online_serving/test_qwen3_omni.py::test_video_to_audio_concurrent
@yenuo26 can you do this in tests/e2e/online_serving/test_qwen3_omni.py instead, by adding the @pytest.mark.skipif() decorator to the test that is failing?
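For illustration, a sketch of moving the skip from the Buildkite pipeline config into the test file itself. The `is_rocm()` body here is a placeholder (the suite's real detection helper is not shown in this PR), and the test body is elided:

```python
import os

import pytest


def is_rocm() -> bool:
    # Placeholder for the suite's ROCm detection helper (assumed); a real
    # implementation might probe torch.version.hip instead of an env var.
    return os.environ.get("VLLM_TARGET_DEVICE", "") == "rocm"


@pytest.mark.skipif(is_rocm(), reason="Known garbled-output issue on AMD; re-enable once fixed")
def test_video_to_audio_concurrent() -> None:
    # Body elided; the point is that the skip lives next to the test
    # instead of in .buildkite/test-amd.yaml, so other platforms still run it.
    assert True
```

Keeping the skip in the test keeps the pipeline YAML identical across platforms, and the skip reason shows up directly in the pytest report.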
@yenuo26 can you add [ROCm] to the issue title?
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
fix precommits
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
fixed
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Fix CI or retest it again? I have a question: how many omni-servers have been launched for the omni model test in the H100 workflow?
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
# Verify text output success
assert text_content is not None and len(text_content) >= 2, "No text output is generated"
assert "square" in text_content.lower(), "The output do not contain keywords."
(Worker_TP0 pid=12753) [Stage-0] INFO 01-22 16:33:25 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker_TP1 pid=12754) [Stage-0] INFO 01-22 16:33:25 [multiproc_executor.py:707] Parent process exited, terminating worker
[Stage-0] INFO 01-22 16:33:27 [omni_stage.py:1498] Stage worker exiting
PASSED
=========================================================================== FAILURES ===========================================================================
___________________________________________________________ test_mix_to_text_audio_001[omni_server0] ___________________________________________________________
client = <openai.OpenAI object at 0x79226172b860>, omni_server = <tests.conftest.OmniServer object at 0x79226172bda0>
@pytest.mark.skipif(is_rocm(), reason="Test skipped on AMD environment due to known output issues")
@pytest.mark.parametrize("omni_server", test_params, indirect=True)
def test_mix_to_text_audio_001(client: openai.OpenAI, omni_server) -> None:
"""
Test multi-modal input processing and text/audio output generation via OpenAI API.
Deploy Setting: default yaml
Input Modal: text + audio + video + image
Output Modal: text + audio
Input Setting: stream=True
Datasets: single request
"""
# Test single completion
e2e_list = list()
video_data_url = f"data:video/mp4;base64,{generate_synthetic_video(224, 224, 300)['base64']}"
image_data_url = f"data:image/jpeg;base64,{generate_synthetic_image(224, 224)['base64']}"
audio_data_url = f"data:audio/wav;base64,{generate_synthetic_audio(5, 1)['base64']}"
messages = dummy_messages_from_mix_data(
system_prompt=get_system_prompt(),
video_data_url=video_data_url,
image_data_url=image_data_url,
audio_data_url=audio_data_url,
content_text=get_prompt("mix"),
)
# Test single completion
start_time = time.perf_counter()
chat_completion = client.chat.completions.create(model=omni_server.model, messages=messages, stream=True)
text_content = ""
audio_data = None
for chunk in chat_completion:
for choice in chunk.choices:
if hasattr(choice, "delta"):
content = getattr(choice.delta, "content", None)
else:
content = None
modality = getattr(chunk, "modality", None)
if modality == "audio" and content:
# Audio chunk - content
if audio_data is None:
audio_data = content
else:
audio_data += content
elif modality == "text" and content:
# Text chunk - accumulate text content
text_content += content if content else ""
# Verify E2E
current_e2e = time.perf_counter() - start_time
print(f"the request e2e is: {current_e2e}")
# TODO: Verify the E2E latency after confirmation baseline.
e2e_list.append(current_e2e)
print(f"the avg e2e is: {sum(e2e_list) / len(e2e_list)}")
# Verify all completions succeeded
assert audio_data is not None, "No audio output is generated"
# Verify text output success
assert text_content is not None and len(text_content) >= 2, "No text output is generated"
assert "square" in text_content.lower(), "The output do not contain keywords."
E AssertionError: The output do not contain keywords.
E assert 'square' in 'the audio contains the sound of flowing water.\n\nthe image displays five colored spheres against a black background:\n* a yellow sphere.\n* a green sphere.\n* a purple sphere.\n* two brown spheres.\n\nthese spheres move around the screen, sometimes overlapping with each other.'
E + where 'the audio contains the sound of flowing water.\n\nthe image displays five colored spheres against a black background:\n* a yellow sphere.\n* a green sphere.\n* a purple sphere.\n* two brown spheres.\n\nthese spheres move around the screen, sometimes overlapping with each other.' = <built-in method lower of str object at 0x7927a613ee70>()
E + where <built-in method lower of str object at 0x7927a613ee70> = 'The audio contains the sound of flowing water.\n\nThe image displays five colored spheres against a black background:\n* A yellow sphere.\n* A green sphere.\n* A purple sphere.\n* Two brown spheres.\n\nThese spheres move around the screen, sometimes overlapping with each other.'.lower
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
try the CI 5 times

Purpose
This PR aims to add CI tests for the precision test cases of Qwen3-Omni-30B-A3B-Instruct.
For the design and plan, please refer to #400.
After the modifications, the total execution time for the two Qwen3-Omni online test cases is 7 minutes.
Test Plan
pytest -sv test_qwen3_omni.py --html=report.html --self-contained-html --capture=sys
Test Result
CI Result