Add Qwen3-Omni (dense) OpenVINO export and inference support#1640
Open
sgonorov wants to merge 8 commits intohuggingface:mainfrom
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Wovchena
reviewed
Mar 30, 2026
    "transformers",
    "AutoModelForImageTextToText",
)
TasksManager._CUSTOM_CLASSES[("pt", "qwen3_omni", "image-text-to-text")] = (
Contributor
I suspect a new omni task is needed
"qwen3": "optimum-intel-internal-testing/tiny-random-qwen3",
"qwen3_moe": "optimum-intel-internal-testing/tiny-random-qwen3moe",
"qwen3_vl": "optimum-intel-internal-testing/tiny-random-qwen3-vl",
"qwen3_omni": "optimum-intel-internal-testing/tiny-random-qwen3-omni",
03f88fb to e04d859
What does this PR do?
Adds OpenVINO export and inference support for the Qwen3-Omni dense multimodal model (text + vision + audio).
The model is exported as 6 sub-models: language model, text embeddings, vision patch embeddings, vision merger (with deepstack features), vision position embeddings, and audio encoder. Inference supports text-only, image, audio, and combined image+audio inputs through OVModelForVisualCausalLM.
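As a rough illustration of how the six exported parts relate to the supported input combinations, the sketch below maps each non-text modality to the sub-models it would need. The dictionary keys and the `required_submodels` helper are hypothetical names chosen for this example, not identifiers from the PR. It also assumes that the language model and text embeddings are needed for every request, which the description implies but does not state.

```python
# Hypothetical names for illustration; the PR's actual identifiers may differ.
ALWAYS = ["language_model", "text_embeddings"]
BY_MODALITY = {
    "image": [
        "vision_patch_embeddings",
        "vision_merger",
        "vision_position_embeddings",
    ],
    "audio": ["audio_encoder"],
}


def required_submodels(modalities):
    """Return the sub-models needed for a set of non-text input modalities."""
    needed = list(ALWAYS)
    # Sort so the result does not depend on set iteration order.
    for modality in sorted(modalities):
        if modality not in BY_MODALITY:
            raise ValueError(f"unsupported modality: {modality}")
        needed.extend(BY_MODALITY[modality])
    return needed
```

Under these assumptions, a text-only request touches two sub-models, while a combined image+audio request touches all six.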
Key implementation details:
- The language model exports hidden_states alongside logits so that the stateful transformation is applied correctly.
- Position IDs are 4D for Qwen3-Omni's multimodal RoPE (vs. 3D in Qwen3-VL).
- Audio inference replicates the original model's windowed chunking pipeline before calling the exported audio encoder.
- The vision merger uses InputEmbeddingPatcher for the position embedding sub-model to work around inspect.signature resolution issues after prior exports.
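The windowed chunking mentioned for audio can be pictured with the sketch below. It is not the PR's code: the function name `chunk_frames`, the window size, and the padding of the final chunk are assumptions made for illustration, and the real pipeline follows whatever the original Qwen3-Omni audio preprocessing does.

```python
def chunk_frames(frames, window, pad_value=0.0):
    """Split a sequence of feature frames into fixed-size windows.

    Returns (chunks, lengths): every chunk has exactly `window` entries,
    and lengths[i] is the number of real (unpadded) frames in chunks[i].
    """
    if window <= 0:
        raise ValueError("window must be positive")
    chunks, lengths = [], []
    for start in range(0, len(frames), window):
        piece = list(frames[start:start + window])
        lengths.append(len(piece))
        # Pad the last, possibly shorter, chunk up to the window size.
        piece.extend([pad_value] * (window - len(piece)))
        chunks.append(piece)
    return chunks, lengths
```

In a setup like this, each padded chunk could be passed to the exported audio encoder, with the recorded lengths used to discard outputs that correspond to padding.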
Note: Requires transformers from commit 3d1a4f5e for Qwen3-Omni dense variant support (not yet in a stable release). The Talker subsystem (speech synthesis) is not present in the dense 4B variant and is not supported.
Before submitting