Skip to content

Add Qwen3-Omni (dense) OpenVINO export and inference support#1640

Open
sgonorov wants to merge 8 commits intohuggingface:mainfrom
sgonorov:qwen-3-omni-dense-support
Open

Add Qwen3-Omni (dense) OpenVINO export and inference support#1640
sgonorov wants to merge 8 commits intohuggingface:mainfrom
sgonorov:qwen-3-omni-dense-support

Conversation

@sgonorov
Copy link
Copy Markdown

What does this PR do?

Adds OpenVINO export and inference support for the Qwen3-Omni dense multimodal model (text + vision + audio).

The model is exported as 6 sub-models: language model, text embeddings, vision patch embeddings, vision merger (with deepstack features), vision position embeddings, and audio encoder. Inference supports text-only, image, audio, and combined image+audio inputs through OVModelForVisualCausalLM.

Key implementation details:

Language model exports hidden_states alongside logits for correct stateful transformation
4D position IDs for Qwen3-Omni's multimodal RoPE (vs 3D in Qwen3-VL)
Audio inference replicates the original model's windowed chunking pipeline before calling the exported audio encoder
Vision merger uses InputEmbeddingPatcher for the position embedding sub-model to work around inspect.signature resolution issues after prior exports
Note: Requires transformers from commit 3d1a4f5e for Qwen3-Omni dense variant support (not yet in a stable release). The Talker subsystem (speech synthesis) is not present in the dense 4B variant and is not supported.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sgonorov sgonorov marked this pull request as ready for review March 25, 2026 22:13
"transformers",
"AutoModelForImageTextToText",
)
TasksManager._CUSTOM_CLASSES[("pt", "qwen3_omni", "image-text-to-text")] = (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect a new omni task is needed

"qwen3": "optimum-intel-internal-testing/tiny-random-qwen3",
"qwen3_moe": "optimum-intel-internal-testing/tiny-random-qwen3moe",
"qwen3_vl": "optimum-intel-internal-testing/tiny-random-qwen3-vl",
"qwen3_omni": "optimum-intel-internal-testing/tiny-random-qwen3-omni",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't exist

@sgonorov sgonorov force-pushed the qwen-3-omni-dense-support branch from 03f88fb to e04d859 Compare March 30, 2026 10:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants