2.0.4: Support Qwen3-ASR and Qwen3-TTS w/ streaming #73
Merged
SearchSavior merged 27 commits into main (Apr 5, 2026)
Conversation
- add entrypoint in qwen3_tts.py to help test performance and debug
- revert commit ea180f4
- refresh documentation; provide qwen-tts examples
Since January I have been working on a full OpenVINO implementation of Qwen3-ASR and Qwen3-TTS. Now, support has arrived.
Instead of using the official qwen-tts repo, I decided to rebuild it in PyTorch from scratch so I could go deep on optimizing with OpenVINO. This required an entirely separate codebase that I still need to clean up. We'll need it as a reference to improve this implementation, because I did not use transformers anywhere in the pipeline except for AutoTokenizer.

OpenArc now supports:
- Base: voice cloning
- VoiceDesign: using only text to describe a voice
- CustomVoice: using voices trained/implemented by Qwen
https://huggingface.co/collections/Echo9Zulu/qwen3-tts-openvino
https://huggingface.co/Echo9Zulu/Qwen3-ASR-0.6B-INT8_ASYM-OpenVINO
The workflow for qwen-asr and qwen-tts follows everything else in OpenArc and integrates seamlessly into the existing user flow.
`openarc add` has some new options for each `model_type`. However, qwen-tts has many knobs I haven't worked out how to implement in a way that is as easy as everything else. Let's look at an example of voice cloning using the openai Python library:
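Here is a minimal sketch of what a voice-cloning request could look like against OpenArc's OpenAI-compatible endpoint. The base URL, model id, `voice` value, and the `ref_audio` field name are assumptions for illustration, not confirmed parameter names from OpenArc:

```python
# Sketch: voice cloning via an OpenAI-compatible TTS endpoint.
# NOTE: base_url, model id, voice name, and "ref_audio" are hypothetical.


def build_speech_request(text: str, ref_audio: str) -> dict:
    """Collect the request parameters in one place so they are easy to tweak."""
    return {
        "model": "Qwen3-TTS",                     # placeholder model id
        "voice": "clone",                         # hypothetical voice selector
        "input": text,
        "extra_body": {"ref_audio": ref_audio},   # assumed voice-cloning knob
    }


if __name__ == "__main__":
    # Import here so the helper above stays usable without the openai package.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    params = build_speech_request("Hello from OpenArc.", "reference.wav")
    resp = client.audio.speech.create(**params)
    with open("cloned.wav", "wb") as f:
        f.write(resp.content)  # save the synthesized audio
```

Because everything routes through the standard `audio/speech` surface, any OpenAI-compatible client should work the same way.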
It is VERY MUCH set it once and forget it, leaving all management to a downstream application. I have some ideas about how to make this easier to configure, but for now, this is a very good initial commit.
The other example I suggest trying is demos/talk_to_llm.py, which supports voice-clone streaming with an LLM in the loop. It is a little cumbersome compared with most of the other tools, but it checks all the "does everything work" boxes.
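The LLM-in-the-loop idea can be sketched as follows: stream text deltas from the LLM and hand off sentence-complete chunks to TTS as they arrive, so audio can start before the LLM finishes. The chunker below is a simplified stand-in, not the actual logic from talk_to_llm.py:

```python
# Sketch: buffer streamed LLM deltas and yield sentence-complete chunks
# that can each be sent to the TTS endpoint. Simplification: a delta
# containing multiple sentences is flushed as one chunk.
from typing import Iterable, Iterator


def sentences_from_stream(deltas: Iterable[str]) -> Iterator[str]:
    """Accumulate text deltas; yield whenever the buffer ends a sentence."""
    buf = ""
    for delta in deltas:
        buf += delta
        cut = max(buf.rfind("."), buf.rfind("!"), buf.rfind("?"))
        if cut != -1:
            yield buf[: cut + 1].strip()
            buf = buf[cut + 1:]
    if buf.strip():            # flush any trailing partial sentence
        yield buf.strip()


if __name__ == "__main__":
    demo = ["Hel", "lo there. ", "How are", " you?"]
    for sentence in sentences_from_stream(demo):
        print(sentence)        # in the real demo, each chunk goes to TTS
```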
Hardware Requirements
The OpenArc implementation of qwen3-tts makes heavy use of dynamic shapes and cannot support NPU yet; once I get the source repo in order it should be possible. However, after studying this, I think our community should put effort into a different TTS solution for NPU. On low-powered devices, even the OpenVINO optimizations are not fast enough for real time, due to the computational complexity of predicting codebooks: the process cannot be parallelized and is data dependent. I have more notes on this that I have yet to synthesize into a writeup, so that's next ;)
Right now, the entire model does not run on the GPU device. Through testing I found that OpenVINO provides better CPU kernels than GPU for some of the sub-model ops, which mostly serves to limit how much time gets spent predicting new audio codebooks.
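The data dependence described above can be illustrated with a toy loop: each codebook token is a function of the previous one, so step t cannot start until step t-1 finishes. The predictor here is a dummy stand-in, not the real Qwen3-TTS codebook head:

```python
# Toy illustration of serial codebook prediction. The dummy predictor
# below stands in for a full model forward pass; the point is the loop
# structure, which cannot be parallelized across steps.


def predict_next(prev_token: int) -> int:
    """Dummy predictor: a deterministic function of the previous token."""
    return (prev_token * 31 + 7) % 1024


def generate_codebook(start: int, steps: int) -> list[int]:
    tokens = [start]
    for _ in range(steps):
        # Serial dependency: step t consumes the output of step t-1,
        # so no two iterations can run concurrently.
        tokens.append(predict_next(tokens[-1]))
    return tokens
```

This is why batching or wider kernels help less here than they do for the encoder-style sub-models: wall-clock time is dominated by the length of this serial chain.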
Performance
I'll update this soon