2.0.4: Support Qwen3-ASR and Qwen3-TTS w/ streaming#73

Merged
SearchSavior merged 27 commits intomainfrom
2.0.4
Apr 5, 2026

Conversation

@SearchSavior
Owner

Since January I have been working on a full OpenVINO implementation of Qwen3-ASR and Qwen3-TTS. Now, support has arrived.

Instead of using the official qwen-tts repo, I rebuilt the model in PyTorch from scratch so I could go really deep into optimizing with OpenVINO. This required an entire separate codebase that I still need to clean up. We'll need it as a reference to improve this implementation, because I did not use transformers anywhere in the pipeline, save AutoTokenizer.

OpenArc now supports:

- Base: voice cloning
- VoiceDesign: using only text to describe a voice
- CustomVoice: using voices trained/implemented by Qwen

https://huggingface.co/collections/Echo9Zulu/qwen3-tts-openvino

https://huggingface.co/Echo9Zulu/Qwen3-ASR-0.6B-INT8_ASYM-OpenVINO


The workflow for qwen3-asr and qwen3-tts follows everything else in OpenArc and integrates seamlessly into the existing user flow. `openarc add` has some new options for each model_type. However, qwen3-tts has many knobs I haven't yet worked out how to expose as easily as everything else.

Let's look at an example of voice cloning using the openai Python library:

```python
import base64
import os
from pathlib import Path

from openai import OpenAI

API_KEY = os.environ["OPENARC_API_KEY"]

BASE_URL = "http://localhost:8003/v1"
MODEL = "voice_clone"

REF_WAV = Path("reference.wav")

text = "Echo9Zulu is an insane person"

ref_audio_b64 = base64.b64encode(REF_WAV.read_bytes()).decode("ascii")
ref_text = "Transcript of what is spoken in the reference WAV, for ICL."

qwen3_tts = {
    "input": text,
    "ref_audio_b64": ref_audio_b64,  # audio we want to clone
    "ref_text": ref_text,  # transcription of the audio we want to clone
    "x_vector_only": False,  # use if you can't (or don't want to) provide a transcription, or are testing something wild
    "language": "english",
    "max_new_tokens": 2048,
    "do_sample": True,
    "top_k": 50,
    "top_p": 1.0,
    "temperature": 0.9,
    "repetition_penalty": 1.05,
    "subtalker_do_sample": True,
    "subtalker_top_k": 50,
    "subtalker_top_p": 1.0,
    "subtalker_temperature": 0.9,
    "stream": True,
    "stream_chunk_frames": 300,  # frames streamed per chunk; 300 comes from the official impl
    "stream_left_context": 25,  # frames kept from the previous chunk
}

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

response = client.audio.speech.create(
    model=MODEL,
    input=text,
    voice=MODEL,
    response_format="wav",
    extra_body={"openarc_tts": {"qwen3_tts": qwen3_tts}},
)

Path("out_speech.wav").write_bytes(response.content)
print("Wrote out_speech.wav")
```
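For intuition, `stream_chunk_frames` and `stream_left_context` describe a sliding window over codec frames: each new chunk re-includes a little context from the previous one. A minimal sketch of that windowing arithmetic, reconstructed from the comments above (this is our illustration, not OpenArc source):

```python
def chunk_frames(total_frames: int, chunk: int = 300, left_context: int = 25):
    """Yield (start, end) frame windows for streaming.

    Each window emits up to `chunk` new frames and re-includes up to
    `left_context` frames from the previous chunk to smooth boundaries.
    """
    start = 0
    while start < total_frames:
        end = min(start + chunk, total_frames)
        ctx_start = max(0, start - left_context)  # overlap with prior chunk
        yield (ctx_start, end)
        start = end  # only `chunk` new frames advance per iteration


# For 650 total frames this yields (0, 300), (275, 600), (575, 650).
```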
Everything is set at request time: the engine is stateless, one copy of the model sits in memory, and the API design builds around the shape of its inputs. Audio language models may do magic to model data other than text, but in the machine room everything still flows in and out of requests, which makes the code we use to control model behavior quite dynamic.

It is VERY MUCH set it once and forget it, leaving all management to a downstream application. I have some ideas about how to make this easier to configure, but for now this is a very good initial commit.
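Since every knob travels in the request body, a downstream application can centralize its defaults and pass only per-request overrides. A small sketch of that pattern, using the knob names from the example above (the helper name and validation are ours, not part of OpenArc):

```python
# Defaults mirror the qwen3_tts example payload; adjust to taste.
QWEN3_TTS_DEFAULTS = {
    "language": "english",
    "max_new_tokens": 2048,
    "do_sample": True,
    "top_k": 50,
    "top_p": 1.0,
    "temperature": 0.9,
    "repetition_penalty": 1.05,
    "stream": True,
    "stream_chunk_frames": 300,
    "stream_left_context": 25,
}


def build_tts_body(text: str, **overrides) -> dict:
    """Merge per-request overrides onto the defaults; reject unknown knobs."""
    unknown = set(overrides) - set(QWEN3_TTS_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown qwen3_tts option(s): {sorted(unknown)}")
    body = {**QWEN3_TTS_DEFAULTS, **overrides, "input": text}
    return {"openarc_tts": {"qwen3_tts": body}}
```

The returned dict can be passed straight to `extra_body=` in the openai call shown earlier.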

The other example I suggest trying is demos/talk_to_llm.py, which supports voice-clone streaming with an LLM in the loop. It's a little cumbersome compared to most of the other tools, but it checks all the "does everything work" boxes.
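At its core, a demo like talk_to_llm.py is an ASR → LLM → TTS loop. A loose skeleton of that shape (our paraphrase for illustration; the demo's actual structure may differ):

```python
def talk_loop(record, transcribe, chat, speak):
    """Run turns until record() returns None; each turn flows through all stages."""
    while True:
        audio = record()           # capture user speech
        if audio is None:
            break
        user_text = transcribe(audio)  # Qwen3-ASR
        reply = chat(user_text)        # any LLM served in the loop
        speak(reply)                   # Qwen3-TTS, streamed back as audio
```

Each stage is just a callable, so the same loop works whether the stages hit OpenArc endpoints or local stubs.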


Hardware Requirements

The OpenArc implementation of qwen3-tts makes heavy use of dynamic shapes and cannot support NPU yet. Once I get the source repo in order it should be possible. HOWEVER, after studying this, our community should put effort into a different TTS solution for NPU. For low-powered devices, even the OpenVINO optimizations are not fast enough for real time due to the computational complexity of predicting codebooks: it can't be parallelized and is data dependent. I have more notes on this that I have yet to synthesize into a writeup, so that's next ;)
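A toy illustration of that data dependence (not OpenArc code): each residual codebook is chosen conditioned on the state produced by the previous level, so the levels form a sequential chain that cannot fan out across cores.

```python
def predict_codebooks(hidden, levels):
    """Each level consumes the state produced by the previous one,
    so the loop cannot be parallelized across levels."""
    codes = []
    for predict in levels:
        code, hidden = predict(hidden)  # data-dependent: needs the prior result
        codes.append(code)
    return codes
```

This is the same structural obstacle that makes autoregressive token decoding hard to parallelize; residual codebook prediction just repeats it per frame.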

Right now, the entire model does not run on the GPU device. Instead, I found through testing that OpenVINO provides better CPU kernels than GPU for some of the sub-model ops, which mostly works to limit how much time gets spent predicting new audio codebooks.


Performance

I'll update this soon

@SearchSavior SearchSavior merged commit 9cae2c2 into main Apr 5, 2026