
Support Qwen3 tts online serving#968

Merged
Gaohan123 merged 9 commits into vllm-project:main from linyueqian:feature/qwen3-tts-online-serving on Jan 27, 2026

Conversation

@linyueqian (Contributor)


Purpose

This PR adds online serving support for Qwen3-TTS models via the /v1/audio/speech endpoint, addressing Task 1 from RFC #938.

The implementation extends the existing OpenAI-compatible speech API to support Qwen3-TTS specific parameters:

  • CustomVoice: Predefined speaker voices (Vivian, Ryan, etc.) with optional style instructions
  • VoiceDesign: Natural language voice description
  • Base: Voice cloning from reference audio

Key changes:

  • Extended OpenAICreateSpeechRequest with Qwen3-TTS parameters (task_type, language, ref_audio, ref_text, x_vector_only_mode)
  • Updated serving_speech.py to handle Qwen3-TTS prompt format and additional_information
  • Fixed scalar tensor serialization issue for audio sample rate
  • Fixed a @check_model_inputs decorator compatibility issue (warmup failed with the decorator applied)
  • Added example client and documentation
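For reference, the extended request shape described above can be sketched roughly as follows. This is an illustrative dataclass, not the actual Pydantic model merged in this PR; the field names come from the description above, while the types and defaults are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of the extended speech request. Field names follow
# the PR description; types and defaults are assumptions, not the exact
# definition merged in this PR.
@dataclass
class SpeechRequestSketch:
    input: str                          # text to synthesize
    voice: Optional[str] = None         # e.g. "Vivian", "Ryan" (CustomVoice)
    instructions: Optional[str] = None  # style or voice-design description
    task_type: str = "CustomVoice"      # "CustomVoice" | "VoiceDesign" | "Base"
    language: str = "Auto"
    ref_audio: Optional[str] = None     # reference audio URL (Base / cloning)
    ref_text: Optional[str] = None      # transcript of the reference audio
    x_vector_only_mode: bool = False

req = SpeechRequestSketch(input="Hello", voice="Ryan", language="English")
```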

Test Plan

1. CustomVoice Task

# Start server
vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --trust-remote-code \
    --enforce-eager \
    --omni

# Test Chinese
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "其实我真的有发现,我是一个特别善于观察别人情绪的人。",
        "voice": "Vivian",
        "language": "Chinese",
        "instructions": "用特别愤怒的语气说"
    }' --output output_customvoice_chinese.wav

# Test English
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "She said she would be here by noon.",
        "voice": "Ryan",
        "language": "English",
        "instructions": "Very happy."
    }' --output output_customvoice_english.wav

2. VoiceDesign Task

# Start server
vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --trust-remote-code \
    --enforce-eager \
    --omni

# Test VoiceDesign
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "哥哥,你回来啦,人家等了你好久好久了,要抱抱!",
        "task_type": "VoiceDesign",
        "language": "Chinese",
        "instructions": "体现撒娇稚嫩的萝莉女声,音调偏高且起伏明显,营造出黏人、做作又刻意卖萌的听觉效果。"
    }' --output output_voicedesign.wav

3. Base Task (Voice Clone)

# Start server
vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --trust-remote-code \
    --enforce-eager \
    --omni

# Test voice cloning
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Good one. Okay, fine, I am just gonna leave this sock monkey here. Goodbye.",
        "task_type": "Base",
        "language": "Auto",
        "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
        "ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you.",
        "x_vector_only_mode": false
    }' --output output_base_clone.wav
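The curl calls above can also be issued from Python. The helper below is a hypothetical convenience (build_speech_payload is not part of the PR); it only assumes the server accepts plain JSON at /v1/audio/speech, as shown in the test plan.

```python
import json

# Hypothetical helper mirroring the curl calls above: it builds the JSON
# body for POST /v1/audio/speech. Parameter names come from the test plan;
# the helper itself is illustrative.
def build_speech_payload(text, task_type="CustomVoice", **extra):
    payload = {"input": text, "task_type": task_type}
    payload.update(extra)  # voice, language, instructions, ref_audio, ...
    return json.dumps(payload)

# Equivalent of the "Base" (voice clone) request:
body = build_speech_payload(
    "Good one. Okay, fine, I am just gonna leave this sock monkey here.",
    task_type="Base",
    language="Auto",
    ref_audio="https://example.com/ref.wav",  # placeholder, not the real URL
    ref_text="Okay. Yeah.",
    x_vector_only_mode=False,
)
# POST it with e.g.:
# requests.post("http://localhost:8000/v1/audio/speech", data=body,
#               headers={"Content-Type": "application/json"})
```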

Test Result

All three task types successfully generate audio output:

CustomVoice:
output_customvoice_chinese.wav
output_customvoice_english.wav

VoiceDesign:
output_voicedesign.wav

Base (Voice Clone):
output_base_clone.wav



@chatgpt-codex-connector (bot) left a comment

💡 Codex Review: here are some automated review suggestions for this pull request.

Reviewed commit: 18ee8361c3


@Gaohan123 (Collaborator) left a comment:

Thanks for the timely work. Overall it is clear and makes sense. Please address the remaining review comments. Thanks!

try:
prompt = {"prompt": request.input}
# Check if this is a Qwen3-TTS model
if self._is_tts_model():
Gaohan123 (Collaborator):

The logic here seems a bit confusing. _is_tts_model sounds like it should match all TTS models, but it actually matches only Qwen3-TTS, with the else branch handling other TTS models. In my view, we can generally check whether a request includes a given parameter and process it if present, skipping it otherwise. That would be easier to understand and more generalized.
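The suggestion above — forward whichever optional parameters the request actually carries, instead of branching on the model — might look like this minimal sketch. The parameter list is taken from the PR description; the helper name is hypothetical.

```python
# Hypothetical sketch of presence-based parameter handling: forward only
# the optional TTS parameters the client actually supplied, with no
# model-specific branching. Parameter names come from the PR description.
_OPTIONAL_TTS_PARAMS = ("task_type", "language", "ref_audio",
                        "ref_text", "x_vector_only_mode")

def collect_tts_params(request: dict) -> dict:
    # Keep only parameters that are present and non-null on the request.
    return {k: request[k] for k in _OPTIONAL_TTS_PARAMS
            if request.get(k) is not None}
```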

linyueqian (Contributor, author):

I will revise it to keep model detection only for the prompt format (which I think is model-specific) while generalizing the parameter handling.

Gaohan123 (Collaborator):

I think the updated version mostly makes sense. One other idea: in OmniOpenAIServingSpeech, have self.engine_client maintain the required input parameters for each loaded model. Then you could generically check whether each request includes the required parameters, without model-specific implementations. Of course this needs more work; if it is too heavy, we can leave it for a follow-up PR.



class OmniOpenAIServingSpeech(OpenAIServing, AudioMixin):
def _is_tts_model(self) -> bool:
Gaohan123 (Collaborator):

Is this a valid check for tts_model_stage? We are going to split qwen3_tts into two stages, and the model_stage will change accordingly.

linyueqian (Contributor, author):

If the stage names change, this check will break. Is there a better way to detect Qwen3-TTS models?

@hsliuustc0106 (Collaborator) commented:

I would like to suggest that we implement parameter validations as follows:

  1. Voice Field Change: The voice field was modified from a strict Literal with specific allowed values to an unrestricted str | None, which has removed validation for allowed voice names such as Vivian, Ryan, etc.

  2. Missing Task_type Dependency Validation: There is currently no validation ensuring that the Base task has the required ref_audio parameter, nor a check confirming that the CustomVoice task is provided with a valid voice parameter. The task type can be set, but parameter requirements aren’t enforced.

  3. Language Field Validation: The language field accepts any arbitrary string without validation for supported languages like Chinese, English, Japanese, Korean, or Auto.

  4. Ref_Audio Format Validation: The ref_audio format lacks validation to confirm whether it’s a valid URL, file path, or base64 encoded data. Additionally, there is no check for supported audio formats.

  5. Cross-Parameter Validation: There is no validation ensuring that ref_text is only used with the Base task, nor that x_vector_only_mode is only meaningful with the Base task. Also, no checks are in place for conflicting parameters between different task types.

  6. Max_New_Tokens Constraints: The max_new_tokens range does not have defined min/max constraints (unlike speed, which has ge=0.25 and le=4.0).

  7. Empty/Null Checks: There are no checks in place to validate that input text is not empty or to enforce reasonable length limits on instructions.

Adding these validations would enhance the robustness of the API and provide clearer error messages to users, ultimately leading to a better user experience.
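A minimal sketch of a few of these checks (items 2, 3, and 5), assuming plain functions rather than whatever validators were actually merged; the voice and language sets are abbreviated placeholders.

```python
# Hypothetical validation helper; not the code merged in this PR.
_VOICES = {"Vivian", "Ryan"}  # abbreviated; the real allow-list is longer
_LANGUAGES = {"Auto", "Chinese", "English", "Japanese", "Korean"}

def validate_speech_request(task_type, voice=None, language="Auto",
                            ref_audio=None, ref_text=None):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    if task_type == "Base" and not ref_audio:
        errors.append("Base task requires ref_audio")
    if task_type == "CustomVoice" and voice not in _VOICES:
        errors.append(f"CustomVoice task requires a known voice, got {voice!r}")
    if language not in _LANGUAGES:
        errors.append(f"unsupported language {language!r}")
    if ref_text and task_type != "Base":
        errors.append("ref_text is only valid for the Base task")
    return errors
```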

@linyueqian (Contributor, author), quoting the validation suggestions above:

I just added them in the serving layer.



class OmniOpenAIServingSpeech(OpenAIServing, AudioMixin):
def _requires_qwen3_tts_prompt(self) -> bool:
(Collaborator):

All of the names should be model-agnostic.

@Gaohan123 (Collaborator) left a comment:
Please add some unit tests to protect the important new methods. Thanks!

@hsliuustc0106 (Collaborator) left a comment:
LGTM

@hsliuustc0106 added the `ready` (trigger buildkite CI) label on Jan 27, 2026
@Gaohan123 (Collaborator) left a comment:

LGTM. Thanks for the valuable work!

@Gaohan123 Gaohan123 merged commit 77ff875 into vllm-project:main Jan 27, 2026
7 checks passed
"Eric",
"Ryan",
"Aiden",
"One_Anna",

There is a typo here: it should be Ono_Anna, not One_Anna.

linyueqian (Contributor, author):
This is due to a pre-commit auto-fix. I will patch a fix now.

"One_Anna",
"Sohee",
}
_TTS_LANGUAGES: set[str] = {"Auto", "Chinese", "English", "Japanese", "Korean"}

According to https://qwen.ai/blog?id=qwen3-tts-1128, it should also support German, Italian, Portuguese, Spanish, French, and Russian.

linyueqian (Contributor, author):

Thanks! I will change this as well.

nussejzz pushed a commit to nussejzz/vllm-omni that referenced this pull request Jan 27, 2026
@verigle commented Jan 27, 2026:

Could the container image be updated?

@chuanSir123:

It seems concurrent processing is not supported: with 3 concurrent requests, they are processed sequentially, and inference time increases linearly.

@chenchen0611:

I encountered the same concurrent-processing problem.

@RinRin-32:

Thank you for the online serving support. Should we also have a way to use a precomputed x-vector? That way, repeatedly cloning a voice from the same file would not need to run speaker encoding each time. If so, I can work on the feature myself.

@linyueqian linyueqian deleted the feature/qwen3-tts-online-serving branch January 29, 2026 06:03
@linyueqian (Contributor, author), replying to the precomputed x-vector question above:

We don't support precomputed x-vectors in the online serving API yet. Feel free to open a PR!
