Dev/debug qwen tts#903

Merged
david6666666 merged 2 commits into vllm-project:release/v0.14.0rc1 from tzhouam:dev/debug_qwen_tts
Jan 22, 2026

Conversation


@tzhouam tzhouam commented Jan 22, 2026

Purpose

This PR debugs the Qwen3 TTS pipeline so the offline-inference example runs end to end.

Test Plan

Ran the offline-inference example end2end.py end to end; session log below.
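For reference, a minimal invocation sketch of the test above (the path is taken from the session log below and assumes a vllm-omni checkout with the example present):

```shell
# Run the Qwen3 TTS offline-inference example end to end.
# Assumes a vllm-omni checkout; path matches the shell prompt in the log below.
cd examples/offline_inference/qwen3_tts
python3 end2end.py
```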

Test Result

(projects) ztc@ZTC-Desktop:~/projects/vllm-omni/examples/offline_inference/qwen3_tts$ python3 end2end.py 
WARNING 01-22 22:41:31 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
WARNING 01-22 22:41:31 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
INFO 01-22 22:41:31 [omni.py:126] Initializing stages for model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO 01-22 22:41:36 [initialization.py:232] Loaded OmniTransferConfig with 0 connector configurations
INFO 01-22 22:41:36 [omni_stage.py:109] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'llm', 'runtime': {'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'qwen3_tts', 'model_arch': 'Qwen3TTSForConditionalGeneration', 'worker_cls': 'vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker', 'scheduler_cls': 'vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler', 'enforce_eager': True, 'trust_remote_code': True, 'async_scheduling': False, 'enable_prefix_caching': False, 'engine_output_type': 'audio', 'gpu_memory_utilization': 0.1, 'distributed_executor_backend': 'mp', 'max_num_batched_tokens': 1000000, 'max_num_seqs': 1}, 'final_output': True, 'final_output_type': 'audio'}
INFO 01-22 22:41:36 [omni.py:318] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-22 22:41:40 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-22 22:41:41 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-22 22:41:41 [omni_stage.py:499] Starting stage worker with model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 01-22 22:41:43 [initialization.py:232] Loaded OmniTransferConfig with 0 connector configurations
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 01-22 22:41:44 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 01-22 22:41:44 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-0] INFO 01-22 22:41:44 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 01-22 22:41:44 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 01-22 22:41:44 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 01-22 22:41:50 [model.py:530] Resolved architecture: Qwen3TTSForConditionalGeneration
[Stage-0] INFO 01-22 22:41:51 [model.py:1545] Using max model len 32768
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 01-22 22:41:52 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 01-22 22:41:52 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-0] INFO 01-22 22:41:52 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 01-22 22:41:52 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 01-22 22:41:52 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 01-22 22:41:52 [model.py:203] Resolved architecture: Qwen3TTSForConditionalGeneration
[Stage-0] INFO 01-22 22:41:53 [model.py:1545] Using max model len 32768
[Stage-0] INFO 01-22 22:41:53 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=1000000.
[Stage-0] WARNING 01-22 22:41:53 [scheduler.py:271] max_num_batched_tokens (1000000) exceeds max_num_seqs * max_model_len (32768). This may lead to unexpected behavior.
[Stage-0] INFO 01-22 22:41:53 [vllm.py:630] Asynchronous scheduling is disabled.
[Stage-0] WARNING 01-22 22:41:53 [vllm.py:665] Enforce eager set, overriding optimization level to -O0
[Stage-0] INFO 01-22 22:41:53 [vllm.py:765] Cudagraph is disabled under eager mode
[Stage-0] WARNING 01-22 22:41:54 [interface.py:470] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
[Stage-0] WARNING 01-22 22:41:58 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-22 22:41:59 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
(EngineCore_DP0 pid=38269) [Stage-0] INFO 01-22 22:41:59 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', speculative_config=None, tokenizer='Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [1000000], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 
'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=38269) [Stage-0] WARNING 01-22 22:41:59 [multiproc_executor.py:880] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
[Stage-0] WARNING 01-22 22:42:03 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-22 22:42:03 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] WARNING 01-22 22:42:03 [interface.py:470] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
[Stage-0] INFO 01-22 22:42:03 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:41641 backend=nccl
[Stage-0] INFO 01-22 22:42:03 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
/bin/sh: 1: sox: not found
[2026-01-22 22:42:06] WARNING __init__.py:10: SoX could not be found!

    If you do not have SoX, proceed here:
     - - - http://sox.sourceforge.net/ - - -

    If you do (or think that you should) have SoX, double-check your
    path variables.
    

********
Warning: flash-attn is not installed. Will only run the manual PyTorch version. Please install flash-attn for faster inference.
********
 
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:06 [gpu_model_runner.py:3808] Starting to load model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice...
(Worker pid=38382) [Stage-0] WARNING 01-22 22:42:06 [qwen3_tts.py:76] Flash-Attn is not installed. Using default PyTorch attention implementation.
(Worker pid=38382) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:07 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:07 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:07 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:07 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:07 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:08 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:08 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:13 [weight_utils.py:550] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 258.92it/s]
(Worker pid=38382) 
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:13 [default_loader.py:291] Loading weights took 0.00 seconds
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:14 [gpu_model_runner.py:3905] Model loading took 3.89 GiB memory and 7.082754 seconds
(Worker pid=38382) 2026-01-22 22:42:14,104 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:14 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=38382) Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.
(Worker pid=38382) 2026-01-22 22:42:21,983 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:21 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=38382) Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.
(Worker pid=38382) [Stage-0] WARNING 01-22 22:42:23 [gpu_generation_model_runner.py:384] Dummy sampler run is not implemented for generation model
(EngineCore_DP0 pid=38269) [Stage-0] INFO 01-22 22:42:23 [core.py:273] init engine (profile, create kv cache, warmup model) took 9.25 seconds
(EngineCore_DP0 pid=38269) [Stage-0] WARNING 01-22 22:42:24 [scheduler.py:171] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=38269) [Stage-0] WARNING 01-22 22:42:24 [core.py:130] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=38269) [Stage-0] WARNING 01-22 22:42:24 [interface.py:470] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore_DP0 pid=38269) [Stage-0] INFO 01-22 22:42:24 [vllm.py:630] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=38269) [Stage-0] WARNING 01-22 22:42:24 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=38269) [Stage-0] INFO 01-22 22:42:24 [vllm.py:765] Cudagraph is disabled under eager mode
[Stage-0] INFO 01-22 22:42:25 [omni_llm.py:174] Supported_tasks: ['generate']
[Stage-0] INFO 01-22 22:42:25 [omni_stage.py:725] Max batch size: 1
INFO 01-22 22:42:25 [omni.py:311] [Orchestrator] Stage-0 reported ready
INFO 01-22 22:42:25 [omni.py:337] [Orchestrator] All stages initialized successfully
Adding requests:   0%| | 0/1 [00:00<?, ?it/s]
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:25 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=38382) Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.
INFO 01-22 22:42:29 [log_utils.py:550] {'type': 'request_level_metrics',
INFO 01-22 22:42:29 [log_utils.py:550]  'request_id': '0_c4b3c96a-b21b-4892-9f04-8d716fc1def6',
INFO 01-22 22:42:29 [log_utils.py:550]  'e2e_time_ms': 4242.513179779053,
INFO 01-22 22:42:29 [log_utils.py:550]  'e2e_tpt': 192.84150817177513,
INFO 01-22 22:42:29 [log_utils.py:550]  'e2e_total_tokens': 22,
INFO 01-22 22:42:29 [log_utils.py:550]  'transfers_total_time_ms': 0.0,
INFO 01-22 22:42:29 [log_utils.py:550]  'transfers_total_bytes': 0,
INFO 01-22 22:42:29 [log_utils.py:550]  'stages': {0: {'stage_gen_time_ms': 4221.614599227905,
INFO 01-22 22:42:29 [log_utils.py:550]                 'num_tokens_out': 0,
INFO 01-22 22:42:29 [log_utils.py:550]                 'num_tokens_in': 22}}}
Processed prompts: 100%|████████████████████████| 1/1 [00:04<00:00,  4.24s/req, est. speed stage-0 tok/s: 5.19, avg e2e_lat: 0.0ms]
INFO 01-22 22:42:29 [omni.py:832] [Summary] {'e2e_requests': 1,
INFO 01-22 22:42:29 [omni.py:832]  'e2e_total_time_ms': 4243.49308013916,
INFO 01-22 22:42:29 [omni.py:832]  'e2e_sum_time_ms': 4242.513179779053,
INFO 01-22 22:42:29 [omni.py:832]  'e2e_total_tokens': 22,
INFO 01-22 22:42:29 [omni.py:832]  'e2e_avg_time_per_request_ms': 4242.513179779053,
INFO 01-22 22:42:29 [omni.py:832]  'e2e_avg_tokens_per_s': 5.185605575689866,
INFO 01-22 22:42:29 [omni.py:832]  'wall_time_ms': 4243.49308013916,
INFO 01-22 22:42:29 [omni.py:832]  'final_stage_id': {'0_c4b3c96a-b21b-4892-9f04-8d716fc1def6': 0},
INFO 01-22 22:42:29 [omni.py:832]  'stages': [{'stage_id': 0,
INFO 01-22 22:42:29 [omni.py:832]              'requests': 1,
INFO 01-22 22:42:29 [omni.py:832]              'tokens': 22,
INFO 01-22 22:42:29 [omni.py:832]              'total_time_ms': 4242.725610733032,
INFO 01-22 22:42:29 [omni.py:832]              'avg_time_per_request_ms': 4242.725610733032,
INFO 01-22 22:42:29 [omni.py:832]              'avg_tokens_per_s': 5.1853459352510365}],
INFO 01-22 22:42:29 [omni.py:832]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                               | 0/1 [00:04<?, ?it/s]
Request ID: 0_c4b3c96a-b21b-4892-9f04-8d716fc1def6, Saved audio to output_audio/output_0_c4b3c96a-b21b-4892-9f04-8d716fc1def6.wav
[Stage-0] INFO 01-22 22:42:29 [omni_stage.py:773] Received shutdown signal
(Worker pid=38382) [Stage-0] INFO 01-22 22:42:29 [multiproc_executor.py:707] Parent process exited, terminating worker


Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
@david6666666
Collaborator

LGTM

@david6666666 added the ready label (to trigger buildkite CI) on Jan 22, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 332f343e32


Comment on lines +1849 to +1856
    if not local_files_only and not os.path.isdir(pretrained_model_name_or_path):
        download_cache_dir = kwargs.get("cache_dir", cache_dir)
        download_revision = kwargs.get("revision", revision)
        download_weights_from_hf_specific(
            pretrained_model_name_or_path,
            cache_dir=download_cache_dir,
            allow_patterns=["speech_tokenizer/*"],
            revision=download_revision,


P1: Pass auth token when pre-downloading speech_tokenizer

The new pre-download step calls download_weights_from_hf_specific without propagating auth (e.g., token/use_auth_token). For gated or private HF repos, snapshot_download will 401 and raise before the later cached_file(...) call can use the provided auth token. This is a regression for users who relied on passing use_auth_token (or token) to from_pretrained to access private Qwen3 TTS checkpoints. Consider threading the token through to download_weights_from_hf_specific (or skipping the pre-download when auth is required).
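The suggested fix could be sketched roughly as follows. This is a self-contained illustration, not the actual vllm-omni code: `download_weights_from_hf_specific` is replaced by a stub (the real helper wraps `huggingface_hub.snapshot_download`, which accepts a `token` argument), and `predownload_speech_tokenizer` is a hypothetical name for the pre-download step being reviewed.

```python
# Hypothetical sketch of the review suggestion: thread the caller's auth
# token through to the pre-download helper instead of dropping it.

def download_weights_from_hf_specific(repo_id, cache_dir=None,
                                      allow_patterns=None, revision=None,
                                      token=None):
    # Stub standing in for the real helper, which calls
    # huggingface_hub.snapshot_download (its `token` parameter is what
    # authenticates against gated/private repos).
    return {"repo_id": repo_id, "revision": revision, "token": token}

def predownload_speech_tokenizer(repo_id, cache_dir=None, revision=None, **kwargs):
    # Accept either `token` or the legacy `use_auth_token` kwarg, mirroring
    # transformers' from_pretrained signature, and forward it on.
    token = kwargs.get("token", kwargs.get("use_auth_token"))
    return download_weights_from_hf_specific(
        repo_id,
        cache_dir=kwargs.get("cache_dir", cache_dir),
        allow_patterns=["speech_tokenizer/*"],
        revision=kwargs.get("revision", revision),
        token=token,  # forwarded so the download can auth against gated repos
    )
```

With this shape, a caller passing `use_auth_token` (or `token`) to `from_pretrained` would have it reach the pre-download step rather than hitting a 401 before the later `cached_file(...)` call.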


@hsliuustc0106
Collaborator

Please merge to main as well.

@david6666666 david6666666 merged commit a9012a1 into vllm-project:release/v0.14.0rc1 Jan 22, 2026
5 of 7 checks passed
@tzhouam tzhouam deleted the dev/debug_qwen_tts branch February 23, 2026 03:55


Development

Successfully merging this pull request may close these issues.

[Bug]: Issue when running Qwen3 TTS example and let vllm omni to download weight on the fly
