model: add Qwen3-Omni Thinker support (qwen3omnimoe)#18420
TrevorS wants to merge 2 commits into ggml-org:master
Conversation
Nice job. I think deepcopy might not be needed, since you're not modifying anything nested.
Add support for Qwen3-Omni Thinker, a 48-layer MoE model with 128 experts (8 active per token) and an optional shared expert. This enables text-only inference as the foundation for full multimodal support.

Key changes:
- New architecture: LLM_ARCH_QWEN3OMNIMOE
- GGUF conversion with nested thinker_config handling
- IMRoPE (Interleaved M-RoPE) with sections [24, 20, 20, 0]
- Shared expert support in the qwen3vl-moe graph builder
- Reuses llm_build_qwen3vlmoe for graph construction
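As a rough illustration of what the mrope sections `[24, 20, 20, 0]` mean (this is a sketch, not the llama.cpp implementation): the rotary dimensions are partitioned among position channels, so some dims rotate with the temporal position, some with height, and some with width.

```python
# Sketch: map each rotary dim to the position channel it reads from.
# sections = [24, 20, 20, 0]: 24 dims use the temporal position,
# 20 the height, 20 the width, and 0 are unused here.
sections = [24, 20, 20, 0]

def mrope_channel_per_dim(sections):
    """Return, for each rotary dim, which position channel (t/h/w/extra) it uses."""
    chan = []
    for ch, n in enumerate(sections):
        chan.extend([ch] * n)
    return chan

chan = mrope_channel_per_dim(sections)
# The "interleaved" variant (IMRoPE) cycles through the channels per dim
# instead of laying them out in contiguous blocks, but the per-channel
# dim counts stay the same.
```

For text-only inference all three channels carry the same token position, so the result degenerates to ordinary RoPE.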
Address review feedback:
- Rename class to Qwen3OmniMoeModel, inherit from Qwen2MoeModel
- Remove __init__ override (thinker_config handled at L720-722)
- Remove set_gguf_parameters (mrope_section via rope_scaling)

Keep set_vocab for EOS/PAD: Qwen3-Omni lacks tokenizer.json (it uses vocab.json + merges.txt), so SpecialVocab can't discover the token IDs automatically.
5969085 to d4ee36e
```python
# Qwen3-Omni lacks tokenizer.json, so token IDs must be set explicitly
self.gguf_writer.add_eos_token_id(151645)  # <|im_end|> - required for generation
self.gguf_writer.add_pad_token_id(151643)  # <|endoftext|> - required for batching
```
The comment is incorrect; it's because they are, for some reason, explicitly set to null in config.json.
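To illustrate the reviewer's point with a hypothetical sketch (the key names match config.json, but the fallback values here are the ones hard-coded in this PR): a null in config.json loads as `None`, so a converter reading it has to supply the IDs itself.

```python
import json

# Hypothetical sketch: Qwen3-Omni's config.json sets these IDs to null,
# which json parses as None, so the converter must hard-code them.
cfg = json.loads('{"eos_token_id": null, "pad_token_id": null}')

eos_id = cfg["eos_token_id"] if cfg["eos_token_id"] is not None else 151645  # <|im_end|>
pad_id = cfg["pad_token_id"] if cfg["pad_token_id"] is not None else 151643  # <|endoftext|>
```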
```cpp
        layer.ffn_up_exps = create_tensor(tn(LLM_TENSOR_FFN_UP_EXPS, "weight", i), { n_embd, n_ff_exp, n_expert }, 0);
    }
} break;
case LLM_ARCH_QWEN3OMNIMOE:
```
Since this is only Qwen3VLMoe with shared experts added, and you are adding shared-expert support to qwen3vl-moe.cpp, I suggest you do the same here instead of duplicating code.
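For intuition on what the shared expert adds to the routed MoE layer, here is a toy NumPy sketch. The shapes and scale (4 experts, top-2) are illustrative stand-ins for the real 128-expert / top-8 layer, a ReLU FFN stands in for the actual gated FFN, and the real model may also gate the shared expert; none of this mirrors the llama.cpp graph code.

```python
import numpy as np

# Toy MoE layer with an always-on shared expert (assumed shapes, ReLU FFN).
rng = np.random.default_rng(0)
n_embd, n_ff, n_expert, n_active = 8, 16, 4, 2

w_up     = rng.standard_normal((n_expert, n_embd, n_ff)) * 0.1
w_down   = rng.standard_normal((n_expert, n_ff, n_embd)) * 0.1
w_router = rng.standard_normal((n_embd, n_expert)) * 0.1
w_up_shexp   = rng.standard_normal((n_embd, n_ff)) * 0.1
w_down_shexp = rng.standard_normal((n_ff, n_embd)) * 0.1

def moe_shared(x):
    logits = x @ w_router
    top = np.argsort(logits)[-n_active:]                 # top-k routed experts
    probs = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros(n_embd)
    for p, e in zip(probs, top):
        out += p * (np.maximum(x @ w_up[e], 0.0) @ w_down[e])
    # shared expert runs for every token and is summed into the routed output
    out += np.maximum(x @ w_up_shexp, 0.0) @ w_down_shexp
    return out

y = moe_shared(rng.standard_normal(n_embd))
```

Because the shared-expert path is just an extra FFN summed into the routed output, it slots into the existing qwen3vl-moe graph builder with a conditional branch rather than a new architecture-specific build function.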
If I understand correctly, Qwen3-Omni is just Qwen3-VL with a Whisper encoder for audio. There is no need to introduce this many changes; the conversion script can simply mark this info. Besides, I don't feel comfortable using AI for anything related to mtmd, it generates too much redundant and overkill code. I will replace this PR with another approach which is much simpler.
Appreciate the feedback, thanks!
Hello @ngxson, I'm back! How does this look for the first PR? I'm open to any feedback.
Original Model: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
GGUFs: https://huggingface.co/TrevorJS/Qwen3-Omni-30B-A3B-GGUF
This PR implements the thinker model only, providing just text -> text.

thinker-f16 on dgx-spark:

AI Disclosure
AI was used to write this code, but it was then reviewed, tested, and benchmarked by a human!