
UPSTREAM PR #18420: model: add Qwen3-Omni Thinker support (qwen3omnimoe)#725

Open
loci-dev wants to merge 2 commits into main from upstream-PR18420-branch_TrevorS-pr1-qwen3omnimoe

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18420

Hello @ngxson, I'm back! How does this look for the first PR? I'm open to any feedback.

Original Model: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
GGUFs: https://huggingface.co/TrevorJS/Qwen3-Omni-30B-A3B-GGUF

This PR implements the Thinker model only, providing text -> text generation.

thinker-f16 on dgx-spark:

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |      1856.94 ± 11.77 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         34.88 ± 0.06 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |       1692.98 ± 4.34 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         32.07 ± 0.12 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       1552.70 ± 1.64 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         29.64 ± 0.14 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |       1304.71 ± 2.41 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         26.26 ± 0.03 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |       1001.73 ± 1.68 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         21.43 ± 0.02 |
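For context, a quick sketch quantifying how much throughput drops between an empty context and the deepest depth tested (values copied from the llama-bench table above):

```python
# Relative throughput drop at d32768, using the pp2048 and tg32
# figures from the llama-bench table above.

def drop_pct(base: float, deep: float) -> float:
    """Percentage decrease from base to deep."""
    return (base - deep) / base * 100

tg_drop = drop_pct(34.88, 21.43)      # token generation: tg32 vs tg32 @ d32768
pp_drop = drop_pct(1856.94, 1001.73)  # prompt processing: pp2048 vs pp2048 @ d32768

print(f"tg32 drop at d32768:   {tg_drop:.1f}%")  # ~38.6%
print(f"pp2048 drop at d32768: {pp_drop:.1f}%")  # ~46.1%
```

So prompt processing degrades somewhat faster with context depth than token generation does, which is the expected shape for attention-bound workloads.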
```
Loading model...

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b0-unknown
model      : thinker-f16.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file

> Why write smaller PRs? Respond with less than 10 words.

Easier to review, test, and merge quickly.

[ Prompt: 68.6 t/s | Generation: 31.5 t/s ]

>
```

AI Disclosure

AI was used to write this code, but it was then reviewed, tested, and benchmarked by a human!

@loci-review

loci-review bot commented Dec 28, 2025

Explore the complete analysis inside the Version Insights

Perfect! I've successfully retrieved the summary report for your project. The report shows a performance analysis for Pull Request #725 in the llama.cpp repository (auroralabs-loci).

Key Highlights:

  1. Most Critical Issue: The std::vector::end() function shows a significant 226% increase in response time (from 81.11ns to 264.40ns)

  2. Affected Areas: Most performance impacts are in STL container operations, particularly:

    • Vector operations
    • Hash table operations
    • Tree and deque operations
  3. Interesting Pattern: While response times increased, throughput also increased in most cases, which might indicate changes in parallelization or workload distribution

  4. Top Recommendation: Investigate changes to vector iteration patterns and STL container usage in PR UPSTREAM PR #18420: model: add Qwen3-Omni Thinker support (qwen3omnimoe) #725

Would you like me to provide more detailed information about any specific function or aspect of this performance analysis?
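The headline "226%" figure is consistent with the raw timings the bot reports, which a one-liner confirms:

```python
# Sanity-check the reported 226% increase in std::vector::end()
# response time (81.11 ns -> 264.40 ns) from the summary above.
before_ns = 81.11
after_ns = 264.40
increase_pct = (after_ns - before_ns) / before_ns * 100
print(f"increase: {increase_pct:.0f}%")  # → 226%
```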
