
UPSTREAM PR #18420: model: add Qwen3-Omni Thinker support (qwen3omnimoe)#725

Open
loci-dev wants to merge 2 commits into main from upstream-PR18420-branch_TrevorS-pr1-qwen3omnimoe

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18420

Hello @ngxson, I'm back! How does this look for the first PR? I'm open to any feedback.

Original Model: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
GGUFs: https://huggingface.co/TrevorJS/Qwen3-Omni-30B-A3B-GGUF

This PR implements the Thinker model only, providing text -> text generation.

thinker-f16 on dgx-spark:

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |      1856.94 ± 11.77 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         34.88 ± 0.06 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |       1692.98 ± 4.34 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         32.07 ± 0.12 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       1552.70 ± 1.64 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         29.64 ± 0.14 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |       1304.71 ± 2.41 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         26.26 ± 0.03 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |       1001.73 ± 1.68 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         21.43 ± 0.02 |
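For context, a quick sketch quantifying how much throughput drops between an empty context and the deepest depth tested (values copied from the llama-bench table above):

```python
# Relative throughput drop at d32768, using the pp2048 and tg32
# figures from the llama-bench table above.

def drop_pct(base: float, deep: float) -> float:
    """Percentage decrease from base to deep."""
    return (base - deep) / base * 100

tg_drop = drop_pct(34.88, 21.43)      # token generation: tg32 vs tg32 @ d32768
pp_drop = drop_pct(1856.94, 1001.73)  # prompt processing: pp2048 vs pp2048 @ d32768

print(f"tg32 drop at d32768:   {tg_drop:.1f}%")  # ~38.6%
print(f"pp2048 drop at d32768: {pp_drop:.1f}%")  # ~46.1%
```

So prompt processing degrades somewhat faster with context depth than token generation does, which is the expected shape for attention-bound workloads.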
```
Loading model...

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b0-unknown
model      : thinker-f16.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file

> Why write smaller PRs? Respond with less than 10 words.

Easier to review, test, and merge quickly.

[ Prompt: 68.6 t/s | Generation: 31.5 t/s ]

>
```

AI Disclosure

AI was used to write this code, but it was then reviewed, tested, and benchmarked by a human!

@loci-review

loci-review bot commented Dec 28, 2025

Explore the complete analysis inside the Version Insights

Perfect! I've successfully retrieved the summary report for your project. The report shows a performance analysis for Pull Request #725 in the llama.cpp repository (auroralabs-loci).

Key Highlights:

  1. Most Critical Issue: The std::vector::end() function shows a significant 226% increase in response time (from 81.11ns to 264.40ns)

  2. Affected Areas: Most performance impacts are in STL container operations, particularly:

    • Vector operations
    • Hash table operations
    • Tree and deque operations
  3. Interesting Pattern: While response times increased, throughput also increased in most cases, which might indicate changes in parallelization or workload distribution

  4. Top Recommendation: Investigate changes to vector iteration patterns and STL container usage in PR UPSTREAM PR #18420: model: add Qwen3-Omni Thinker support (qwen3omnimoe) #725

Would you like me to provide more detailed information about any specific function or aspect of this performance analysis?
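The headline "226%" figure is consistent with the raw timings the bot reports, which a one-liner confirms:

```python
# Sanity-check the reported 226% increase in std::vector::end()
# response time (81.11 ns -> 264.40 ns) from the summary above.
before_ns = 81.11
after_ns = 264.40
increase_pct = (after_ns - before_ns) / before_ns * 100
print(f"increase: {increase_pct:.0f}%")  # → 226%
```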
