
vulkan: Allow non-pow2 n_experts in topk_moe#17872

Merged: 0cc4m merged 1 commit into ggml-org:master from jeffbolznv:topk_moe_np2 on Dec 13, 2025
Conversation

@jeffbolznv (Contributor)

I saw granite-3.0-3b-a800m-instruct-Q8_0.gguf being used at https://www.phoronix.com/review/llama-cpp-vulkan-eoy2025/3, with lower than expected scaling on the 5090. This change lets the topk_moe shader handle a non-power-of-two number of experts, so this model can take the fused path.
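For context, a common way to let a reduction-style top-k kernel accept a non-power-of-two expert count is to pad the logits up to the next power of two with `-inf` sentinels, so the reduction tree still operates on a pow2 element count while the padding can never be selected. The sketch below models that idea in Python; it is a hypothetical illustration of the general technique, not the actual GLSL shader code from this PR, and `next_pow2`/`topk_experts_pow2` are names invented here.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n (assumes n >= 1)."""
    return 1 << (n - 1).bit_length()

def topk_experts_pow2(logits, k):
    """Select the indices of the k largest logits.

    Pads the input to a power-of-two length with -inf sentinels,
    mimicking a GPU kernel whose reduction assumes a pow2 element
    count. Padding entries can never win a max reduction, so the
    result matches top-k over the original (non-pow2) expert count.
    """
    n = len(logits)
    padded = list(logits) + [float("-inf")] * (next_pow2(n) - n)
    picked = []
    for _ in range(k):
        # Stand-in for the shader's parallel max reduction.
        best = max(range(len(padded)), key=lambda i: padded[i])
        picked.append(best)
        padded[best] = float("-inf")  # exclude for the next round
    return picked
```

For example, granite-3.0-3b-a800m routes over a non-power-of-two number of experts, so a pow2-only kernel would previously have fallen back to the unfused path; with padding, `topk_experts_pow2(logits, k)` over 40 experts internally works on 64 slots.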

before:

```
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -fa 1 -p 512 -n 128 --prio 1 -r 10 -m c:\models\granite-3.0-3b-a800m-instruct-Q8_0.gguf
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitemoe 3B Q8_0             |   3.34 GiB |     3.37 B | Vulkan     |  99 |  1 |           pp512 |    11545.75 ± 199.63 |
| granitemoe 3B Q8_0             |   3.34 GiB |     3.37 B | Vulkan     |  99 |  1 |           tg128 |        305.26 ± 3.35 |

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitemoe 3B Q8_0             |   3.34 GiB |     3.37 B | Vulkan     |  99 |  1 |           pp512 |   26586.25 ± 3416.45 |
| granitemoe 3B Q8_0             |   3.34 GiB |     3.37 B | Vulkan     |  99 |  1 |           tg128 |        455.17 ± 6.23 |
```

after:

```
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -fa 1 -p 512 -n 128 --prio 1 -r 10 -m c:\models\granite-3.0-3b-a800m-instruct-Q8_0.gguf
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitemoe 3B Q8_0             |   3.34 GiB |     3.37 B | Vulkan     |  99 |  1 |           pp512 |    11704.54 ± 179.34 |
| granitemoe 3B Q8_0             |   3.34 GiB |     3.37 B | Vulkan     |  99 |  1 |           tg128 |        325.64 ± 1.62 |

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitemoe 3B Q8_0             |   3.34 GiB |     3.37 B | Vulkan     |  99 |  1 |           pp512 |    28154.26 ± 504.16 |
| granitemoe 3B Q8_0             |   3.34 GiB |     3.37 B | Vulkan     |  99 |  1 |           tg128 |       521.35 ± 10.23 |
```

@github-actions github-actions bot added labels testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) on Dec 9, 2025
@0cc4m 0cc4m merged commit 07a10c1 into ggml-org:master Dec 13, 2025
77 of 78 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
