@jeffbolznv (Collaborator):
This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.

This helps throughput by a couple of percent on models where it applies; a sketch of the fused pattern and before/after numbers follow.
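For context, the op chain being fused corresponds roughly to the following ggml graph-building sequence. This is a minimal sketch assuming a Qwen3-style per-head K norm: the function name, tensor names, and RoPE arguments are illustrative placeholders, not the actual llama.cpp graph code.

```c
#include "ggml.h"

// Sketch of the pattern this PR fuses end-to-end:
//   RMS_NORM -> MUL -> ROPE -> VIEW -> SET_ROWS
static struct ggml_tensor * build_k_path(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cur,    // [n_embd_head, n_head_kv, n_tokens]
        struct ggml_tensor  * k_norm_w, // per-head RMS norm weight
        struct ggml_tensor  * pos,      // I32 token positions for RoPE
        struct ggml_tensor  * k_cache,  // destination K cache tensor
        struct ggml_tensor  * row_ids,  // I64 destination rows in the cache
        int n_rot, float eps, float freq_base) {
    struct ggml_tensor * k = ggml_rms_norm(ctx, k_cur, eps);         // RMS_NORM ...
    k = ggml_mul(ctx, k, k_norm_w);                                  // ... fused with MUL
    k = ggml_rope_ext(ctx, k, pos, NULL, n_rot, GGML_ROPE_TYPE_NEOX, // ROPE
                      0, freq_base, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f);
    // Flatten heads so each token is one cache row, then scatter.
    k = ggml_view_2d(ctx, k, k->ne[0]*k->ne[1], k->ne[2],            // VIEW
                     (size_t)(k->ne[0]*k->ne[1])*ggml_element_size(k), 0);
    return ggml_set_rows(ctx, k_cache, k, row_ids);                  // SET_ROWS
}
```

Previously the backend fused rms_norm+mul and rope+view+set_rows as two separate groups; with this change the whole chain can run as one fused sequence, avoiding the intermediate round trip through memory between the two groups.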

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.45 ± 11.28 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        271.92 ± 3.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        273.37 ± 1.46 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.13 ± 1.33 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.37 ± 1.21 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        134.55 ± 3.63 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.55 ± 0.32 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.51 ± 0.51 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.49 ± 0.43 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.57 ± 0.33 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       472.71 ± 38.19 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       478.29 ± 10.31 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       468.20 ± 16.25 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        483.64 ± 2.21 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        484.80 ± 2.20 |

build: 1ae74882f (6913)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.94 ± 17.36 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        275.08 ± 2.82 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.42 ± 1.60 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.48 ± 1.57 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        277.64 ± 0.90 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.12 ± 3.41 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.25 ± 3.01 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.57 ± 0.47 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.33 ± 0.56 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.56 ± 0.53 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       482.35 ± 47.67 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       488.51 ± 11.51 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        487.90 ± 6.93 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        494.89 ± 4.11 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        496.09 ± 5.04 |

build: b74de9b7b (6915)

```diff
@@ -14236,30 +14492,18 @@ size_t comp_size;
 size_t comp_nb[GGML_MAX_DIMS];
 size_t check_counter = 0;
 static void ggml_vk_check_results_0(ggml_backend_vk_context * ctx, ggml_cgraph * cgraph, int tensor_idx) {
```
@jeffbolznv (Collaborator, Author) commented:

The changes from here to the end of the file are the same as in #16919, I'll rebase once that lands.
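
For context (an aside, not part of the author's comment): ggml_vk_check_results_0 belongs to the Vulkan backend's result-checking debug path, which re-runs ops on the CPU backend and compares the outputs; it is only compiled in when the build enables the GGML_VULKAN_CHECK_RESULTS CMake option (`-DGGML_VULKAN_CHECK_RESULTS=ON`).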
