UPSTREAM PR #17675: vulkan: enable mmvq for q2_k on NVIDIA (#399)

Open

loci-dev wants to merge 1 commit into main from upstream-PR17675-branch_jeffbolznv-mmvq_q2k

Conversation

@loci-dev loci-dev commented Dec 2, 2025

Mirrored from ggml-org/llama.cpp#17675

See ggml-org/llama.cpp#16900 (comment).

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        947.00 ± 9.72 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        310.39 ± 5.99 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        273.41 ± 3.16 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        474.11 ± 6.99 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        163.50 ± 8.40 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        109.54 ± 0.35 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       1006.78 ± 5.04 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        321.59 ± 6.88 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       308.31 ± 15.10 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        524.19 ± 3.78 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        183.29 ± 9.55 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        124.85 ± 0.24 |


loci-review bot commented Dec 2, 2025

Explore the complete analysis inside the Version Insights

Pull Request Performance Summary

PR #399: Enable MMVQ for Q2_K on NVIDIA GPUs

This PR modifies the ggml_vk_should_use_mmvq function in ggml-vulkan.cpp to enable quantized matrix-vector multiplication (MMVQ) kernels for Q2_K quantization on NVIDIA GPUs. The change adds an early-return condition that bypasses the existing dimension check for Q2_K, forcing the MMVQ path regardless of matrix size.

Key Findings

Performance-Critical Area Impact:

The modification affects the Vulkan backend's kernel selection logic within the matrix multiplication pipeline. The ggml_vk_should_use_mmvq function is called during graph compilation to select the kernel implementation for ggml_mul_mat operations. Benchmark data shows throughput gains ranging from roughly 15 tokens per second (Qwen2.5 7B Q2_K on RTX 4070) up to roughly 60 tokens per second (Llama 1B Q2_K on RTX 5090).

Inference Performance:

The change does not directly alter core inference functions such as llama_decode, llama_encode, or llama_tokenize. It operates at the Vulkan backend's kernel selection layer, affecting only matrix multiplication operations for Q2_K quantized models on NVIDIA hardware. Token generation throughput increases by roughly 4 to 14 percent across the tested configurations, but the improvement is confined to the Vulkan backend execution path and does not change the interfaces of high-level inference functions.

Power Consumption:

Power consumption analysis is not applicable for this change, as the modification affects kernel selection logic rather than the computational workload itself. The MMVQ kernels perform functionally equivalent operations with improved efficiency through better use of integer dot-product hardware on NVIDIA GPUs (note the "int dot: 1" capability in the device logs above).

Scope:

The change is vendor-specific (NVIDIA only) and quantization-specific (Q2_K only), with no impact on other backends, quantization formats, or GPU vendors.

loci-dev force-pushed the main branch 21 times, most recently from 738bfbf to f01b714 on December 4, 2025 09:11
loci-dev force-pushed the main branch 30 times, most recently from 3f5e1ff to 6f5d23d on December 9, 2025 04:14