UPSTREAM PR #17675: vulkan: enable mmvq for q2_k on NVIDIA (#399)

Open

loci-dev wants to merge 1 commit into main from upstream-PR17675-branch_jeffbolznv-mmvq_q2k

Conversation

@loci-dev loci-dev commented Dec 2, 2025

Mirrored from ggml-org/llama.cpp#17675

See ggml-org/llama.cpp#16900 (comment).

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        947.00 ± 9.72 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        310.39 ± 5.99 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        273.41 ± 3.16 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        474.11 ± 6.99 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        163.50 ± 8.40 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        109.54 ± 0.35 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       1006.78 ± 5.04 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        321.59 ± 6.88 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       308.31 ± 15.10 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        524.19 ± 3.78 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        183.29 ± 9.55 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        124.85 ± 0.24 |


loci-review bot commented Dec 2, 2025

Explore the complete analysis inside the Version Insights

Pull Request Performance Summary

PR #399: Enable MMVQ for Q2_K on NVIDIA GPUs

This PR modifies the ggml_vk_should_use_mmvq function in ggml-vulkan.cpp to enable quantized matrix-vector multiplication (MMVQ) kernels for Q2_K quantization on NVIDIA GPUs. The change adds an early-return condition that bypasses the existing dimension check for Q2_K, forcing the MMVQ path regardless of matrix size.

Key Findings

Performance-Critical Area Impact:

The modification affects the Vulkan backend's kernel selection logic within the matrix multiplication pipeline. The ggml_vk_should_use_mmvq function is called during graph compilation to select the kernel implementation for ggml_mul_mat operations. Benchmark data shows throughput gains ranging from roughly 15 tokens per second (Qwen2.5 7B Q2_K on RTX 4070) up to roughly 60 tokens per second (Llama 1B Q2_K on RTX 5090).

Inference Performance:

The change does not directly alter core inference functions such as llama_decode, llama_encode, or llama_tokenize. It operates at the Vulkan backend's kernel selection layer, affecting only matrix multiplication operations for Q2_K quantized models on NVIDIA hardware. Token generation throughput increases by roughly 4 to 14 percent across the tested configurations, but the improvement is confined to the Vulkan backend execution path and does not change the interfaces of high-level inference functions.

Power Consumption:

Power consumption analysis is not applicable for this change, as the modification affects kernel selection logic rather than the computational workload itself. The MMVQ kernels perform functionally equivalent operations with improved efficiency through better use of integer dot-product hardware on NVIDIA GPUs (note the "int dot: 1" capability in the device logs above).

Scope:

The change is vendor-specific (NVIDIA only) and quantization-specific (Q2_K only), with no impact on other backends, quantization formats, or GPU vendors.

loci-dev force-pushed the main branch 21 times, most recently from 738bfbf to f01b714 on December 4, 2025 09:11
loci-dev force-pushed the main branch 30 times, most recently from 3f5e1ff to 6f5d23d on December 9, 2025 04:14