
Conversation

@DajanaV commented Oct 29, 2025

Mirrored from ggml-org/llama.cpp#16826

Allocate pipelines and descriptor sets when requested.

Reallocate the prealloc buffers when needed, and flush any pending work before reallocating.

For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed.

The dryrun is a small but consistent overhead where the GPU is idle. I get an average of maybe 1-2% improvement with it removed, though my numbers have been noisy lately.

I didn't rip out all of the old logic yet; I wanted to keep the diff smaller to make it clearer how the new code works.
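A minimal sketch of the reallocation path described above, using placeholder names (`vk_prealloc_state`, `flush_pending_work`, `reallocate`) rather than the actual ggml-vulkan.cpp identifiers; the idea is that pending command buffers are submitted and waited on before the old allocation is freed, and the sizes recorded during the previous graph execution are reused instead of a dryrun pass:

```cpp
// Hypothetical, simplified sketch -- the struct and function names below are
// placeholders, not the actual ggml-vulkan.cpp identifiers.
#include <cstddef>
#include <functional>

struct vk_prealloc_state {
    size_t prealloc_size            = 0; // current capacity of the scratch buffer
    size_t last_total_mul_mat_bytes = 0; // remembered from the previous graph execution
    size_t last_rms_partials_bytes  = 0; // remembered from the previous graph execution
};

// Grow a preallocated buffer on demand. Any work already recorded against the
// old buffer must be submitted and waited on before the old allocation is freed.
static void ensure_prealloc_buffer(vk_prealloc_state & st, size_t required_bytes,
                                   const std::function<void()> & flush_pending_work,
                                   const std::function<void(size_t)> & reallocate) {
    if (required_bytes <= st.prealloc_size) {
        return;                     // current buffer is already large enough
    }
    flush_pending_work();           // submit queued command buffers and wait for them
    reallocate(required_bytes);     // free the old buffer, allocate a larger one
    st.prealloc_size = required_bytes;
}
```

Under this scheme the first execution of a graph may still trigger a growth, but steady-state decode runs without any dryrun pass.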

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        237.16 ± 3.72 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        196.74 ± 6.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.58 ± 3.47 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       828.66 ± 20.21 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        826.18 ± 8.65 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       380.99 ± 23.22 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       263.97 ± 13.83 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       240.75 ± 14.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       292.99 ± 37.92 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       275.93 ± 20.96 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       357.16 ± 16.19 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |       264.03 ± 10.48 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       312.49 ± 20.09 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.71 ± 0.30 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.06 ± 0.16 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        239.51 ± 7.95 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        197.76 ± 8.86 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.50 ± 4.10 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       824.48 ± 11.02 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      635.92 ± 253.03 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       384.39 ± 23.94 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       260.74 ± 22.91 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       240.84 ± 14.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       301.72 ± 31.62 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       282.66 ± 21.94 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       368.81 ± 12.07 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        270.71 ± 3.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        319.78 ± 3.61 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.07 ± 1.51 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.97 ± 0.16 |

@loci-agentic-ai-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #13

Key Findings

Performance Degradations

The analysis identified minimal performance degradations in standard library functions, not in llama.cpp core components:

  • Response Time: unicode.cpp operator function (+0.082% degradation, 19 ns)
  • Throughput: std::reverse function (+0.090% degradation, 19 ns self-time)
  • Bottleneck: std::_Construct function (+0.131% degradation, 20 ns)

Critical Insight: These degradations are not related to PR #13 changes. All affected functions show identical assembly code between versions, indicating compiler or micro-architectural variations rather than code modifications.

Core Function Impact Assessment

Based on the project structure analysis, PR #13 does not affect llama.cpp's core inference functions:

  • Unaffected: llama_model_load_from_file(), llama_encode(), llama_decode(), tokenization, and sampling functions
  • Modified: Only Vulkan backend implementation (ggml-vulkan.cpp) - a hardware-specific optimization layer
  • Scope: Changes are isolated to GPU acceleration path, not affecting CPU inference or core model operations

Power Consumption Analysis

Negligible Impact: Power consumption shows minimal change across all binaries:

  • libllama.so: -0.0003% change (within measurement noise)
  • Other binaries: No measurable change (0.0%)

The power analysis confirms that PR #13's Vulkan optimizations do not significantly impact overall energy efficiency.

Technical Analysis Results

Flame Graph Analysis: The degraded unicode function shows a simple, flat execution profile (single 18 ns frame) with no branching or function calls, confirming the performance change is at the instruction level rather than algorithmic.

CFG Comparison: Control flow graphs are byte-for-byte identical between versions, eliminating instruction count or logic changes as causes. The 0.02 ns difference represents micro-architectural effects rather than code quality regression.

GitHub Code Review - Critical Issues Identified

PR #13 introduces significant architectural changes to Vulkan backend resource management:

  1. Memory Management Complexity: Removes predictable dryrun allocation in favor of dynamic mid-execution reallocation
  2. Performance Risk: Buffer reallocations during inference execution could cause latency spikes
  3. Resource Contention: Added synchronization points (ggml_vk_wait_for_fence) may reduce parallelism
  4. State Management: Multiple allocation paths increase complexity and potential for race conditions

Actionable Steps

Immediate Actions (High Priority)

  1. Monitor Vulkan Performance: Implement telemetry for buffer reallocation frequency and timing in production workloads
  2. Memory Allocation Profiling: Track GPU memory fragmentation patterns under the new dynamic allocation strategy
  3. Latency Monitoring: Measure 95th/99th percentile inference latencies to detect allocation-induced spikes
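
As one possible shape for items 1 and 3, assuming timings are captured around the backend's submit path, a small illustrative helper (not llama.cpp code) for the percentile check:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative only: compute a percentile from per-token latency samples
// gathered by whatever telemetry wraps the Vulkan submit/reallocation path.
static double percentile(std::vector<double> samples_ms, double p) {
    if (samples_ms.empty()) return 0.0;
    std::sort(samples_ms.begin(), samples_ms.end());
    const size_t idx = static_cast<size_t>(p / 100.0 * (samples_ms.size() - 1));
    return samples_ms[idx];
}

int main() {
    // Fake sample data; a real harness would record one entry per token or per graph.
    std::vector<double> latencies_ms = {4.1, 4.2, 4.0, 4.3, 11.9, 4.1, 4.2, 4.2};
    std::printf("p95 = %.2f ms, p99 = %.2f ms\n",
                percentile(latencies_ms, 95.0), percentile(latencies_ms, 99.0));
    return 0;
}
```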

Medium-Term Optimizations

  1. Buffer Pool Implementation: Add buffer pooling to reduce allocation overhead in Vulkan backend
  2. Allocation Hysteresis: Implement thresholds to prevent oscillating buffer reallocations (see the sketch after this list)
  3. Predictive Sizing: Use historical allocation patterns to better predict buffer requirements
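
A hedged sketch of the hysteresis idea from item 2, with arbitrary thresholds (25% headroom, eight quiet graphs) chosen only for illustration:

```cpp
#include <cstddef>

// Illustrative sketch of an allocation-hysteresis policy (not llama.cpp code):
// grow with headroom so small increases don't each trigger a reallocation, and
// shrink only after the requested size has stayed well below capacity for
// several consecutive graphs.
struct hysteresis_buffer {
    size_t capacity   = 0;
    int    under_used = 0;  // consecutive graphs that needed less than half the capacity

    // Returns the capacity the buffer should have for this graph; when the
    // returned value differs from the previous capacity, the caller reallocates.
    size_t next_capacity(size_t required) {
        if (required > capacity) {
            under_used = 0;
            capacity = required + required / 4;   // grow with 25% headroom
        } else if (required < capacity / 2) {
            if (++under_used >= 8) {              // shrink only after 8 quiet graphs
                under_used = 0;
                capacity = required + required / 4;
            }
        } else {
            under_used = 0;
        }
        return capacity;
    }
};
```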

Investigation Areas

  1. Compiler Analysis: The unicode function degradations warrant compiler flag verification between builds
  2. Vulkan Backend Testing: Comprehensive testing of edge cases in dynamic allocation scenarios
  3. Memory Usage Patterns: Analysis of peak memory usage under new allocation strategy
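
For item 3, a minimal illustrative tracker (again not llama.cpp code) that, if wired into the Vulkan buffer allocation and free paths, would record the high-water mark of device memory held under the new strategy:

```cpp
#include <atomic>
#include <cstddef>

// Illustrative peak-usage tracker; thread-safe so it can sit on any allocation path.
struct peak_tracker {
    std::atomic<size_t> current{0};
    std::atomic<size_t> peak{0};

    void on_alloc(size_t bytes) {
        const size_t now = current.fetch_add(bytes) + bytes;
        size_t prev = peak.load();
        // Raise the recorded peak if this allocation set a new high-water mark.
        while (now > prev && !peak.compare_exchange_weak(prev, now)) {
        }
    }

    void on_free(size_t bytes) {
        current.fetch_sub(bytes);
    }
};
```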

Overall Assessment

Change Impact Evaluation

Positive Aspects:

  • Removing the dryrun eliminates a small but consistent stretch of GPU idle time, yielding roughly 1-2% average token-generation improvement in the author's benchmarks
  • Changes are isolated to the Vulkan backend (ggml-vulkan.cpp), leaving CPU inference and core model operations untouched

Risk Factors:

  • Increased complexity in Vulkan memory management introduces potential performance variability
  • Dynamic allocation strategy may cause unpredictable latency patterns
  • More complex code paths increase maintenance burden for GPU-specific optimizations

Maintainability and Future Considerations

Code Quality: The PR maintains good separation of concerns by isolating changes to the Vulkan backend. However, the shift from predictable to dynamic resource allocation increases cognitive complexity for future maintainers.

Performance Predictability: While average performance improves, the new allocation strategy may introduce performance variability that could affect real-time inference applications.

Scalability: The dynamic allocation approach may not scale well under high-concurrency scenarios where multiple inference requests compete for GPU resources.

Recommendation: The changes are technically sound and provide measurable benefits. However, production deployment should include comprehensive monitoring of allocation patterns and latency distributions to ensure the performance improvements are sustained across diverse workloads without introducing unacceptable variability.

@DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a on October 29, 2025 at 12:13
@DajanaV added the dev-stale label (Stale dev environment — dashboard not accessible) on Oct 30, 2025
@DajanaV closed this on Oct 30, 2025
@DajanaV deleted the upstream-PR16826-branch_jeffbolznv-dryrun branch on October 30, 2025 at 15:25
@DajanaV deleted the branch main on October 30, 2025 at 15:25