
Conversation

@DajanaV commented Oct 29, 2025

Mirrored from ggml-org/llama.cpp#16826

Allocate pipelines and descriptor sets when requested.

Reallocate the prealloc buffers when needed, and flush any pending work before reallocating.

For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed.

The dryrun is a small but consistent overhead where the GPU is idle. I get an average of maybe 1-2% improvement with it removed, though my numbers have been noisy lately.

I didn't rip out all of the old logic yet; I wanted to keep the diff smaller to make it clearer how the new code works.
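A minimal sketch of the reallocation path described above, using placeholder names (`vk_prealloc_state`, `flush_pending_work`, `reallocate`) rather than the actual ggml-vulkan.cpp identifiers; the idea is that pending command buffers are submitted and waited on before the old allocation is freed, and the sizes recorded during the previous graph execution are reused instead of a dryrun pass:

```cpp
// Hypothetical, simplified sketch -- the struct and function names below are
// placeholders, not the actual ggml-vulkan.cpp identifiers.
#include <cstddef>
#include <functional>

struct vk_prealloc_state {
    size_t prealloc_size            = 0; // current capacity of the scratch buffer
    size_t last_total_mul_mat_bytes = 0; // remembered from the previous graph execution
    size_t last_rms_partials_bytes  = 0; // remembered from the previous graph execution
};

// Grow a preallocated buffer on demand. Any work already recorded against the
// old buffer must be submitted and waited on before the old allocation is freed.
static void ensure_prealloc_buffer(vk_prealloc_state & st, size_t required_bytes,
                                   const std::function<void()> & flush_pending_work,
                                   const std::function<void(size_t)> & reallocate) {
    if (required_bytes <= st.prealloc_size) {
        return;                     // current buffer is already large enough
    }
    flush_pending_work();           // submit queued command buffers and wait for them
    reallocate(required_bytes);     // free the old buffer, allocate a larger one
    st.prealloc_size = required_bytes;
}
```

Under this scheme the first execution of a graph may still trigger a growth, but steady-state decode runs without any dryrun pass.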

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        237.16 ± 3.72 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        196.74 ± 6.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.58 ± 3.47 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       828.66 ± 20.21 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        826.18 ± 8.65 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       380.99 ± 23.22 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       263.97 ± 13.83 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       240.75 ± 14.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       292.99 ± 37.92 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       275.93 ± 20.96 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       357.16 ± 16.19 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |       264.03 ± 10.48 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       312.49 ± 20.09 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.71 ± 0.30 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.06 ± 0.16 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        239.51 ± 7.95 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        197.76 ± 8.86 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.50 ± 4.10 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       824.48 ± 11.02 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      635.92 ± 253.03 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       384.39 ± 23.94 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       260.74 ± 22.91 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       240.84 ± 14.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       301.72 ± 31.62 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       282.66 ± 21.94 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       368.81 ± 12.07 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        270.71 ± 3.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        319.78 ± 3.61 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.07 ± 1.51 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.97 ± 0.16 |

@loci-agentic-ai-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #13

Key Findings

Performance Degradations

The analysis identified minimal performance degradations in standard library functions, not in llama.cpp core components:

  • Response Time: unicode.cpp operator function (+0.082% degradation, 19 ns)
  • Throughput: std::reverse function (+0.090% degradation, 19 ns self-time)
  • Bottleneck: std::_Construct function (+0.131% degradation, 20 ns)

Critical Insight: These degradations are not related to PR #13 changes. All affected functions show identical assembly code between versions, indicating compiler or micro-architectural variations rather than code modifications.

Core Function Impact Assessment

Based on the project structure analysis, PR #13 does not affect llama.cpp's core inference functions:

  • Unaffected: llama_model_load_from_file(), llama_encode(), llama_decode(), tokenization, and sampling functions
  • Modified: Only Vulkan backend implementation (ggml-vulkan.cpp) - a hardware-specific optimization layer
  • Scope: Changes are isolated to GPU acceleration path, not affecting CPU inference or core model operations

Power Consumption Analysis

Negligible Impact: Power consumption shows minimal change across all binaries:

  • libllama.so: -0.0003% change (within measurement noise)
  • Other binaries: No measurable change (0.0%)

The power analysis confirms that PR #13's Vulkan optimizations do not significantly impact overall energy efficiency.

Technical Analysis Results

Flame Graph Analysis: The degraded unicode function shows a simple, flat execution profile (single 18 ns frame) with no branching or function calls, confirming the performance change is at the instruction level rather than algorithmic.

CFG Comparison: Control flow graphs are byte-for-byte identical between versions, eliminating instruction count or logic changes as causes. The 0.02 ns difference represents micro-architectural effects rather than code quality regression.

GitHub Code Review - Critical Issues Identified

PR #13 introduces significant architectural changes to Vulkan backend resource management:

  1. Memory Management Complexity: Removes predictable dryrun allocation in favor of dynamic mid-execution reallocation
  2. Performance Risk: Buffer reallocations during inference execution could cause latency spikes
  3. Resource Contention: Added synchronization points (ggml_vk_wait_for_fence) may reduce parallelism
  4. State Management: Multiple allocation paths increase complexity and potential for race conditions

Actionable Steps

Immediate Actions (High Priority)

  1. Monitor Vulkan Performance: Implement telemetry for buffer reallocation frequency and timing in production workloads
  2. Memory Allocation Profiling: Track GPU memory fragmentation patterns under the new dynamic allocation strategy
  3. Latency Monitoring: Measure 95th/99th percentile inference latencies to detect allocation-induced spikes
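
As one possible shape for items 1 and 3, assuming timings are captured around the backend's submit path, a small illustrative helper (not llama.cpp code) for the percentile check:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative only: compute a percentile from per-token latency samples
// gathered by whatever telemetry wraps the Vulkan submit/reallocation path.
static double percentile(std::vector<double> samples_ms, double p) {
    if (samples_ms.empty()) return 0.0;
    std::sort(samples_ms.begin(), samples_ms.end());
    const size_t idx = static_cast<size_t>(p / 100.0 * (samples_ms.size() - 1));
    return samples_ms[idx];
}

int main() {
    // Fake sample data; a real harness would record one entry per token or per graph.
    std::vector<double> latencies_ms = {4.1, 4.2, 4.0, 4.3, 11.9, 4.1, 4.2, 4.2};
    std::printf("p95 = %.2f ms, p99 = %.2f ms\n",
                percentile(latencies_ms, 95.0), percentile(latencies_ms, 99.0));
    return 0;
}
```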

Medium-Term Optimizations

  1. Buffer Pool Implementation: Add buffer pooling to reduce allocation overhead in Vulkan backend
  2. Allocation Hysteresis: Implement thresholds to prevent oscillating buffer reallocations (see the sketch after this list)
  3. Predictive Sizing: Use historical allocation patterns to better predict buffer requirements
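
A hedged sketch of the hysteresis idea from item 2, with arbitrary thresholds (25% headroom, eight quiet graphs) chosen only for illustration:

```cpp
#include <cstddef>

// Illustrative sketch of an allocation-hysteresis policy (not llama.cpp code):
// grow with headroom so small increases don't each trigger a reallocation, and
// shrink only after the requested size has stayed well below capacity for
// several consecutive graphs.
struct hysteresis_buffer {
    size_t capacity   = 0;
    int    under_used = 0;  // consecutive graphs that needed less than half the capacity

    // Returns the capacity the buffer should have for this graph; when the
    // returned value differs from the previous capacity, the caller reallocates.
    size_t next_capacity(size_t required) {
        if (required > capacity) {
            under_used = 0;
            capacity = required + required / 4;   // grow with 25% headroom
        } else if (required < capacity / 2) {
            if (++under_used >= 8) {              // shrink only after 8 quiet graphs
                under_used = 0;
                capacity = required + required / 4;
            }
        } else {
            under_used = 0;
        }
        return capacity;
    }
};
```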

Investigation Areas

  1. Compiler Analysis: The unicode function degradations warrant compiler flag verification between builds
  2. Vulkan Backend Testing: Comprehensive testing of edge cases in dynamic allocation scenarios
  3. Memory Usage Patterns: Analysis of peak memory usage under new allocation strategy
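
For item 3, a minimal illustrative tracker (again not llama.cpp code) that, if wired into the Vulkan buffer allocation and free paths, would record the high-water mark of device memory held under the new strategy:

```cpp
#include <atomic>
#include <cstddef>

// Illustrative peak-usage tracker; thread-safe so it can sit on any allocation path.
struct peak_tracker {
    std::atomic<size_t> current{0};
    std::atomic<size_t> peak{0};

    void on_alloc(size_t bytes) {
        const size_t now = current.fetch_add(bytes) + bytes;
        size_t prev = peak.load();
        // Raise the recorded peak if this allocation set a new high-water mark.
        while (now > prev && !peak.compare_exchange_weak(prev, now)) {
        }
    }

    void on_free(size_t bytes) {
        current.fetch_sub(bytes);
    }
};
```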

Overall Assessment

Change Impact Evaluation

Positive Aspects:

  • Removing the dryrun eliminates a small but consistent stretch of GPU idle time, yielding roughly 1-2% average token-generation improvement in the author's benchmarks
  • Changes are isolated to the Vulkan backend (ggml-vulkan.cpp), leaving CPU inference and core model operations untouched

Risk Factors:

  • Increased complexity in Vulkan memory management introduces potential performance variability
  • Dynamic allocation strategy may cause unpredictable latency patterns
  • More complex code paths increase maintenance burden for GPU-specific optimizations

Maintainability and Future Considerations

Code Quality: The PR maintains good separation of concerns by isolating changes to the Vulkan backend. However, the shift from predictable to dynamic resource allocation increases cognitive complexity for future maintainers.

Performance Predictability: While average performance improves, the new allocation strategy may introduce performance variability that could affect real-time inference applications.

Scalability: The dynamic allocation approach may not scale well under high-concurrency scenarios where multiple inference requests compete for GPU resources.

Recommendation: The changes are technically sound and provide measurable benefits. However, production deployment should include comprehensive monitoring of allocation patterns and latency distributions to ensure the performance improvements are sustained across diverse workloads without introducing unacceptable variability.

@DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a on October 29, 2025 at 12:13
@DajanaV added the dev-stale label (Stale dev environment — dashboard not accessible) on Oct 30, 2025
@DajanaV closed this on Oct 30, 2025
@DajanaV deleted the upstream-PR16826-branch_jeffbolznv-dryrun branch on October 30, 2025 at 15:25
@DajanaV deleted the branch main on October 30, 2025 at 15:25