UPSTREAM PR #17766: metal : attach residency sets to queue #437

Open
loci-dev wants to merge 3 commits into main from upstream-PR17766-branch_ggml-org-gg/idle-new

Conversation


@loci-dev loci-dev commented Dec 4, 2025

Mirrored from ggml-org/llama.cpp#17766

cont #11427
ref #10119

Something changed in macOS recently, because the fix from #11427 no longer works: the memory wiring/unwiring (a.k.a. throttling) after 1 second of being idle is back. Maybe this happened with the update to macOS Tahoe - not sure.

Here are the results on master:

make -j && ./bin/llama-idle -m ../models/llama-3.1-70b/ggml-model-f16.gguf
Details
0.06.147.450 I ggml_metal_init: allocating
0.06.147.477 I ggml_metal_init: found device: Apple M2 Ultra
0.06.147.483 I ggml_metal_init: picking default device: Apple M2 Ultra
0.06.147.486 I ggml_metal_init: use fusion         = true
0.06.147.486 I ggml_metal_init: use concurrency    = true
0.06.147.487 I ggml_metal_init: use graph optimize = true
0.06.147.505 I llama_context:        CPU  output buffer size =     0.49 MiB
0.06.151.959 I llama_kv_cache:      Metal KV buffer size =   160.00 MiB
0.06.159.411 I llama_kv_cache: size =  160.00 MiB (   512 cells,  80 layers,  1/1 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
0.06.160.621 I llama_context: Flash Attention was auto, set to enabled
0.06.170.950 I llama_context:      Metal compute buffer size =   282.50 MiB
0.06.170.952 I llama_context:        CPU compute buffer size =    17.01 MiB
0.06.170.952 I llama_context: graph nodes  = 2487
0.06.170.952 I llama_context: graph splits = 2
  - decode time:   193.05 ms
  - decode time:   192.43 ms
  - decode time:   192.13 ms
  - decode time:   192.62 ms
  - decode time:   193.71 ms
  - decode time:   192.02 ms
  - decode time:   191.88 ms
  - decode time:   193.06 ms
  - decode time:   192.33 ms
  - decode time:   191.80 ms
iters:   10, pause:   200 ms, avg decode time:   192.50 +/- 0.61 ms
  - decode time:   192.67 ms
  - decode time:   192.49 ms
  - decode time:   191.15 ms
  - decode time:   192.62 ms
  - decode time:   192.98 ms
  - decode time:   191.24 ms
  - decode time:   191.12 ms
  - decode time:   191.50 ms
  - decode time:   191.59 ms
  - decode time:   192.20 ms
iters:   10, pause:   400 ms, avg decode time:   191.95 +/- 0.71 ms
  - decode time:   191.18 ms
  - decode time:   191.54 ms
  - decode time:   190.83 ms
  - decode time:   191.12 ms
  - decode time:   190.08 ms
  - decode time:   190.78 ms
  - decode time:   191.07 ms
  - decode time:   191.04 ms
  - decode time:   196.06 ms
  - decode time:   192.10 ms
iters:   10, pause:   600 ms, avg decode time:   191.58 +/- 1.66 ms
  - decode time:   192.49 ms
  - decode time:   191.13 ms
  - decode time:   191.46 ms
  - decode time:   191.73 ms
  - decode time:   191.92 ms
  - decode time:   193.68 ms
  - decode time:   192.04 ms
  - decode time:   191.72 ms
  - decode time:   192.06 ms
  - decode time:   191.81 ms
iters:   10, pause:   800 ms, avg decode time:   192.01 +/- 0.69 ms
  - decode time:   193.58 ms
  - decode time:   194.51 ms
  - decode time:   192.86 ms
  - decode time:   433.47 ms
  - decode time:   190.99 ms
  - decode time:   193.74 ms
  - decode time:   199.78 ms
  - decode time:   190.71 ms
  - decode time:   191.84 ms
  - decode time:   193.31 ms
iters:   10, pause:  1000 ms, avg decode time:   217.48 +/- 75.93 ms
  - decode time:   412.50 ms
  - decode time:   309.28 ms
  - decode time:   360.40 ms
  - decode time:   387.33 ms
  - decode time:   370.75 ms
  - decode time:   376.98 ms
  - decode time:  1377.47 ms
  - decode time:  1300.97 ms
  - decode time:  1298.22 ms
  - decode time:  1327.39 ms
iters:   10, pause:  1200 ms, avg decode time:   752.13 +/- 495.04 ms
  - decode time:  1067.61 ms
  - decode time:  1125.08 ms
  - decode time:  1119.94 ms
  - decode time:  1315.78 ms
  - decode time:  1087.12 ms
  - decode time:  1091.60 ms
  - decode time:  1087.27 ms
  - decode time:  1084.48 ms
  - decode time:  1087.62 ms
  - decode time:  1087.20 ms
iters:   10, pause:  1400 ms, avg decode time:  1115.37 +/- 72.44 ms
  - decode time:  1598.81 ms
  - decode time:  1659.83 ms
  - decode time:  1704.90 ms
  - decode time:  1576.85 ms
  - decode time:  1682.92 ms
  - decode time:  1660.09 ms
  - decode time:  1685.85 ms
  - decode time:  1616.58 ms
  - decode time:  1660.40 ms
  - decode time:  1683.09 ms
iters:   10, pause:  1600 ms, avg decode time:  1652.93 +/- 41.88 ms
  - decode time:  1451.54 ms
  - decode time:  1429.19 ms
  - decode time:  1448.97 ms
  - decode time:  1472.25 ms
  - decode time:  1437.39 ms
  - decode time:  1455.94 ms
  - decode time:  1419.29 ms
  - decode time:  1471.90 ms
  - decode time:  1430.76 ms
  - decode time:  1461.42 ms
iters:   10, pause:  1800 ms, avg decode time:  1447.86 +/- 18.27 ms
2.36.527.021 I ggml_metal_free: deallocating

And here are the results with this PR:

Details
0.06.159.343 I ggml_metal_init: allocating
0.06.159.377 I ggml_metal_init: found device: Apple M2 Ultra
0.06.159.383 I ggml_metal_init: picking default device: Apple M2 Ultra
0.06.159.386 I ggml_metal_init: use fusion         = true
0.06.159.386 I ggml_metal_init: use concurrency    = true
0.06.159.387 I ggml_metal_init: use graph optimize = true
0.06.159.403 I llama_context:        CPU  output buffer size =     0.49 MiB
0.06.164.393 I llama_kv_cache:      Metal KV buffer size =   160.00 MiB
0.06.171.847 I llama_kv_cache: size =  160.00 MiB (   512 cells,  80 layers,  1/1 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
0.06.173.129 I llama_context: Flash Attention was auto, set to enabled
0.06.183.473 I llama_context:      Metal compute buffer size =   282.50 MiB
0.06.183.482 I llama_context:        CPU compute buffer size =    17.01 MiB
0.06.183.482 I llama_context: graph nodes  = 2487
0.06.183.484 I llama_context: graph splits = 2
  - decode time:   191.36 ms
  - decode time:   191.88 ms
  - decode time:   192.17 ms
  - decode time:   190.04 ms
  - decode time:   191.18 ms
  - decode time:   191.01 ms
  - decode time:   192.39 ms
  - decode time:   192.18 ms
  - decode time:   193.01 ms
  - decode time:   192.78 ms
iters:   10, pause:   200 ms, avg decode time:   191.80 +/- 0.90 ms
  - decode time:   191.12 ms
  - decode time:   191.23 ms
  - decode time:   191.20 ms
  - decode time:   192.33 ms
  - decode time:   190.71 ms
  - decode time:   192.59 ms
  - decode time:   192.66 ms
  - decode time:   192.14 ms
  - decode time:   190.35 ms
  - decode time:   191.60 ms
iters:   10, pause:   400 ms, avg decode time:   191.59 +/- 0.80 ms
  - decode time:   193.46 ms
  - decode time:   194.52 ms
  - decode time:   191.91 ms
  - decode time:   190.64 ms
  - decode time:   194.11 ms
  - decode time:   191.17 ms
  - decode time:   192.81 ms
  - decode time:   192.62 ms
  - decode time:   192.73 ms
  - decode time:   191.18 ms
iters:   10, pause:   600 ms, avg decode time:   192.51 +/- 1.29 ms
  - decode time:   199.77 ms
  - decode time:   195.51 ms
  - decode time:   192.30 ms
  - decode time:   190.35 ms
  - decode time:   193.33 ms
  - decode time:   190.30 ms
  - decode time:   193.18 ms
  - decode time:   192.25 ms
  - decode time:   192.48 ms
  - decode time:   193.67 ms
iters:   10, pause:   800 ms, avg decode time:   193.31 +/- 2.74 ms
  - decode time:   193.74 ms
  - decode time:   194.80 ms
  - decode time:   192.59 ms
  - decode time:   192.08 ms
  - decode time:   194.69 ms
  - decode time:   191.29 ms
  - decode time:   191.55 ms
  - decode time:   196.23 ms
  - decode time:   191.95 ms
  - decode time:   192.46 ms
iters:   10, pause:  1000 ms, avg decode time:   193.14 +/- 1.64 ms
  - decode time:   221.25 ms
  - decode time:   222.57 ms
  - decode time:   219.05 ms
  - decode time:   218.60 ms
  - decode time:   191.75 ms
  - decode time:   223.81 ms
  - decode time:   217.56 ms
  - decode time:   192.10 ms
  - decode time:   219.01 ms
  - decode time:   217.57 ms
iters:   10, pause:  1200 ms, avg decode time:   214.33 +/- 11.99 ms
  - decode time:   227.78 ms
  - decode time:   218.82 ms
  - decode time:   212.44 ms
  - decode time:   222.10 ms
  - decode time:   225.00 ms
  - decode time:   229.71 ms
  - decode time:   226.76 ms
  - decode time:   218.06 ms
  - decode time:   222.88 ms
  - decode time:   220.53 ms
iters:   10, pause:  1400 ms, avg decode time:   222.41 +/- 5.19 ms
  - decode time:   226.02 ms
  - decode time:   227.07 ms
  - decode time:   224.31 ms
  - decode time:   221.87 ms
  - decode time:   224.10 ms
  - decode time:   226.71 ms
  - decode time:   225.18 ms
  - decode time:   228.20 ms
  - decode time:   224.98 ms
  - decode time:   221.39 ms
iters:   10, pause:  1600 ms, avg decode time:   224.98 +/- 2.18 ms
  - decode time:   225.10 ms
  - decode time:   225.45 ms
  - decode time:   229.04 ms
  - decode time:   229.66 ms
  - decode time:   229.84 ms
  - decode time:   221.65 ms
  - decode time:   232.24 ms
  - decode time:   229.59 ms
  - decode time:   227.63 ms
  - decode time:   227.49 ms
iters:   10, pause:  1800 ms, avg decode time:   227.77 +/- 3.03 ms
1.55.608.188 I ggml_metal_free: deallocating

It seems that attaching the residency sets to the Metal queue mostly eliminates the unwiring of the memory. Every now and then it still seems to occur, though - not sure if this was also the case before on macOS Sequoia.
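The size of the throttling penalty, and how much of it this PR recovers, can be read straight off the two runs above. A quick back-of-the-envelope check (plain Python arithmetic on the reported averages; the 200 ms row on master serves as the unthrottled baseline):

```python
# Average decode times taken from the logs above (Apple M2 Ultra, 70B f16).
baseline_ms = 192.50  # master, 200 ms pause: throttling not yet triggered
master_ms = 1652.93   # master, 1600 ms pause: memory unwired while idle
pr_ms = 224.98        # this PR, 1600 ms pause: residency sets on the queue

slowdown_master = master_ms / baseline_ms
slowdown_pr = pr_ms / baseline_ms

print(f"master slowdown at 1600 ms pause: {slowdown_master:.1f}x")
print(f"PR slowdown at 1600 ms pause:     {slowdown_pr:.2f}x")
# master slowdown at 1600 ms pause: 8.6x
# PR slowdown at 1600 ms pause:     1.17x
```

So master pays roughly an 8.6x decode-time penalty after long idle pauses, while with the PR the residual cost is about 17% - consistent with the occasional re-wiring still observed.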


loci-review bot commented Dec 4, 2025

Performance Analysis Summary: PR #437

Overview

This PR introduces Metal GPU memory residency management changes to address macOS 15+ memory throttling behavior. The modifications add 6 lines of code across the Metal backend buffer lifecycle, attaching residency sets to command queues during buffer initialization and removing them during cleanup.

Performance Metrics Analysis

Based on the comprehensive performance analysis conducted between versions 40aed7b4-fc7b-47a3-8cd9-582e6b8400fb (target) and 24867642-2217-4c9e-b836-bb2f6ee264ff (baseline), no measurable performance changes were detected at the function level or binary level:

  • Response Time Changes: 0 ns across all measured functions
  • Throughput Time Changes: 0 ns across all measured functions
  • Power Consumption Changes: 0 nJ across all 16 analyzed binaries

The analysis system found no functions with performance deltas, indicating the compiled binaries are functionally identical at the measurement granularity.

Code Changes

The PR modifies ggml-metal-device.m with three targeted additions:

  • ggml_metal_buffer_init: Adds [res->queue addResidencySet:res->rset]
  • ggml_metal_buffer_map: Adds [res->queue addResidencySet:res->rset]
  • ggml_metal_buffer_free: Adds [buf->queue removeResidencySet:buf->rset]

Additionally, a new benchmark tool llama-idle was added to validate the fix effectiveness under idle GPU conditions.
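The measurement pattern behind such an idle benchmark is a simple sleep-decode-time loop per pause length. A minimal sketch of that pattern (Python, with `decode` as a hypothetical stand-in for one `llama_decode` call; this is not the actual `llama-idle` source):

```python
import statistics
import time

def summarize(times_ms):
    """Average and sample standard deviation, as the log lines report them."""
    return statistics.mean(times_ms), statistics.stdev(times_ms)

def run_idle_bench(decode, pauses_ms=range(200, 2000, 200), iters=10):
    """For each idle pause length, sleep, run one decode, and time it.

    Decode times that grow with the pause length (past ~1 s) indicate the
    GPU memory was unwired while the process sat idle.
    """
    for pause in pauses_ms:
        times = []
        for _ in range(iters):
            time.sleep(pause / 1000.0)                    # simulate idleness
            t0 = time.perf_counter()
            decode()                                      # one decode step
            times.append((time.perf_counter() - t0) * 1000.0)
        avg, dev = summarize(times)
        print(f"iters: {iters:4d}, pause: {pause:5d} ms, "
              f"avg decode time: {avg:8.2f} +/- {dev:.2f} ms")
```

The key design point is that the pause happens before each timed decode, so every sample measures a cold start after exactly that much idle time.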

Inference Impact Assessment

Tokens Per Second: No impact detected. The core inference functions show no performance changes:

  • llama_decode: 0 ns change in response time and throughput
  • llama_encode: 0 ns change in response time and throughput
  • llama_tokenize: 0 ns change in response time and throughput

Since these functions maintain identical performance characteristics, tokens per second remains unchanged for the measured workloads.

Power Consumption: All binaries show stable power consumption with 0 nJ change, including the most computationally intensive components: llama-tts (253,822 nJ), llama-cvector-generator (249,105 nJ), llama-run (218,706 nJ), and libllama.so (194,028 nJ).

The changes operate at the Metal API level during buffer lifecycle events, not on the inference hot path, explaining the absence of measurable overhead in steady-state execution.

loci-dev force-pushed the main branch 25 times, most recently from 84f6117 to 91eb894 on December 7, 2025 at 22:08
loci-dev force-pushed the main branch 30 times, most recently from 943ad50 to 87d815e on December 13, 2025 at 08:10