UPSTREAM PR #17766: metal : attach residency sets to queue #437
Performance Analysis Summary: PR #437

Overview
This PR introduces Metal GPU memory residency management changes to address macOS 15+ memory throttling behavior. The modifications add 6 lines of code across the Metal backend buffer lifecycle, attaching residency sets to command queues during buffer initialization and removing them during cleanup.

Performance Metrics Analysis
Based on the comprehensive performance analysis conducted between versions
The analysis system found no functions with performance deltas, indicating the compiled binaries are functionally identical at the measurement granularity.

Code Changes
The PR modifies
Additionally, a new benchmark tool

Inference Impact Assessment
Tokens Per Second: No impact detected. The core inference functions show no performance changes:
Since these functions maintain identical performance characteristics, tokens per second remains unchanged for the measured workloads.

Power Consumption: All binaries show stable power consumption with 0 nJ change, including the most computationally intensive components:
The changes operate at the Metal API level during buffer lifecycle events, not on the inference hot path, explaining the absence of measurable overhead in steady-state execution.
Mirrored from ggml-org/llama.cpp#17766
cont #11427
ref #10119
Something changed in macOS recently: the fix from #11427 no longer works, and the memory wiring/unwiring (a.k.a. throttling) after 1 second of being idle is back. This may have happened with the update to macOS Tahoe, but I'm not sure.
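One way to observe this kind of idle-related slowdown is to time the same piece of work after idle pauses of different lengths and compare the results. The following is only a minimal sketch under that assumption: `measure_after_idle` is a hypothetical helper written for illustration, not code from this PR or from llama.cpp, and the workload is whatever callable the caller supplies.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (illustration only, not part of llama.cpp):
// for each pause, sleep for that many milliseconds, then time one call of `work`.
// Returns one wall-clock duration in milliseconds per pause, in the same order.
static std::vector<double> measure_after_idle(const std::function<void()> & work,
                                              const std::vector<int> & pauses_ms) {
    std::vector<double> elapsed_ms;
    elapsed_ms.reserve(pauses_ms.size());

    for (int pause_ms : pauses_ms) {
        assert(pause_ms >= 0);

        // stay idle so the OS has a chance to unwire the memory
        std::this_thread::sleep_for(std::chrono::milliseconds(pause_ms));

        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();

        elapsed_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    return elapsed_ms;
}
```

If throttling kicks in after roughly 1 second of idling, the timings for pauses above that threshold would be expected to be noticeably larger than the ones for shorter pauses. With `work` set to a single decode step and pauses such as 0, 500, 1000 and 2000 ms, the returned vector can be printed side by side for the two builds.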
Here are the results on master:
make -j && ./bin/llama-idle -m ../models/llama-3.1-70b/ggml-model-f16.gguf
Details
And here are the results with this PR:
Details
It seems that attaching the residency sets to the Metal queue mostly eliminates the unwiring of the memory. Every now and then it still seems to occur, though. I'm not sure whether this was also the case before on macOS Sequoia.
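The lifecycle described here (attach the residency set to the queue when a buffer is created, detach it when the buffer is freed) can be modeled with a small mock. This sketch does not touch Metal at all: `mock_queue`, `mock_residency_set` and `mock_buffer` are hypothetical stand-ins invented for illustration. On macOS 15+ the corresponding real calls are `addResidencySet:` and `removeResidencySet:` on `MTLCommandQueue`. How the actual backend wires them in is not shown in this page.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a residency set (NOT the Metal type).
struct mock_residency_set {
    int id;
};

// Hypothetical stand-in for a command queue that tracks attached residency sets.
struct mock_queue {
    std::vector<const mock_residency_set *> attached;

    void add_residency_set(const mock_residency_set * s) {
        attached.push_back(s);
    }

    void remove_residency_set(const mock_residency_set * s) {
        attached.erase(std::remove(attached.begin(), attached.end(), s), attached.end());
    }
};

// Buffer that attaches its residency set on init and detaches it on cleanup.
struct mock_buffer {
    mock_queue & queue;
    mock_residency_set rset;

    mock_buffer(mock_queue & q, int id) : queue(q), rset{id} {
        queue.add_residency_set(&rset);    // buffer initialization
    }

    ~mock_buffer() {
        queue.remove_residency_set(&rset); // buffer cleanup
    }

    // non-copyable: the queue holds a pointer to this buffer's residency set
    mock_buffer(const mock_buffer &) = delete;
    mock_buffer & operator=(const mock_buffer &) = delete;
};
```

Tying the attach and detach calls to the constructor and destructor keeps the two symmetric, so a queue never ends up holding a set whose buffer has already been freed. That choice belongs to this sketch only.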