Description
Name and Version
❯ build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
version: 5392 (c753d7be)
built with cc (GCC) 15.0.1 20250418 (Red Hat 15.0.1-0) for x86_64-redhat-linux
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench
Command line
llama.cpp-cpu/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf
llama.cpp-vulkan/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf
llama.cpp-hip/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf
Problem description & steps to reproduce
Recently I've been testing a Strix Halo (gfx1151) system and was a bit surprised by how poorly the HIP backend ran. All tests were run with llama-bench built on HEAD (b5392) with the standard TheBloke/Llama-2-7B-GGUF (Q4_0):
| Backend | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| CPU | 304.42 ± 2.05 | 28.65 ± 0.03 |
| HIP | 348.62 ± 0.35 | 48.70 ± 0.02 |
| Vulkan | 881.38 ± 2.11 | 52.82 ± 0.04 |
The HIP version performs far below what you'd expect in terms of tok/TFLOPS efficiency for prompt processing vs other RDNA3 architectures:
- gfx1103 Radeon 780M iGPU gets 14.51 tok/TFLOP. At that efficiency you'd expect about the 850 tok/s that the Vulkan backend delivers.
- gfx1100 Radeon 7900 XTX gets 25.12 tok/TFLOP. At that efficiency you'd expect almost 1500 tok/s, about 1.7X what the Vulkan backend delivers, and >4X what the current HIP backend delivers.
- HIP pp512 barely beats out CPU backend numbers. I don't have an explanation for this.
- Just for reference on how bad the HIP performance is: an 18CU M3 Pro has ~12.8 FP16 TFLOPS (4.6X less compute than Strix Halo) and delivers about the same pp512. Lunar Lake Arc 140V has 32 FP16 TFLOPS (roughly 1/2 Strix Halo) and has a pp512 of 657 tok/s (1.9X faster).
- With the Vulkan backend pp512 is about the same as an M4 Max and tg128 is about equivalent to an M4 Pro
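For anyone wanting to check the efficiency arithmetic above, here is a minimal sketch. The Strix Halo peak FP16 figure is not stated directly in this report; the script infers it from the "M3 Pro has ~12.8 FP16 TFLOPS (4.6X less compute)" comparison, so treat that value and the helper names as assumptions.

```python
# Figures copied from the issue text.
M3_PRO_FP16_TFLOPS = 12.8   # "~12.8 FP16 TFLOPS"
STRIX_HALO_RATIO = 4.6      # "4.6X less compute than Strix Halo"

# Assumption: Strix Halo peak FP16 throughput is inferred from the two
# figures above rather than taken from a spec sheet.
strix_halo_tflops = M3_PRO_FP16_TFLOPS * STRIX_HALO_RATIO

# Measured pp512 (tok/s) for the two GPU backends on Strix Halo.
pp512 = {"HIP": 348.62, "Vulkan": 881.38}

# tok/TFLOP reported in the issue for other RDNA3 parts.
reference_efficiency = {
    "gfx1103 (Radeon 780M)": 14.51,
    "gfx1100 (Radeon 7900 XTX)": 25.12,
}


def tok_per_tflop(tokens_per_second, peak_tflops):
    """Prompt-processing efficiency: tokens per second per peak TFLOP."""
    return tokens_per_second / peak_tflops


def expected_pp512(efficiency, peak_tflops):
    """pp512 this GPU would reach at another part's tok/TFLOP efficiency."""
    return efficiency * peak_tflops


measured = {name: tok_per_tflop(v, strix_halo_tflops) for name, v in pp512.items()}
expected = {
    name: expected_pp512(eff, strix_halo_tflops)
    for name, eff in reference_efficiency.items()
}

print(f"Inferred Strix Halo peak: {strix_halo_tflops:.1f} FP16 TFLOPS")
for name, eff in measured.items():
    print(f"{name}: {eff:.2f} tok/TFLOP measured")
for name, tok_s in expected.items():
    print(f"At {name} efficiency: {tok_s:.0f} tok/s expected")
```

Under that inferred peak, the HIP backend lands at roughly 6 tok/TFLOP, well below both reference parts, while Vulkan is close to the gfx1103 figure.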
With monitoring, I've confirmed that both HIP and Vulkan reach the max graphics clock. This is a system running Linux 6.15.0-0.rc3, so should be up to date with the latest AMDGPU drivers.
These results are from a standard llama-bench run. I've tried -fa 1 and a rocWMMA build, but they don't make much difference, so I've excluded them for clarity.
One interesting observation that may help track down a potential regression: when I compile with gfx1100 support and run with HSA_OVERRIDE_GFX_VERSION=11.0.0, pp512 rises about 1.7X to 598.84 ± 1.41. This eventually leads to MES/kernel errors, so it is obviously not recommended for use; it's just an observation that might help in tracking down the issue.
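To put the override result in context, the following sketch compares it against the native HIP and Vulkan pp512 numbers from the table. It uses only figures reported in this issue; the variable names are illustrative.

```python
# pp512 results (tok/s) reported in this issue.
hip_native = 348.62    # HIP build targeting gfx1151
hip_override = 598.84  # gfx1100 build run with HSA_OVERRIDE_GFX_VERSION=11.0.0
vulkan = 881.38        # Vulkan backend

# How much faster the gfx1100 code path is than the native gfx1151 one.
speedup_over_native = hip_override / hip_native

# How much of the Vulkan throughput the override run recovers.
fraction_of_vulkan = hip_override / vulkan

print(f"Override vs native HIP: {speedup_over_native:.2f}x")
print(f"Override vs Vulkan: {fraction_of_vulkan:.0%} of Vulkan pp512")
```

Even with the override, HIP stays well short of Vulkan, so the gfx1151 kernel selection may explain only part of the gap.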
First Bad Commit
No response