Misc. bug: HIP backend performs poorly on AMD Ryzen AI MAX 395 (Strix Halo gfx1151) #13565

### Name and Version

```
❯ build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
version: 5392 (c753d7be)
built with cc (GCC) 15.0.1 20250418 (Red Hat 15.0.1-0) for x86_64-redhat-linux
```

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-bench

### Command line

```shell
llama.cpp-cpu/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf
llama.cpp-vulkan/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf
llama.cpp-hip/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf
```

### Problem description & steps to reproduce

Recently I've been testing a Strix Halo (gfx1151) system and was a bit surprised by how poorly the HIP backend ran. All tests were run with `llama-bench` built on HEAD (b5392) with the standard [TheBloke/Llama-2-7B-GGUF](https://huggingface.co/TheBloke/Llama-2-7B-GGUF) (Q4_0):

| Backend|pp512 (t/s)|tg128 (t/s)|
|:-|:-|:-|
|CPU|304.42 ± 2.05|28.65 ± 0.03|
|HIP|348.62 ± 0.35|48.70 ± 0.02|
|Vulkan|881.38 ± 2.11|52.82 ± 0.04|
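
To put the table in relative terms, here is a small sketch that computes backend-to-backend ratios from the mean values above. The `results` dictionary and the `ratio` helper are names made up for illustration, and the ± spreads are ignored.

```python
# Mean throughput (t/s) copied from the table above; the ± spread is ignored.
results = {
    "CPU": {"pp512": 304.42, "tg128": 28.65},
    "HIP": {"pp512": 348.62, "tg128": 48.70},
    "Vulkan": {"pp512": 881.38, "tg128": 52.82},
}


def ratio(test: str, a: str, b: str) -> float:
    """Return how many times faster backend `a` is than backend `b` on `test`."""
    return results[a][test] / results[b][test]


for test in ("pp512", "tg128"):
    print(
        f"{test}: HIP/CPU = {ratio(test, 'HIP', 'CPU'):.2f}x, "
        f"Vulkan/HIP = {ratio(test, 'Vulkan', 'HIP'):.2f}x"
    )
```

On these numbers, HIP is only about 1.15x faster than CPU for pp512, while Vulkan is about 2.5x faster than HIP. For tg128 the two GPU backends are within roughly 10% of each other, so the gap is specific to prompt processing.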

For prompt processing, the HIP backend's efficiency (pp512 tok/s per FP16 TFLOP) is far below what you'd expect based on other RDNA3 architectures:

- `gfx1103` Radeon 780M iGPU gets 14.51 tok/TFLOP. At that efficiency you'd expect about 850 tok/s, roughly what the Vulkan backend delivers.
- `gfx1100` Radeon 7900 XTX gets 25.12 tok/TFLOP. At that efficiency you'd expect almost 1500 tok/s, nearly double what the Vulkan backend delivers and >4X what the current HIP backend delivers.
- HIP pp512 barely beats out CPU backend numbers. I don't have an explanation for this.
- Just for a reference of how bad the HIP performance is: an 18CU M3 Pro has \~12.8 FP16 TFLOPS (4.6X less compute than Strix Halo) and delivers about the same pp512. Lunar Lake Arc 140V has 32 FP16 TFLOPS (roughly 1/2 of Strix Halo) and has a pp512 of 657 tok/s (1.9X faster).
- With the Vulkan backend, pp512 is about the same as an M4 Max and tg128 is about equivalent to an M4 Pro.
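
The efficiency arithmetic behind these bullets can be sketched as follows. Note the assumption: the issue never states Strix Halo's peak FP16 throughput directly, so it is back-computed here from the M3 Pro comparison (12.8 TFLOPS × 4.6 ≈ 58.9 TFLOPS). The constant and function names are illustrative only.

```python
# ASSUMPTION: peak FP16 TFLOPS for Strix Halo is inferred from the statement
# that the M3 Pro (~12.8 TFLOPS) has 4.6X less compute; it is not a measured value.
M3_PRO_TFLOPS = 12.8
STRIX_HALO_TFLOPS = M3_PRO_TFLOPS * 4.6  # about 58.9

# Measured pp512 means (t/s) from the benchmark table.
pp512 = {"HIP": 348.62, "Vulkan": 881.38}

# tok/TFLOP efficiencies quoted above for other RDNA3 parts.
reference_efficiency = {"gfx1103": 14.51, "gfx1100": 25.12}


def efficiency(tokens_per_s: float, tflops: float) -> float:
    """Prompt-processing efficiency in tok/s per FP16 TFLOP."""
    return tokens_per_s / tflops


def expected_pp512(tok_per_tflop: float, tflops: float) -> float:
    """pp512 a device with `tflops` would reach at the given efficiency."""
    return tok_per_tflop * tflops


for backend, tps in pp512.items():
    eff = efficiency(tps, STRIX_HALO_TFLOPS)
    print(f"{backend}: {eff:.2f} tok/TFLOP")

for arch, eff in reference_efficiency.items():
    target = expected_pp512(eff, STRIX_HALO_TFLOPS)
    print(f"at {arch} efficiency: {target:.0f} tok/s")
```

Under that assumption, HIP lands near 5.9 tok/TFLOP and Vulkan near 15.0 tok/TFLOP, while the `gfx1103` and `gfx1100` efficiencies project to roughly 854 and 1479 tok/s, consistent with the "about 850" and "almost 1500" figures above.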

With monitoring, I've confirmed that both HIP and Vulkan reach the max graphics clock. The system is running Linux 6.15.0-0.rc3, so it should be up to date with the latest AMDGPU drivers.

These results are from a standard `llama-bench` run. I've tried `-fa 1` and a rocWMMA build, but neither makes much difference, so I've excluded them for clarity.

One interesting observation that may help track down a potential regression: when I compile with gfx1100 support and run with `HSA_OVERRIDE_GFX_VERSION=11.0.0`, pp512 jumps about 1.7X to 598.84 ± 1.41. This eventually leads to MES/kernel errors, so it is obviously not recommended for use, but it might help in tracking down the issue.

### First Bad Commit

_No response_

### Relevant log output

```shell

```
