Misc. bug: Under-Performance of Linux ROCm 7.2 binaries #19984
Description
Name and Version
b8180
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench, llama-server, llama-cli
Command line
$ ROCBLAS_USE_HIPBLASLT=0 ./llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf
Problem description & steps to reproduce
The test platform is a Strix Halo laptop (Zbook Ultra G1a, 128 GB unified memory) running Ubuntu 24.04 with kernel 6.19.4 and ROCm 7.2.
When running a self-compiled version of llama.cpp b8180 with the stated command line, I got the following results:
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d32768 | 343.99 ± 1.68 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d32768 | 24.21 ± 0.41 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d65536 | 251.15 ± 8.99 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d65536 | 22.09 ± 0.07 |
When I run the same benchmark using the ROCm 7.2 binaries downloaded directly from the Releases page, this is what I get:
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d32768 | 93.26 ± 1.15 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d32768 | 22.42 ± 0.32 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d65536 | 49.72 ± 1.26 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d65536 | 18.38 ± 0.33 |
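To put the gap in numbers, here is a quick check computing the speedup of the self-compiled build over the release binaries (the pp512 t/s values are copied from the two tables above):

```shell
# Ratio of self-compiled vs. release-binary prompt-processing throughput
# (t/s values taken from the tables above)
awk 'BEGIN {
    printf "pp512 @ d32768: %.2fx faster self-compiled\n", 343.99 / 93.26
    printf "pp512 @ d65536: %.2fx faster self-compiled\n", 251.15 / 49.72
}'
```

That is roughly a 3.7x gap at d32768, widening to over 5x at d65536.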
A known performance regression previously observed in ROCm 7+ builds (compared to ROCm 6.4.4) has been resolved via a workaround.
That issue was caused by a compiler regression (llvm/llvm-project#147700) affecting loop-unrolling thresholds. I have applied the workaround (`-mllvm --amdgpu-unroll-threshold-local=600`) in my builds, restoring full performance.
This workaround will be removed once the upstream fix lands. For details, see kyuz0/amd-strix-halo-toolboxes#45.
Still, I do not think it explains everything. These are my compilation flags:
cmake -S . \
    -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGPU_TARGETS=gfx1151 \
    -DGGML_HIP_UMA=OFF \
    -DGGML_HIP_ROCWMMA_FATTN=OFF \
    -DGGML_HIP_GRAPHS=ON \
    -DROCM_PATH=/opt/rocm \
    -DHIP_PLATFORM=amd \
    -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" \
    -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON \
    -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_CUDA_FORCE_MMQ=OFF \
    -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DGGML_NATIVE=ON \
    -DGGML_OPENMP=ON \
    -DCMAKE_CXX_FLAGS="-I$HIP_INCLUDE_PATH" \
    -DCMAKE_C_FLAGS="-I$HIP_INCLUDE_PATH" \
    -DGGML_RPC=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DCMAKE_BUILD_TYPE=Release
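For completeness, a minimal sketch of the build step that would follow the configure command above (the report does not show this step; it assumes the default `build/` directory set by `-B build`):

```shell
# Compile all targets in parallel; Release mode is already set at configure time
cmake --build build -j"$(nproc)"
```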
I hope this can help. Thanks a lot for this great software :)
Kind Regards
Franky
First Bad Commit
I've observed this under-performance since the first release build with ROCm 7.2 binaries.
Relevant log output
Logs
Self-Compiled llama.cpp:
ROCBLAS_USE_HIPBLASLT=0 llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d32768 | 343.99 ± 1.68 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d32768 | 24.21 ± 0.41 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d65536 | 251.15 ± 8.99 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d65536 | 22.09 ± 0.07 |
build: d979f2b17 (8180)
Release b8180 ROCm 7.2 binary:
ROCBLAS_USE_HIPBLASLT=0 ./llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
load_backend: loaded ROCm backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-hip.so
load_backend: loaded RPC backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-rpc.so
load_backend: loaded CPU backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-cpu-zen4.so
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d32768 | 93.26 ± 1.15 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d32768 | 22.42 ± 0.32 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d65536 | 49.72 ± 1.26 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d65536 | 18.38 ± 0.33 |
build: d979f2b17 (8180)