Misc. bug: Under-Performance of Linux ROCm 7.2 binaries #19984

@frankyriventek

Description

Name and Version

b8180

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench, llama-server, llama-cli

Command line

$ ROCBLAS_USE_HIPBLASLT=0 ./llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf

Problem description & steps to reproduce

The test platform is a Strix Halo laptop (ZBook Ultra G1a, 128 GB unified memory) running Ubuntu 24.04 with kernel 6.19.4 and ROCm 7.2.

When running a self-compiled llama.cpp b8180 with the command line above, I got the following results:

| model                  | size      | params  | backend | ngl | n_batch | n_ubatch | fa | test           | t/s           |
| ---------------------- | --------: | ------: | ------- | --: | ------: | -------: | -: | -------------: | ------------: |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm    |  99 |    1024 |     2048 |  1 | pp512 @ d32768 | 343.99 ± 1.68 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm    |  99 |    1024 |     2048 |  1 | tg128 @ d32768 |  24.21 ± 0.41 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm    |  99 |    1024 |     2048 |  1 | pp512 @ d65536 | 251.15 ± 8.99 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm    |  99 |    1024 |     2048 |  1 | tg128 @ d65536 |  22.09 ± 0.07 |

When I run the same benchmark with the ROCm 7.2 binaries downloaded directly from the Releases page, this is what I get:

| model                  | size      | params  | backend | ngl | n_batch | n_ubatch | fa | test           | t/s          |
| ---------------------- | --------: | ------: | ------- | --: | ------: | -------: | -: | -------------: | -----------: |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm    |  99 |    1024 |     2048 |  1 | pp512 @ d32768 | 93.26 ± 1.15 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm    |  99 |    1024 |     2048 |  1 | tg128 @ d32768 | 22.42 ± 0.32 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm    |  99 |    1024 |     2048 |  1 | pp512 @ d65536 | 49.72 ± 1.26 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm    |  99 |    1024 |     2048 |  1 | tg128 @ d65536 | 18.38 ± 0.33 |
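To put the gap in numbers, a quick sketch using the t/s figures from the two tables above (the dictionary keys are just shorthand labels) shows prompt processing is roughly 3.7x slower at d32768 and about 5x slower at d65536 with the release binaries, while token generation is only mildly affected:

```python
# Ratio of self-compiled vs. release-binary throughput, from the tables above.
self_built = {"pp512 @ d32768": 343.99, "tg128 @ d32768": 24.21,
              "pp512 @ d65536": 251.15, "tg128 @ d65536": 22.09}
release_bin = {"pp512 @ d32768": 93.26, "tg128 @ d32768": 22.42,
               "pp512 @ d65536": 49.72, "tg128 @ d65536": 18.38}

for test, fast in self_built.items():
    ratio = fast / release_bin[test]
    print(f"{test}: self-compiled is {ratio:.2f}x the release binary")
```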

There is a known performance regression previously observed in ROCm 7+ builds (compared to ROCm 6.4.4) that has been resolved via a workaround.

That issue was caused by a compiler regression (llvm/llvm-project#147700) affecting loop-unrolling thresholds. I applied the workaround (-mllvm --amdgpu-unroll-threshold-local=600) in my builds, restoring full performance.

This workaround will be removed once the upstream fix lands. For details, see the issue: kyuz0/amd-strix-halo-toolboxes#45
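For anyone building from source who just wants the fix, the workaround reduces to one extra CMAKE_HIP_FLAGS entry; a stripped-down configure sketch (assuming a llama.cpp checkout and ROCm installed under /opt/rocm — adjust paths and targets for your setup):

```shell
# Minimal HIP configure applying the unroll-threshold workaround for gfx1151.
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```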

Still, I do not think it explains everything. These are my compilation flags:

cmake -S . \
    -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGPU_TARGETS=gfx1151 \
    -DGGML_HIP_UMA=OFF \
    -DGGML_HIP_ROCWMMA_FATTN=OFF \
    -DGGML_HIP_GRAPHS=ON \
    -DROCM_PATH=/opt/rocm \
    -DHIP_PLATFORM=amd \
    -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" \
    -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON \
    -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_CUDA_FORCE_MMQ=OFF \
    -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DGGML_NATIVE=ON \
    -DGGML_OPENMP=ON \
    -DCMAKE_CXX_FLAGS="-I$HIP_INCLUDE_PATH" \
    -DCMAKE_C_FLAGS="-I$HIP_INCLUDE_PATH" \
    -DGGML_RPC=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DCMAKE_BUILD_TYPE=Release

I hope this can help. Thanks a lot for this great software :)

Kind Regards
Franky

First Bad Commit

I've observed this under-performance since the first ROCm 7.2 release build.

Relevant log output

Logs
Self-Compiled llama.cpp:
ROCBLAS_USE_HIPBLASLT=0 llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf  
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       |  99 |    1024 |     2048 |  1 |  pp512 @ d32768 |        343.99 ± 1.68 |
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       |  99 |    1024 |     2048 |  1 |  tg128 @ d32768 |         24.21 ± 0.41 |
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       |  99 |    1024 |     2048 |  1 |  pp512 @ d65536 |        251.15 ± 8.99 |
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       |  99 |    1024 |     2048 |  1 |  tg128 @ d65536 |         22.09 ± 0.07 |

build: d979f2b17 (8180)

Release b8180 ROCm 7.2 binary:
ROCBLAS_USE_HIPBLASLT=0 ./llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf  
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
load_backend: loaded ROCm backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-hip.so
load_backend: loaded RPC backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-rpc.so
load_backend: loaded CPU backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-cpu-zen4.so
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       |  99 |    1024 |     2048 |  1 |  pp512 @ d32768 |         93.26 ± 1.15 |
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       |  99 |    1024 |     2048 |  1 |  tg128 @ d32768 |         22.42 ± 0.32 |
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       |  99 |    1024 |     2048 |  1 |  pp512 @ d65536 |         49.72 ± 1.26 |
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       |  99 |    1024 |     2048 |  1 |  tg128 @ d65536 |         18.38 ± 0.33 |

build: d979f2b17 (8180)
