Misc. bug: Under-Performance of Linux ROCm 7.2 binaries #19984
Description
Name and Version
b8180
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench, llama-server, llama-cli
Command line
$ ROCBLAS_USE_HIPBLASLT=0 ./llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf
Problem description & steps to reproduce
The test platform is a Strix Halo laptop (Zbook Ultra G1a, 128 GB unified memory) running Ubuntu 24.04 with kernel 6.19.4 and ROCm 7.2.
When running a self-compiled version of llama.cpp b8180 with the stated command line, I got the following results:
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d32768 | 343.99 ± 1.68 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d32768 | 24.21 ± 0.41 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d65536 | 251.15 ± 8.99 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d65536 | 22.09 ± 0.07 |
When I run the same benchmark using the ROCm 7.2 binaries downloaded directly from the Releases page, this is what I get:
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d32768 | 93.26 ± 1.15 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d32768 | 22.42 ± 0.32 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d65536 | 49.72 ± 1.26 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d65536 | 18.38 ± 0.33 |
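To put the gap in numbers, here is a quick check computing the speedup of the self-compiled build over the release binaries (the pp512 t/s values are copied from the two tables above):

```shell
# Ratio of self-compiled vs. release-binary prompt-processing throughput
# (t/s values taken from the tables above)
awk 'BEGIN {
    printf "pp512 @ d32768: %.2fx faster self-compiled\n", 343.99 / 93.26
    printf "pp512 @ d65536: %.2fx faster self-compiled\n", 251.15 / 49.72
}'
```

That is roughly a 3.7x gap at d32768, widening to over 5x at d65536.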
A known performance regression previously observed in ROCm 7+ builds (compared to ROCm 6.4.4) has been resolved via a workaround.
That issue was caused by a compiler regression (llvm/llvm-project#147700) affecting loop-unrolling thresholds. I have applied the workaround (`-mllvm --amdgpu-unroll-threshold-local=600`) in my builds, restoring full performance.
This workaround will be removed once the upstream fix lands. For details, see kyuz0/amd-strix-halo-toolboxes#45.
Still, I do not think it explains everything. These are my compilation flags:
cmake -S . \
    -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGPU_TARGETS=gfx1151 \
    -DGGML_HIP_UMA=OFF \
    -DGGML_HIP_ROCWMMA_FATTN=OFF \
    -DGGML_HIP_GRAPHS=ON \
    -DROCM_PATH=/opt/rocm \
    -DHIP_PLATFORM=amd \
    -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" \
    -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON \
    -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_CUDA_FORCE_MMQ=OFF \
    -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DGGML_NATIVE=ON \
    -DGGML_OPENMP=ON \
    -DCMAKE_CXX_FLAGS="-I$HIP_INCLUDE_PATH" \
    -DCMAKE_C_FLAGS="-I$HIP_INCLUDE_PATH" \
    -DGGML_RPC=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DCMAKE_BUILD_TYPE=Release
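For completeness, a minimal sketch of the build step that would follow the configure command above (the report does not show this step; it assumes the default `build/` directory set by `-B build`):

```shell
# Compile all targets in parallel; Release mode is already set at configure time
cmake --build build -j"$(nproc)"
```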
I hope this can help. Thanks a lot for this great software :)
Kind Regards
Franky
First Bad Commit
I've observed this under-performance since the first release build with ROCm 7.2 binaries.
Relevant log output
Logs
Self-Compiled llama.cpp:
ROCBLAS_USE_HIPBLASLT=0 llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d32768 | 343.99 ± 1.68 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d32768 | 24.21 ± 0.41 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d65536 | 251.15 ± 8.99 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d65536 | 22.09 ± 0.07 |
build: d979f2b17 (8180)
Release b8180 ROCm 7.2 binary:
ROCBLAS_USE_HIPBLASLT=0 ./llama-bench --mmap 0 -fa 1 -b 1024 -ub 2048 -d 32768,65536 -m ~/.cache/huggingface/models/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
load_backend: loaded ROCm backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-hip.so
load_backend: loaded RPC backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-rpc.so
load_backend: loaded CPU backend from /home/franky/Downloads/llama-b8180-bin-ubuntu-rocm-7.2-x64/llama-b8180/libggml-cpu-zen4.so
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d32768 | 93.26 ± 1.15 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d32768 | 22.42 ± 0.32 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 @ d65536 | 49.72 ± 1.26 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 @ d65536 | 18.38 ± 0.33 |
build: d979f2b17 (8180)