Misc. bug: HIP backend performs poorly on AMD Ryzen AI MAX 395 (Strix Halo gfx1151) #13565

@lhl

Name and Version

❯ build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
version: 5392 (c753d7be)
built with cc (GCC) 15.0.1 20250418 (Red Hat 15.0.1-0) for x86_64-redhat-linux

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

llama.cpp-cpu/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf
llama.cpp-vulkan/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf
llama.cpp-hip/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf

Problem description & steps to reproduce

Recently I've been testing a Strix Halo (gfx1151) system and was a bit surprised by how poorly the HIP backend ran. All tests were run with llama-bench built on HEAD (b5392) with the standard TheBloke/Llama-2-7B-GGUF (Q4_0):

| Backend | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- |
| CPU | 304.42 ± 2.05 | 28.65 ± 0.03 |
| HIP | 348.62 ± 0.35 | 48.70 ± 0.02 |
| Vulkan | 881.38 ± 2.11 | 52.82 ± 0.04 |

The HIP version performs far below what you'd expect in terms of tok/TFLOP efficiency for prompt processing compared with other RDNA3 architectures:

  • gfx1103 Radeon 780M iGPU gets 14.51 tok/TFLOP. At that efficiency you'd expect about 850 tok/s, roughly what the Vulkan backend delivers.
  • gfx1100 Radeon 7900 XTX gets 25.12 tok/TFLOP. At that efficiency you'd expect almost 1500 tok/s, about 1.7X what the Vulkan backend delivers and >4X what the current HIP backend delivers.
  • HIP pp512 barely beats the CPU backend (348.62 vs 304.42 t/s). I don't have an explanation for this.
  • For reference on how bad the HIP performance is: an 18CU M3 Pro has ~12.8 FP16 TFLOPS (4.6X less compute than Strix Halo) and delivers about the same pp512. Lunar Lake's Arc 140V has 32 FP16 TFLOPS (roughly half of Strix Halo) and has a pp512 of 657 tok/s (1.9X faster).
  • With the Vulkan backend, pp512 is about the same as an M4 Max and tg128 is about equivalent to an M4 Pro.
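To make the tok/TFLOP comparison above reproducible, here is a minimal sketch of the arithmetic. The Strix Halo peak FP16 figure is an assumption: it is not stated directly in this issue, so the sketch infers it from the "4.6X less compute" comparison with the M3 Pro (12.8 × 4.6 ≈ 58.9 TFLOPS). The function and constant names are illustrative only.

```python
# Assumption: Strix Halo peak FP16 throughput is inferred from the issue's own
# ratio (M3 Pro at ~12.8 TFLOPS is "4.6X less compute"), not from a spec sheet.
STRIX_HALO_FP16_TFLOPS = 12.8 * 4.6  # ~58.9

# pp512 results (t/s) for the two GPU backends, taken from the results table.
MEASURED_PP512 = {"HIP": 348.62, "Vulkan": 881.38}

# Efficiencies (tok/TFLOP) quoted in the issue for other RDNA3 parts.
REFERENCE_EFFICIENCY = {
    "gfx1103 (Radeon 780M)": 14.51,
    "gfx1100 (Radeon 7900 XTX)": 25.12,
}


def tok_per_tflop(tokens_per_s: float, peak_tflops: float) -> float:
    """Prompt-processing efficiency: tokens per second per peak TFLOP."""
    return tokens_per_s / peak_tflops


def expected_pp512(efficiency: float, peak_tflops: float) -> float:
    """pp512 you would see if a device ran at the given tok/TFLOP efficiency."""
    return efficiency * peak_tflops


if __name__ == "__main__":
    for backend, pp512 in MEASURED_PP512.items():
        eff = tok_per_tflop(pp512, STRIX_HALO_FP16_TFLOPS)
        print(f"{backend}: {eff:.2f} tok/TFLOP on gfx1151")

    for name, eff in REFERENCE_EFFICIENCY.items():
        target = expected_pp512(eff, STRIX_HALO_FP16_TFLOPS)
        print(f"at {name} efficiency: ~{target:.0f} tok/s expected")
```

Under that assumed peak, the HIP backend lands at roughly 6 tok/TFLOP, while Vulkan is close to the 780M's efficiency; the exact values shift if a different peak TFLOPS figure is used.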

With monitoring, I've confirmed that both HIP and Vulkan reach the max graphics clock. This system is running Linux 6.15.0-0.rc3, so it should be up to date with the latest AMDGPU drivers.

These results are from a standard llama-bench run. I've also tried -fa 1 and a rocWMMA build, but neither makes much difference, so I've excluded them for clarity.

One interesting observation that may help track down a potential regression: when I compile with gfx1100 support and run with HSA_OVERRIDE_GFX_VERSION=11.0.0, pp512 rises about 1.7X to 598.84 ± 1.41. This eventually leads to MES/kernel errors, so it is obviously not recommended for use; it's just a data point that might help in tracking down the issue.
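As a quick check on how much the gfx1100 override recovers, the following sketch computes the relevant ratios from the pp512 means reported in this issue. The helper name is hypothetical, and the error bars are ignored for simplicity.

```python
# pp512 means (t/s) reported in this issue; the ± error bars are ignored here.
HIP_NATIVE_GFX1151 = 348.62
HIP_OVERRIDE_GFX1100 = 598.84  # built for gfx1100, HSA_OVERRIDE_GFX_VERSION=11.0.0
VULKAN = 881.38


def speedup(new_tok_per_s: float, baseline_tok_per_s: float) -> float:
    """Throughput ratio of a new result over a baseline."""
    return new_tok_per_s / baseline_tok_per_s


if __name__ == "__main__":
    gain = speedup(HIP_OVERRIDE_GFX1100, HIP_NATIVE_GFX1151)
    vs_vulkan = speedup(HIP_OVERRIDE_GFX1100, VULKAN)
    print(f"override vs native HIP: {gain:.2f}x")
    print(f"override vs Vulkan:     {vs_vulkan:.2f}x")
```

Even with the override, HIP reaches only about two-thirds of the Vulkan pp512, so the gfx1100 code path narrows the gap without closing it.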

First Bad Commit

No response

Relevant log output
