rocm7-nightlies build 8070 (cc45f2ada) has a severe prefill regression: prefill throughput dropped from ~250 tok/s to ~75 tok/s, while generation speed is unaffected.
Setup:
- GMKtec EVO-X2 (Ryzen AI Max+ 395, 128GB)
- Kernel 6.18.7-200.fc43, firmware 20260110
- Model: Qwen3-Coder-30B-A3B-Instruct BF16
Expected (as measured yesterday): ~250 tok/s prefill
Actual (build 8070): ~75 tok/s prefill
Generation speed: 27 tok/s (as expected, not affected)
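To put a number on the regression, the two prefill figures above can be compared directly. A minimal sketch in shell; the helper name `slowdown` is made up for illustration and only does arithmetic on the reported numbers, it does not measure anything:

```shell
# Hypothetical helper: compare two prefill throughput figures (tok/s).
# Usage: slowdown <expected_tok_s> <actual_tok_s>
slowdown() {
    awk -v e="$1" -v a="$2" 'BEGIN {
        # e / a is the slowdown factor; (1 - a / e) * 100 is the percentage drop
        printf "%.1fx slower, %.0f%% drop\n", e / a, (1 - a / e) * 100
    }'
}

# Figures from this report: ~250 tok/s expected, ~75 tok/s on build 8070
slowdown 250 75
```

With the approximate figures from this report, that works out to about 3.3x slower prefill, a drop of about 70%.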
Tested with both ROCBLAS_USE_HIPBLASLT=1 and ROCBLAS_USE_HIPBLASLT=0; same result in both cases.
Command:
llama-server -m model.gguf --jinja --no-mmap -fa 1 -ngl 999 -c 65536
Image digest: sha256:6c1fc630d304c02c9c726a17407923f9951ac8e18e3e8b84f0c1685d0c15c2ba