UPSTREAM PR #16827: Massively Improved ROCm/HIP rocWMMA Performance (pp and tg)#12
Closed
…idency on HIP via __launch_bounds__ (min 2 blocks/SM)

- Adaptive KQ stride on HIP: 128 for D<=128 to reduce LDS footprint
- Update loops and launch to use the adaptive stride; bump nwarps for small D
- No behavior change on CUDA; improves prefill perf on RDNA3
…E and adding a safe fallback

- Do not select WMMA for decode on HIP; fall through to VEC/TILE
- Remove WMMA TILE pruning on HIP to avoid device traps; keep for CUDA WMMA
- Add decode-time guard: if predicted TILE split has no config, select VEC
- Remove ad-hoc env overrides and debug prints
Access the complete analysis in the LOCI Dashboard.

**Performance Analysis Summary: ROCm/HIP rocWMMA Optimization (PR #12)**

Key Findings:

- Performance Impact Analysis
- Flame Graph & CFG Analysis
- Code Review Critical Insights
- Risk Assessment

**Overall Assessment**

Impact Evaluation: The changes represent a high-value, low-risk optimization for the llama.cpp codebase.

- Maintainability Considerations
- Future Performance Outlook

**Recommendation**

Approve with monitoring: The PR delivers significant performance improvements for ROCm users while maintaining stability for other platforms. The minimal core degradation (0.066% PLT overhead) is acceptable given the substantial gains achieved. Implement performance monitoring for decode scenarios, and consider refactoring the kernel-selection logic in future iterations to reduce maintenance complexity.

Priority: The changes address a critical performance regression in ROCm builds and should be merged to restore competitive performance for AMD GPU users in the llama.cpp ecosystem.
Mirrored from ggml-org/llama.cpp#16827
In the HIP build docs, `-DGGML_HIP_ROCWMMA_FATTN=ON` is recommended for improved FA performance on RDNA3+/CDNA, and in broad pp512/tg128 performance testing it is usually the best option. However, some users have noticed severe performance degradation, especially with decode (tg) as context gets longer. I noticed it too, and while I was doing some other spelunking, I found what seemed like some relatively easy wins. There was a bit more fussing than I expected, but I ended up with a relatively clean patch that both fixes the long-context tg regression and optimizes the WMMA path for RDNA.
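For context, that flag is passed at configure time. A typical HIP build invocation looks something like the sketch below; the GPU target (`gfx1151`, i.e. Strix Halo) and build directory are illustrative, not prescriptive:

```sh
# Illustrative configure line for a ROCm/HIP build with rocWMMA FlashAttention.
# AMDGPU_TARGETS value is an example; set it to your GPU's gfx architecture.
cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DAMDGPU_TARGETS=gfx1151
cmake --build build --config Release -j
```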
The perf improvements are non-trivial and since the changes are all isolated, hopefully it won't be too hard to merge. Here's some performance testing on my Strix Halo (RDNA3.5) w/ ROCm 7.10.0a20251018:
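The pp512/tg128 figures below come from llama-bench-style runs. A sketch of the kind of invocation involved (model path and context depths are illustrative examples, not the exact commands used):

```sh
# Illustrative llama-bench run: 512-token prefill (pp512) and 128-token
# decode (tg128), repeated at increasing context depths to expose the
# long-context tg regression. Paths and depth values are examples.
./build/bin/llama-bench \
    -m models/llama-3.2-1b-q4_k_m.gguf \
    -p 512 -n 128 \
    -d 0,4096,16384
```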
Llama 3.2 1B Q4_K_M
Previous rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs Previous rocWMMA
Prefill (pp)
Decode (tg)
gpt-oss-20b F16/MXFP4
Previous rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs Previous rocWMMA
Prefill (pp)
Decode (tg)
I only tested small models while I was developing, but I'm running gpt-oss-120b overnight. Since Llama 3.2 1B (dense) and gpt-oss-20b (MoE) show similar gains, I'm expecting something not so different as context grows...