UPSTREAM PR #16827: Massively Improved ROCm/HIP rocWMMA Performance (pp and tg)#12
Closed
…idency on HIP via __launch_bounds__ (min 2 blocks/SM)

- Adaptive KQ stride on HIP: 128 for D<=128 to reduce LDS footprint
- Update loops and launch to use the adaptive stride; bump nwarps for small D
- No behavior change on CUDA; improves prefill perf on RDNA3
…E and adding a safe fallback

- Do not select WMMA for decode on HIP; fall through to VEC/TILE
- Remove WMMA TILE pruning on HIP to avoid device traps; keep for CUDA WMMA
- Add decode-time guard: if predicted TILE split has no config, select VEC
- Remove ad-hoc env overrides and debug prints
Access the complete analysis in the LOCI Dashboard.

**Performance Analysis Summary: ROCm/HIP rocWMMA Optimization (PR #12)**

Key Findings:

- Performance Impact Analysis
- Flame Graph & CFG Analysis
- Code Review Critical Insights
- Risk Assessment

**Overall Assessment**

Impact Evaluation: The changes represent a high-value, low-risk optimization for the llama.cpp codebase.

- Maintainability Considerations
- Future Performance Outlook

**Recommendation**

Approve with monitoring: The PR delivers significant performance improvements for ROCm users while maintaining stability for other platforms. The minimal core degradation (0.066% PLT overhead) is acceptable given the substantial gains achieved. Implement performance monitoring for decode scenarios, and consider refactoring the kernel-selection logic in future iterations to reduce maintenance complexity.

Priority: The changes address a critical performance regression in ROCm builds and should be merged to restore competitive performance for AMD GPU users in the llama.cpp ecosystem.
Mirrored from ggml-org/llama.cpp#16827
In the HIP build docs, `-DGGML_HIP_ROCWMMA_FATTN=ON` is recommended for improved FA performance on RDNA3+/CDNA, and in broad pp512/tg128 performance testing it is usually the best option. However, some users have noticed severe performance degradation, especially with decode (tg) as context gets longer. I noticed it too, and while I was doing some other spelunking, I found what seemed like some relatively easy wins. There was a bit more fussing than I expected, but I ended up with a relatively clean patch that both fixes the long-context tg regression and optimizes the WMMA path for RDNA.
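For context, that flag is passed at configure time. A typical HIP build invocation looks something like the sketch below; the GPU target (`gfx1151`, i.e. Strix Halo) and build directory are illustrative, not prescriptive:

```sh
# Illustrative configure line for a ROCm/HIP build with rocWMMA FlashAttention.
# AMDGPU_TARGETS value is an example; set it to your GPU's gfx architecture.
cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DAMDGPU_TARGETS=gfx1151
cmake --build build --config Release -j
```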
The perf improvements are non-trivial and since the changes are all isolated, hopefully it won't be too hard to merge. Here's some performance testing on my Strix Halo (RDNA3.5) w/ ROCm 7.10.0a20251018:
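The pp512/tg128 figures below come from llama-bench-style runs. A sketch of the kind of invocation involved (model path and context depths are illustrative examples, not the exact commands used):

```sh
# Illustrative llama-bench run: 512-token prefill (pp512) and 128-token
# decode (tg128), repeated at increasing context depths to expose the
# long-context tg regression. Paths and depth values are examples.
./build/bin/llama-bench \
    -m models/llama-3.2-1b-q4_k_m.gguf \
    -p 512 -n 128 \
    -d 0,4096,16384
```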
Llama 3.2 1B Q4_K_M
Previous rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs Previous rocWMMA
Prefill (pp)
Decode (tg)
gpt-oss-20b F16/MXFP4
Previous rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs Previous rocWMMA
Prefill (pp)
Decode (tg)
I only tested small models while I was developing, but I'm running gpt-oss-120b overnight. Since Llama 3.2 1B (dense) and gpt-oss-20b (MoE) show similar gains, I'm expecting something not so different as context grows...