UPSTREAM PR #16829: cpu: introduce chunking for flash attention#6
Conversation
Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks.
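A minimal sketch of this refactor, assuming illustrative names and a shared atomic chunk counter (the actual ggml code passes tensor and thread-pool state; the `processed` array here exists only so the demo is self-checking):

```c
#include <stdatomic.h>

// Demo bookkeeping only: count how often each row gets processed.
enum { NROWS = 103 };
static int processed[NROWS];

// Stand-in for the factored-out core FA loop over rows [ir0, ir1).
static void flash_attn_f16_one_chunk(int ir0, int ir1) {
    for (int ir = ir0; ir < ir1; ++ir) {
        processed[ir]++;
    }
}

// Outer loop: split nrows into n_threads*4 chunks and let each thread
// dynamically grab the next unclaimed chunk from a shared atomic counter.
static void flash_attn_f16(int nrows, int n_threads, atomic_int * current_chunk) {
    const int n_chunks   = n_threads * 4;
    const int chunk_size = (nrows + n_chunks - 1) / n_chunks; // ceiling division

    int chunk = atomic_fetch_add(current_chunk, 1);
    while (chunk < n_chunks) {
        const int ir0 = chunk * chunk_size;
        int ir1 = ir0 + chunk_size;
        if (ir1 > nrows) ir1 = nrows;
        if (ir0 < ir1) {
            flash_attn_f16_one_chunk(ir0, ir1);
        }
        chunk = atomic_fetch_add(current_chunk, 1);
    }
}
```

Because the counter is atomic, any number of worker threads can call `flash_attn_f16` concurrently and each row is still processed exactly once; a fast core simply ends up claiming more chunks.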
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: llama.cpp PR #6 Flash Attention Chunking
- Key Findings
  - Performance Degradations Identified: Minimal Performance Impact
  - Core Function Impact Assessment
- Power Consumption Analysis: Negligible Energy Impact
- Flame Graph and CFG Analysis Insights
  - _RegexMask Constructor Analysis
  - Root Cause Assessment
- GitHub Code Review Critical Issues
  - Flash Attention Chunking Implementation
  - Threading Complexity Considerations
- Overall Assessment
  - Change Impact Evaluation: Positive Outcomes
  - Technical Quality
- Maintainability and Future Considerations
  - Maintainability Assessment
  - Future Performance Outlook
- Recommendation
Force-pushed from 1983956 to 326a60a
Mirrored from ggml-org/llama.cpp#16829
A very simple dynamic Flash Attention chunking that splits the work into n_threads * 4 chunks. This helps on platforms with a significant performance difference between CPU cores (i.e. big.LITTLE, boosted cores, etc.) and it helps under heavy CPU load. Very similar to what the MatMul and MatMul-ID chunking already does.
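A toy model (my own illustration, not from the PR) of why many small chunks help on asymmetric cores: suppose one "big" core is 3x faster than a "little" core. A static 50/50 row split leaves the big core idle while the little core finishes; with dynamically claimed chunks, whichever core is free takes the next chunk.

```c
// Makespan (time until the last core finishes) with a static 50/50 split.
// 'fast' and 'slow' are the cores' processing rates in rows per time unit.
static double static_split(double rows, double fast, double slow) {
    const double t_fast = (rows / 2) / fast;
    const double t_slow = (rows / 2) / slow;
    return t_fast > t_slow ? t_fast : t_slow;
}

// Makespan with dynamic chunking: greedy simulation where each core takes
// the next chunk as soon as it becomes free.
static double dynamic_chunks(double rows, int n_chunks, double fast, double slow) {
    const double chunk = rows / n_chunks;
    double t_fast = 0.0, t_slow = 0.0;
    for (int i = 0; i < n_chunks; ++i) {
        if (t_fast <= t_slow) {
            t_fast += chunk / fast; // fast core is free first: it takes the chunk
        } else {
            t_slow += chunk / slow;
        }
    }
    return t_fast > t_slow ? t_fast : t_slow;
}
```

With 1024 rows, a 3x speed gap, and 8 chunks (2 threads * 4), the dynamic makespan is roughly half the static one in this model, since the fast core ends up processing three chunks for every one the slow core takes.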
Flash Attention is a relatively small part of the overall profile, so the end-to-end token rate is not affected that much, but if I run it in isolation I see a nice bump in performance on the Gen5.
Also tested on the M4 Pro, where I don't see any performance changes on an unloaded system, but a loaded system is a different story.
Here are some more details with additional instrumentation that measures how many chunks each thread processed and how long it took.
You can see how under load some threads process more chunks in about the same amount of time on the M4 Pro.
On the Gen5 you can see that one of the cores crunches through many more chunks than the other cores.
The picture is similar on the Gen4 (8-Elite).
I'm going to submit a couple more related PRs:
- chunk_size = nrows / (n_threads * 4) for the Repack MatMuls.