
UPSTREAM PR #16829: cpu: introduce chunking for flash attention #6

Closed
DajanaV wants to merge 1 commit into main from upstream-PR16829-branch_qualcomm-flashattn-chunking

Conversation

@DajanaV (Collaborator) commented Oct 28, 2025

Mirrored from ggml-org/llama.cpp#16829

A very simple dynamic Flash Attention chunking that splits the work into n_threads * 4 chunks.

This helps on platforms with a significant performance difference between CPU cores (i.e. big.LITTLE, boosted cores, etc.) and it also helps under heavy CPU load. Very similar to what MatMul and MatMul-ID chunking already does.

Flash Attention is a relatively small part of the overall profile, so the end-to-end token rate is not affected that much, but if I run it in isolation I see a nice bump in performance on the Gen5.

## Snapdragon Gen5, Llama 3.2 3B Q4_0 (most Ops except FA disabled)
before
```
llama_perf_context_print: prompt eval time =     258.31 ms /   205 tokens (    1.26 ms per token,   793.61 tokens per second)
llama_perf_context_print:        eval time =     499.05 ms /    63 runs   (    7.92 ms per token,   126.24 tokens per second)
```

after
```
llama_perf_context_print: prompt eval time =     216.11 ms /   205 tokens (    1.05 ms per token,   948.60 tokens per second)
llama_perf_context_print:        eval time =     477.52 ms /    63 runs   (    7.58 ms per token,   131.93 tokens per second)
```

## Snapdragon Gen5, Llama 3.2 3B Q4_0 (most Ops except FA disabled)
before
```
llama_perf_context_print: prompt eval time =     171.04 ms /   205 tokens (    0.83 ms per token,  1198.56 tokens per second)
llama_perf_context_print:        eval time =     290.58 ms /    63 runs   (    4.61 ms per token,   216.81 tokens per second)
```

after
```
llama_perf_context_print: prompt eval time =     164.80 ms /   205 tokens (    0.80 ms per token,  1243.91 tokens per second)
llama_perf_context_print:        eval time =     285.91 ms /    63 runs   (    4.54 ms per token,   220.35 tokens per second)
```

Also tested on the M4 Pro, where I don't see any performance changes on an unloaded system, but a loaded system is a different story.
Here are some more details with additional instrumentation that measures how many chunks each thread processed and how long it took.
You can see how under load some threads process more chunks in about the same amount of time on the M4 Pro.
On the Gen5 you can see that one of the cores crunches through many more chunks than the other cores.
The picture is similar on the Gen4 (8-Elite).

Details
M4 Pro (GPT-OSS-20B) 6 threads
Under heavy load (compiling llama.cpp with x86-64 android-ndk)
```
thread-3: fa __fattn__-23 proc-chunks 4 proc-usec 3440
thread-4: fa __fattn__-23 proc-chunks 4 proc-usec 3518
thread-1: fa __fattn__-23 proc-chunks 4 proc-usec 3550
thread-0: fa __fattn__-23 proc-chunks 4 proc-usec 3615
thread-2: fa __fattn__-23 proc-chunks 4 proc-usec 3680
thread-5: fa __fattn__-23 proc-chunks 4 proc-usec 3891
thread-5: fa __fattn__-0 proc-chunks 4 proc-usec 3137
thread-0: fa __fattn__-0 proc-chunks 4 proc-usec 3178
thread-3: fa __fattn__-0 proc-chunks 4 proc-usec 3241
thread-4: fa __fattn__-0 proc-chunks 5 proc-usec 3857
thread-1: fa __fattn__-0 proc-chunks 5 proc-usec 3956
thread-2: fa __fattn__-0 proc-chunks 2 proc-usec 4815
thread-3: fa __fattn__-1 proc-chunks 5 proc-usec 4924
thread-5: fa __fattn__-1 proc-chunks 2 proc-usec 5611
thread-4: fa __fattn__-1 proc-chunks 3 proc-usec 5713
thread-2: fa __fattn__-1 proc-chunks 6 proc-usec 5735
thread-1: fa __fattn__-1 proc-chunks 6 proc-usec 5853
thread-0: fa __fattn__-1 proc-chunks 2 proc-usec 6049
thread-0: fa __fattn__-2 proc-chunks 4 proc-usec 3204
thread-4: fa __fattn__-2 proc-chunks 4 proc-usec 3309
thread-5: fa __fattn__-2 proc-chunks 2 proc-usec 3374
thread-2: fa __fattn__-2 proc-chunks 5 proc-usec 3915
thread-3: fa __fattn__-2 proc-chunks 5 proc-usec 3999
thread-1: fa __fattn__-2 proc-chunks 4 proc-usec 5146
thread-5: fa __fattn__-3 proc-chunks 4 proc-usec 3829
thread-2: fa __fattn__-3 proc-chunks 4 proc-usec 3973
thread-3: fa __fattn__-3 proc-chunks 5 proc-usec 4420
thread-4: fa __fattn__-3 proc-chunks 4 proc-usec 4615
thread-0: fa __fattn__-3 proc-chunks 5 proc-usec 4732
thread-1: fa __fattn__-3 proc-chunks 2 proc-usec 4775
```

Snapdragon 8E Gen5 (GPT-OSS-20B) 6 threads
```
thread-4: fa __fattn__-0 proc-chunks 2 proc-usec 4476
thread-0: fa __fattn__-0 proc-chunks 12 proc-usec 4565
thread-2: fa __fattn__-0 proc-chunks 2 proc-usec 4530
thread-5: fa __fattn__-0 proc-chunks 2 proc-usec 4720
thread-1: fa __fattn__-0 proc-chunks 3 proc-usec 6863
thread-3: fa __fattn__-0 proc-chunks 3 proc-usec 7170
thread-3: fa __fattn__-1 proc-chunks 2 proc-usec 5105
thread-0: fa __fattn__-1 proc-chunks 14 proc-usec 5242
thread-1: fa __fattn__-1 proc-chunks 2 proc-usec 5285
thread-4: fa __fattn__-1 proc-chunks 2 proc-usec 5435
thread-2: fa __fattn__-1 proc-chunks 2 proc-usec 5478
thread-5: fa __fattn__-1 proc-chunks 2 proc-usec 5593
thread-1: fa __fattn__-2 proc-chunks 2 proc-usec 4740
thread-0: fa __fattn__-2 proc-chunks 13 proc-usec 4827
thread-5: fa __fattn__-2 proc-chunks 2 proc-usec 4831
thread-4: fa __fattn__-2 proc-chunks 2 proc-usec 4894
thread-2: fa __fattn__-2 proc-chunks 2 proc-usec 5439
thread-3: fa __fattn__-2 proc-chunks 3 proc-usec 7006
thread-2: fa __fattn__-3 proc-chunks 2 proc-usec 3843
thread-5: fa __fattn__-3 proc-chunks 2 proc-usec 4030
thread-0: fa __fattn__-3 proc-chunks 11 proc-usec 4111
thread-4: fa __fattn__-3 proc-chunks 3 proc-usec 5664
thread-1: fa __fattn__-3 proc-chunks 3 proc-usec 5795
thread-3: fa __fattn__-3 proc-chunks 3 proc-usec 5820
```

Galaxy S25+ (Llama 3.2 3B) 6 threads
```
thread-0: fa __fattn__-10 proc-chunks 6 proc-usec 78
thread-5: fa __fattn__-10 proc-chunks 3 proc-usec 80
thread-2: fa __fattn__-10 proc-chunks 3 proc-usec 80
thread-4: fa __fattn__-10 proc-chunks 3 proc-usec 80
thread-1: fa __fattn__-10 proc-chunks 3 proc-usec 80
thread-3: fa __fattn__-10 proc-chunks 6 proc-usec 78
thread-0: fa __fattn__-11 proc-chunks 6 proc-usec 75
thread-5: fa __fattn__-11 proc-chunks 6 proc-usec 75
thread-1: fa __fattn__-11 proc-chunks 3 proc-usec 78
thread-3: fa __fattn__-11 proc-chunks 3 proc-usec 78
thread-2: fa __fattn__-11 proc-chunks 3 proc-usec 78
thread-4: fa __fattn__-11 proc-chunks 3 proc-usec 78
```

I'm going to submit a couple more related PRs:

  • Enabling CPU MatMul-ID chunking on ARM64
  • Introducing very similar chunking, i.e. chunk_size = nrows / (n_threads * 4), for the Repack MatMuls.

Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop
on top that handles the chunks.
@loci-review-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #6 Flash Attention Chunking

Key Findings

Performance Degradations Identified

Minimal Performance Impact:

  • Response Time: Worst degradation in _RegexMask constructor (0.082% increase, 18.4 picoseconds)
  • Throughput: Worst degradation in make_unique for graph input position bucket (0.117% increase, 121.7 picoseconds)
  • Bottleneck: Same function as throughput with 0.141% increase (121.7 picoseconds)

Core Function Impact Assessment:

  • No direct impact on core inference functions: The identified degradations occur in utility functions (_RegexMask constructor, make_unique template) rather than critical inference paths
  • Flash Attention optimization: PR #6 (mirroring upstream PR #16829, "cpu: introduce chunking for flash attention") introduces chunking for Flash Attention, which is a performance-critical component but represents a small portion of the overall inference profile
  • Inference pipeline preservation: Core model loading, tokenization, and sampling functions remain unaffected

Power Consumption Analysis

Negligible Energy Impact:

  • build.bin.libllama.so: -0.001% change (303,377 nJ vs 303,379 nJ) - effectively no change
  • All other binaries: 0.0% change across libggml.so, libggml-cpu.so, and libggml-base.so
  • Overall assessment: Computational workload remains stable with no meaningful energy consumption changes

Flame Graph and CFG Analysis Insights

_RegexMask Constructor Analysis:

  • Single basic block execution: No function calls or complex control flow
  • Identical assembly code: Both versions contain byte-for-byte identical ARM64 instructions
  • Performance paradox: 0.082% degradation despite identical code suggests micro-architectural effects (cache alignment, instruction placement, or memory layout changes)
  • Optimization focus: Issue lies outside instruction stream - likely binary layout or memory subsystem timing

Root Cause Assessment:

  • Not a code regression: Performance difference stems from build-time or link-time factors rather than algorithmic changes
  • Measurement artifact potential: The minimal degradation may be within measurement noise or caused by external factors

GitHub Code Review Critical Issues

Flash Attention Chunking Implementation:

  • Well-engineered optimization: Introduces dynamic work-stealing for heterogeneous CPU architectures
  • Proven performance gains: 16-19% improvement in Flash Attention operations on Snapdragon Gen5
  • Architecture-aware design: NUMA-conscious fallback prevents performance regressions
  • No critical risks identified: Implementation follows established patterns with appropriate safeguards

Threading Complexity Considerations:

  • Increased coordination overhead: Added barrier synchronization and chunk management
  • Debug complexity: Work-stealing makes performance analysis more challenging
  • Fallback mechanisms: Proper handling of single-threaded and NUMA scenarios

Overall Assessment

Change Impact Evaluation

Positive Outcomes:

  • Targeted optimization success: Flash Attention chunking delivers measurable improvements on heterogeneous architectures without affecting core inference logic
  • Minimal performance regression: Identified degradations are sub-0.15% and occur in non-critical utility functions
  • Energy efficiency maintained: No meaningful change in power consumption across all binaries
  • Architecture compatibility: Changes enhance performance on big.LITTLE and boosted core systems while preserving compatibility

Technical Quality:

  • Code structure preservation: Core llama.cpp architecture remains intact with changes isolated to GGML tensor operations
  • Performance-conscious implementation: Chunking algorithm includes appropriate safeguards and fallback mechanisms
  • Maintainability: Clear separation between chunk orchestration and computation logic enhances code organization

Maintainability and Future Considerations

Maintainability Assessment:

  • Positive: Modular design with clear separation of concerns between chunking logic and computation
  • Positive: NUMA-aware fallbacks prevent performance regressions on different architectures
  • Consideration: Increased threading complexity requires careful attention during future modifications

Future Performance Outlook:

  • Scalability: Dynamic chunking provides foundation for better multi-core utilization as models grow larger
  • Adaptability: Architecture-aware design positions codebase well for emerging heterogeneous processors
  • Optimization potential: Framework established for similar chunking optimizations in other compute-intensive operations

Recommendation:
The changes represent a net positive improvement to the llama.cpp codebase. The Flash Attention chunking optimization delivers meaningful performance gains on target architectures while the minimal utility function degradations appear to be measurement artifacts rather than genuine regressions. The implementation quality is high with appropriate safeguards, making this a low-risk enhancement that improves the project's performance profile on modern heterogeneous CPU architectures.

@DajanaV DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a Compare October 29, 2025 12:13
@DajanaV DajanaV added the dev-stale Stale dev environment — dashboard not accessible label Oct 30, 2025
@DajanaV DajanaV deleted the branch main October 30, 2025 15:25
@DajanaV DajanaV closed this Oct 30, 2025
@DajanaV DajanaV deleted the upstream-PR16829-branch_qualcomm-flashattn-chunking branch October 30, 2025 15:26
loci-dev pushed a commit that referenced this pull request Nov 30, 2025
loci-dev pushed a commit that referenced this pull request Feb 17, 2026
Support device-specific host buffer types in meta backend