
UPSTREAM PR #16829: cpu: introduce chunking for flash attention #6

Closed
DajanaV wants to merge 1 commit into main from upstream-PR16829-branch_qualcomm-flashattn-chunking

Conversation

@DajanaV (Collaborator) commented Oct 28, 2025

Mirrored from ggml-org/llama.cpp#16829

A very simple dynamic Flash Attention chunking that splits the work into n_threads * 4 chunks.

This helps on platforms with a significant performance difference between CPU cores (i.e. big.LITTLE, boosted cores, etc.) and it also helps under heavy CPU load. Very similar to what MatMul and MatMul-ID chunking already does.

Flash Attention is a relatively small part of the overall profile, so the end-to-end token rate is not affected that much, but if I run it in isolation I see a nice bump in performance on the Gen5.

## Snapdragon Gen5, Llama 3.2 3B Q4_0 (most Ops except FA disabled)
before
```
llama_perf_context_print: prompt eval time =     258.31 ms /   205 tokens (    1.26 ms per token,   793.61 tokens per second)
llama_perf_context_print:        eval time =     499.05 ms /    63 runs   (    7.92 ms per token,   126.24 tokens per second)
```

after
```
llama_perf_context_print: prompt eval time =     216.11 ms /   205 tokens (    1.05 ms per token,   948.60 tokens per second)
llama_perf_context_print:        eval time =     477.52 ms /    63 runs   (    7.58 ms per token,   131.93 tokens per second)
```

## Snapdragon Gen5, Llama 3.2 3B Q4_0 (most Ops except FA disabled)
before
```
llama_perf_context_print: prompt eval time =     171.04 ms /   205 tokens (    0.83 ms per token,  1198.56 tokens per second)
llama_perf_context_print:        eval time =     290.58 ms /    63 runs   (    4.61 ms per token,   216.81 tokens per second)
```

after
```
llama_perf_context_print: prompt eval time =     164.80 ms /   205 tokens (    0.80 ms per token,  1243.91 tokens per second)
llama_perf_context_print:        eval time =     285.91 ms /    63 runs   (    4.54 ms per token,   220.35 tokens per second)
```

Also tested on the M4 Pro, where I don't see any performance changes on an unloaded system, but a loaded system is a different story.
Here are some more details with additional instrumentation that measures how many chunks each thread processed and how long it took.
You can see how under load some threads process more chunks in about the same amount of time on the M4 Pro.
On the Gen5 you can see that one of the cores crunches through many more chunks than the other cores.
The picture is similar on the Gen4 (8-Elite).

Details
M4 Pro (GPT-OSS-20B) 6 threads
Under heavy load (compiling llama.cpp with x86-64 android-ndk)
```
thread-3: fa __fattn__-23 proc-chunks 4 proc-usec 3440
thread-4: fa __fattn__-23 proc-chunks 4 proc-usec 3518
thread-1: fa __fattn__-23 proc-chunks 4 proc-usec 3550
thread-0: fa __fattn__-23 proc-chunks 4 proc-usec 3615
thread-2: fa __fattn__-23 proc-chunks 4 proc-usec 3680
thread-5: fa __fattn__-23 proc-chunks 4 proc-usec 3891
thread-5: fa __fattn__-0 proc-chunks 4 proc-usec 3137
thread-0: fa __fattn__-0 proc-chunks 4 proc-usec 3178
thread-3: fa __fattn__-0 proc-chunks 4 proc-usec 3241
thread-4: fa __fattn__-0 proc-chunks 5 proc-usec 3857
thread-1: fa __fattn__-0 proc-chunks 5 proc-usec 3956
thread-2: fa __fattn__-0 proc-chunks 2 proc-usec 4815
thread-3: fa __fattn__-1 proc-chunks 5 proc-usec 4924
thread-5: fa __fattn__-1 proc-chunks 2 proc-usec 5611
thread-4: fa __fattn__-1 proc-chunks 3 proc-usec 5713
thread-2: fa __fattn__-1 proc-chunks 6 proc-usec 5735
thread-1: fa __fattn__-1 proc-chunks 6 proc-usec 5853
thread-0: fa __fattn__-1 proc-chunks 2 proc-usec 6049
thread-0: fa __fattn__-2 proc-chunks 4 proc-usec 3204
thread-4: fa __fattn__-2 proc-chunks 4 proc-usec 3309
thread-5: fa __fattn__-2 proc-chunks 2 proc-usec 3374
thread-2: fa __fattn__-2 proc-chunks 5 proc-usec 3915
thread-3: fa __fattn__-2 proc-chunks 5 proc-usec 3999
thread-1: fa __fattn__-2 proc-chunks 4 proc-usec 5146
thread-5: fa __fattn__-3 proc-chunks 4 proc-usec 3829
thread-2: fa __fattn__-3 proc-chunks 4 proc-usec 3973
thread-3: fa __fattn__-3 proc-chunks 5 proc-usec 4420
thread-4: fa __fattn__-3 proc-chunks 4 proc-usec 4615
thread-0: fa __fattn__-3 proc-chunks 5 proc-usec 4732
thread-1: fa __fattn__-3 proc-chunks 2 proc-usec 4775
```

Snapdragon 8E Gen5 (GPT-OSS-20B) 6 threads
```
thread-4: fa __fattn__-0 proc-chunks 2 proc-usec 4476
thread-0: fa __fattn__-0 proc-chunks 12 proc-usec 4565
thread-2: fa __fattn__-0 proc-chunks 2 proc-usec 4530
thread-5: fa __fattn__-0 proc-chunks 2 proc-usec 4720
thread-1: fa __fattn__-0 proc-chunks 3 proc-usec 6863
thread-3: fa __fattn__-0 proc-chunks 3 proc-usec 7170
thread-3: fa __fattn__-1 proc-chunks 2 proc-usec 5105
thread-0: fa __fattn__-1 proc-chunks 14 proc-usec 5242
thread-1: fa __fattn__-1 proc-chunks 2 proc-usec 5285
thread-4: fa __fattn__-1 proc-chunks 2 proc-usec 5435
thread-2: fa __fattn__-1 proc-chunks 2 proc-usec 5478
thread-5: fa __fattn__-1 proc-chunks 2 proc-usec 5593
thread-1: fa __fattn__-2 proc-chunks 2 proc-usec 4740
thread-0: fa __fattn__-2 proc-chunks 13 proc-usec 4827
thread-5: fa __fattn__-2 proc-chunks 2 proc-usec 4831
thread-4: fa __fattn__-2 proc-chunks 2 proc-usec 4894
thread-2: fa __fattn__-2 proc-chunks 2 proc-usec 5439
thread-3: fa __fattn__-2 proc-chunks 3 proc-usec 7006
thread-2: fa __fattn__-3 proc-chunks 2 proc-usec 3843
thread-5: fa __fattn__-3 proc-chunks 2 proc-usec 4030
thread-0: fa __fattn__-3 proc-chunks 11 proc-usec 4111
thread-4: fa __fattn__-3 proc-chunks 3 proc-usec 5664
thread-1: fa __fattn__-3 proc-chunks 3 proc-usec 5795
thread-3: fa __fattn__-3 proc-chunks 3 proc-usec 5820
```

Galaxy S25+ (Llama 3.2 3B) 6 threads
```
thread-0: fa __fattn__-10 proc-chunks 6 proc-usec 78
thread-5: fa __fattn__-10 proc-chunks 3 proc-usec 80
thread-2: fa __fattn__-10 proc-chunks 3 proc-usec 80
thread-4: fa __fattn__-10 proc-chunks 3 proc-usec 80
thread-1: fa __fattn__-10 proc-chunks 3 proc-usec 80
thread-3: fa __fattn__-10 proc-chunks 6 proc-usec 78
thread-0: fa __fattn__-11 proc-chunks 6 proc-usec 75
thread-5: fa __fattn__-11 proc-chunks 6 proc-usec 75
thread-1: fa __fattn__-11 proc-chunks 3 proc-usec 78
thread-3: fa __fattn__-11 proc-chunks 3 proc-usec 78
thread-2: fa __fattn__-11 proc-chunks 3 proc-usec 78
thread-4: fa __fattn__-11 proc-chunks 3 proc-usec 78
```

I'm going to submit a couple more related PRs:

  • Enabling CPU MatMul-ID chunking on ARM64
  • Introducing very similar chunking, i.e. chunk_size = nrows / (n_threads * 4), for the Repack MatMuls.

Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop
on top that handles the chunks.
@loci-review-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #6 Flash Attention Chunking

Key Findings

Performance Degradations Identified

Minimal Performance Impact:

  • Response Time: Worst degradation in _RegexMask constructor (0.082% increase, 18.4 picoseconds)
  • Throughput: Worst degradation in make_unique for graph input position bucket (0.117% increase, 121.7 picoseconds)
  • Bottleneck: Same function as throughput with 0.141% increase (121.7 picoseconds)

Core Function Impact Assessment:

  • No direct impact on core inference functions: The identified degradations occur in utility functions (_RegexMask constructor, make_unique template) rather than critical inference paths
  • Flash Attention optimization: PR #6 (mirroring upstream PR #16829, "cpu: introduce chunking for flash attention") introduces chunking for Flash Attention, which is a performance-critical component but represents a small portion of the overall inference profile
  • Inference pipeline preservation: Core model loading, tokenization, and sampling functions remain unaffected

Power Consumption Analysis

Negligible Energy Impact:

  • build.bin.libllama.so: -0.001% change (303,377 nJ vs 303,379 nJ) - effectively no change
  • All other binaries: 0.0% change across libggml.so, libggml-cpu.so, and libggml-base.so
  • Overall assessment: Computational workload remains stable with no meaningful energy consumption changes

Flame Graph and CFG Analysis Insights

_RegexMask Constructor Analysis:

  • Single basic block execution: No function calls or complex control flow
  • Identical assembly code: Both versions contain byte-for-byte identical ARM64 instructions
  • Performance paradox: 0.082% degradation despite identical code suggests micro-architectural effects (cache alignment, instruction placement, or memory layout changes)
  • Optimization focus: Issue lies outside instruction stream - likely binary layout or memory subsystem timing

Root Cause Assessment:

  • Not a code regression: Performance difference stems from build-time or link-time factors rather than algorithmic changes
  • Measurement artifact potential: The minimal degradation may be within measurement noise or caused by external factors

GitHub Code Review Critical Issues

Flash Attention Chunking Implementation:

  • Well-engineered optimization: Introduces dynamic work-stealing for heterogeneous CPU architectures
  • Proven performance gains: 16-19% improvement in Flash Attention operations on Snapdragon Gen5
  • Architecture-aware design: NUMA-conscious fallback prevents performance regressions
  • No critical risks identified: Implementation follows established patterns with appropriate safeguards

Threading Complexity Considerations:

  • Increased coordination overhead: Added barrier synchronization and chunk management
  • Debug complexity: Work-stealing makes performance analysis more challenging
  • Fallback mechanisms: Proper handling of single-threaded and NUMA scenarios

Overall Assessment

Change Impact Evaluation

Positive Outcomes:

  • Targeted optimization success: Flash Attention chunking delivers measurable improvements on heterogeneous architectures without affecting core inference logic
  • Minimal performance regression: Identified degradations are sub-0.15% and occur in non-critical utility functions
  • Energy efficiency maintained: No meaningful change in power consumption across all binaries
  • Architecture compatibility: Changes enhance performance on big.LITTLE and boosted core systems while preserving compatibility

Technical Quality:

  • Code structure preservation: Core llama.cpp architecture remains intact with changes isolated to GGML tensor operations
  • Performance-conscious implementation: Chunking algorithm includes appropriate safeguards and fallback mechanisms
  • Maintainability: Clear separation between chunk orchestration and computation logic enhances code organization

Maintainability and Future Considerations

Maintainability Assessment:

  • Positive: Modular design with clear separation of concerns between chunking logic and computation
  • Positive: NUMA-aware fallbacks prevent performance regressions on different architectures
  • Consideration: Increased threading complexity requires careful attention during future modifications

Future Performance Outlook:

  • Scalability: Dynamic chunking provides foundation for better multi-core utilization as models grow larger
  • Adaptability: Architecture-aware design positions codebase well for emerging heterogeneous processors
  • Optimization potential: Framework established for similar chunking optimizations in other compute-intensive operations

Recommendation:
The changes represent a net positive improvement to the llama.cpp codebase. The Flash Attention chunking optimization delivers meaningful performance gains on target architectures while the minimal utility function degradations appear to be measurement artifacts rather than genuine regressions. The implementation quality is high with appropriate safeguards, making this a low-risk enhancement that improves the project's performance profile on modern heterogeneous CPU architectures.

@DajanaV DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a Compare October 29, 2025 12:13
@DajanaV DajanaV added the dev-stale Stale dev environment — dashboard not accessible label Oct 30, 2025
@DajanaV DajanaV deleted the branch main October 30, 2025 15:25
@DajanaV DajanaV closed this Oct 30, 2025
@DajanaV DajanaV deleted the upstream-PR16829-branch_qualcomm-flashattn-chunking branch October 30, 2025 15:26
loci-dev pushed a commit that referenced this pull request Nov 30, 2025
loci-dev pushed a commit that referenced this pull request Feb 17, 2026
Support device-specific host buffer types in meta backend