hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and MatMul updates#20118

Merged
max-krasnyansky merged 19 commits into ggml-org:master from qualcomm:hexagon-fa-updates-dma-mpyacc on Mar 5, 2026

hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and MatMul updates#20118
max-krasnyansky merged 19 commits intoggml-org:masterfrom
qualcomm:hexagon-fa-updates-dma-mpyacc

Conversation

@max-krasnyansky
Member

Further updates on top of #19780 by @chraac

  • Improved DMA pipelining in FA
  • Reduced FA block size from 128 to 64 to improve DMA prefetch (128 is too big for most models)
  • Improved use of vmpyacc intrinsics in dot products

Some quick perf numbers on S25+ (Gen4) with Llama3.2-3B-Q4_0 and FA on Hexagon

Before:
common_perf_print: prompt eval time =    2711.51 ms /   205 tokens (   13.23 ms per token,    75.60 tokens per second)
common_perf_print:        eval time =    3934.46 ms /    63 runs   (   62.45 ms per token,    16.01 tokens per second)

After:
common_perf_print: prompt eval time =    2538.64 ms /   205 tokens (   12.38 ms per token,    80.75 tokens per second)
common_perf_print:        eval time =    3586.00 ms /    63 runs   (   56.92 ms per token,    17.57 tokens per second)

Original notes from #19780

Dot Product Function Improvements

Replaced the previous hvx_dot_f16_f16_aa_rx2 function with new, more parallelized hvx_dot_f16_f16_aa_rx4 and
hvx_dot_f16_f16_aa_rx32 functions in flash-attn-ops.c, allowing computation of 4 and 32 dot products at a time,
respectively. This increases throughput and simplifies the code by leveraging vectorization.
Updated the main attention kernel (flash_attn_ext_f16_thread) to use the new hvx_dot_f16_f16_aa_rx32 function,
replacing the looped calls to the old function and removing the need for temporary arrays.

Vector Reduction Utilities

Added hvx_vec_reduce_sum_f32x4 utility in hvx-reduce.h for both HVX architectures, enabling efficient reduction
of four HVX vector results into a single vector. This supports the new parallel dot product functions.

chraac and others added 19 commits March 4, 2026 11:37
@max-krasnyansky max-krasnyansky requested a review from lhez as a code owner March 4, 2026 23:21
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 5, 2026
@max-krasnyansky max-krasnyansky merged commit 7a99dc8 into ggml-org:master Mar 5, 2026
78 checks passed
Diff under review:

-static inline HVX_Vector hvx_vec_splat_f16(float v) {
+static inline HVX_Vector hvx_vec_splat_f16(_Float16 v) {
Contributor:
nit: better to keep the same type (__fp16) as union below.

Member Author:

__fp16 can't be used as a function argument, would have to be a pointer.
_Float16 can. That's pretty much the only reason I used it.

Contributor:

From the gcc docs, there's a minor difference between __fp16 and _Float16 on some architectures, but both are okay in this case.

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
…atMul updates (ggml-org#20118)

* ggml-hexagon: enhance hvx_dot_f16_f16_aa_rx4 for improved performance by expanding vector handling and optimizing accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx4 and enhance hvx_vec_reduce_sum_f32x4 for improved performance and reduced complexity

* ggml-hexagon: add hvx_dot_f16_f16_aa_rx32 for enhanced vector processing in flash attention

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* optimize hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 by removing unused scale parameter and improving vector accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: refactor hvx_dot_f16_f16_aa_rx4 for improved readability and return HVX_Vector for better integration

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: initialize sums variable in hvx_dot_f16_f16_aa_rx32 for clarity

* ggml-hexagon: fix compiling error

* fix hvx_dot_f16_f16_aa_rx4 to handle leftover elements correctly using masking

* refactor hvx_dot_f16_f16_aa_rx4 to accept vector and leftover element counts as parameters for improved clarity and flexibility

* wip

* fa: instrumentation and dma reordering

* hex-fa: use block-size 64 to improve DMA pipelining

* hex-fa: optimize vec-dot for v79 and above

* hex-fa: use block size 64

* hex-fa: avoid scalar fp32->fp16 conversions

* hex-fa: simplify dot_f16 functions using optimized vec_mpyacc

* hex-fa: rewrite mad_f32_f16 using hvx_vec_mpyacc

* hex-mm: use mpyacc in matmul dot functions

---------

Co-authored-by: chraac <[email protected]>
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026