[WIP] ggml-hexagon: convert f32 to f16 - fa opt part4 #19780

Closed
chraac wants to merge 11 commits into ggml-org:master from chraac:dev-fa-opt-part4

Conversation

@chraac (Contributor) commented on Feb 21, 2026

Key Changes

Dot Product Function Improvements

  • Replaced the previous hvx_dot_f16_f16_aa_rx2 function with new, more parallelized hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 functions in flash-attn-ops.c, allowing computation of 4 and 32 dot products at a time, respectively. This increases throughput and simplifies the code by leveraging vectorization.
  • Updated the main attention kernel (flash_attn_ext_f16_thread) to use the new hvx_dot_f16_f16_aa_rx32 function, replacing the looped calls to the old function and removing the need for temporary arrays.
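To make the batching concrete, here is a scalar reference model (illustrative only, written for this summary, not taken from the PR) of what an `hvx_dot_f16_f16_aa_rxN` kernel computes: N dot products between one row `y` and N consecutive rows of `x` spaced `stride_x` elements apart. The real kernels operate on aligned f16 HVX vectors; plain `float` is used here to show the semantics.

```c
#include <stddef.h>

/* Hypothetical scalar model of hvx_dot_f16_f16_aa_rxN: compute `rows`
 * dot products of y against consecutive rows of x. The HVX versions
 * (rx4, rx32) do the same for 4 and 32 rows at a time with f16 vectors. */
static void dot_rxN_ref(const float *y, const float *x, size_t stride_x,
                        size_t n, size_t rows, float *out) {
    for (size_t r = 0; r < rows; ++r) {      /* one output per row */
        const float *row = x + r * stride_x; /* rows are stride_x apart */
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            acc += y[i] * row[i];
        }
        out[r] = acc;
    }
}
```

Replacing a loop of rx4 calls with one rx32 call amortizes loop overhead and removes the temporary arrays that previously collected partial results.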

Vector Reduction Utilities

  • Added hvx_vec_reduce_sum_f32x4 utility in hvx-reduce.h for both HVX architectures, enabling efficient reduction of four HVX vector results into a single vector. This supports the new parallel dot product functions. [1] [2]
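In scalar terms (a sketch written for this summary, not the PR's HVX code), the reduction takes four accumulator vectors and collapses each to a single total, yielding one result per lane group; the real `hvx_vec_reduce_sum_f32x4` returns the four totals packed into one `HVX_Vector`.

```c
#include <stddef.h>

/* Hypothetical scalar model of hvx_vec_reduce_sum_f32x4: `acc` holds four
 * accumulator vectors of nlanes floats each, laid out back to back; each is
 * summed to one scalar, giving the four values the HVX helper packs into a
 * single output vector. */
static void reduce_sum_f32x4_ref(const float *acc, size_t nlanes, float out[4]) {
    for (size_t v = 0; v < 4; ++v) {
        float s = 0.0f;
        for (size_t i = 0; i < nlanes; ++i) {
            s += acc[v * nlanes + i];
        }
        out[v] = s;
    }
}
```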

Performance

// TODO

@chraac chraac marked this pull request as draft February 21, 2026 15:13
for (uint32_t j = 0; j < VLEN_FP32; j += 4) {
    HVX_Vector sums_x4 = hvx_dot_f16_f16_aa_rx4(y, x, stride_x, nvec, nloe);
    HVX_VectorPred pred = Q6_Q_vsetq_R(j * SIZEOF_FP32);
    sums = Q6_V_vmux_QVV(pred, sums, sums_x4);
}
@chraac (Contributor, Author) commented:
Key improvement: instead of writing the partial results out with hvx_vec_store_u, we now use vmux to insert the 4 floats directly into the result vector.
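The predicated insert in the snippet above can be modeled in scalar code as follows (an illustrative sketch of the standard HVX semantics, not code from the PR): `Q6_Q_vsetq_R` sets the predicate for byte positions below its argument, and `Q6_V_vmux_QVV` keeps lanes of the first source where the predicate is true and takes the second source elsewhere. Each iteration therefore lands 4 fresh floats at lanes j..j+3 without an unaligned store.

```c
#include <stddef.h>

/* Scalar model of the vsetq + vmux pattern: the predicate is true for
 * lanes < j, so those already-filled lanes of `sums` are kept and the
 * remaining lanes are taken from `sums_x4`. Lanes beyond j+3 are
 * overwritten again on the next iteration. */
static void vmux_insert_ref(float *sums, const float *sums_x4,
                            size_t j, size_t nlanes) {
    for (size_t lane = 0; lane < nlanes; ++lane) {
        if (lane >= j) {              /* predicate false: take new value */
            sums[lane] = sums_x4[lane];
        }                             /* predicate true: keep old lane */
    }
}
```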

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 21, 2026
@max-krasnyansky (Member) commented:

I got some more updates on top for better DMA pipelining and some optimizations for Hexagon v79 and up.
Need to test a bit more, will try to push and share tomorrow.

@max-krasnyansky (Member) commented:

@chraac
Sorry for the delay. I was going to just send you a couple of small changes to add to this PR but ended up doing a bunch of things on top to further clean up and optimize. Might as well do a new PR that includes all your changes as well.
#20118

@max-krasnyansky (Member) commented:

Included in #20118
