[WIP] ggml-hexagon: convert f32 to f16 - fa opt part4 #19780

Closed
chraac wants to merge 11 commits into ggml-org:master from chraac:dev-fa-opt-part4

Conversation

@chraac (Contributor) commented on Feb 21, 2026

Key Changes

Dot Product Function Improvements

  • Replaced the previous hvx_dot_f16_f16_aa_rx2 function with new, more parallelized hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 functions in flash-attn-ops.c, allowing computation of 4 and 32 dot products at a time, respectively. This increases throughput and simplifies the code by leveraging vectorization.
  • Updated the main attention kernel (flash_attn_ext_f16_thread) to use the new hvx_dot_f16_f16_aa_rx32 function, replacing the looped calls to the old function and removing the need for temporary arrays.
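To make the batching concrete, here is a scalar reference model (illustrative only, written for this summary, not taken from the PR) of what an `hvx_dot_f16_f16_aa_rxN` kernel computes: N dot products between one row `y` and N consecutive rows of `x` spaced `stride_x` elements apart. The real kernels operate on aligned f16 HVX vectors; plain `float` is used here to show the semantics.

```c
#include <stddef.h>

/* Hypothetical scalar model of hvx_dot_f16_f16_aa_rxN: compute `rows`
 * dot products of y against consecutive rows of x. The HVX versions
 * (rx4, rx32) do the same for 4 and 32 rows at a time with f16 vectors. */
static void dot_rxN_ref(const float *y, const float *x, size_t stride_x,
                        size_t n, size_t rows, float *out) {
    for (size_t r = 0; r < rows; ++r) {      /* one output per row */
        const float *row = x + r * stride_x; /* rows are stride_x apart */
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            acc += y[i] * row[i];
        }
        out[r] = acc;
    }
}
```

Replacing a loop of rx4 calls with one rx32 call amortizes loop overhead and removes the temporary arrays that previously collected partial results.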

Vector Reduction Utilities

  • Added hvx_vec_reduce_sum_f32x4 utility in hvx-reduce.h for both HVX architectures, enabling efficient reduction of four HVX vector results into a single vector. This supports the new parallel dot product functions. [1] [2]
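In scalar terms (a sketch written for this summary, not the PR's HVX code), the reduction takes four accumulator vectors and collapses each to a single total, yielding one result per lane group; the real `hvx_vec_reduce_sum_f32x4` returns the four totals packed into one `HVX_Vector`.

```c
#include <stddef.h>

/* Hypothetical scalar model of hvx_vec_reduce_sum_f32x4: `acc` holds four
 * accumulator vectors of nlanes floats each, laid out back to back; each is
 * summed to one scalar, giving the four values the HVX helper packs into a
 * single output vector. */
static void reduce_sum_f32x4_ref(const float *acc, size_t nlanes, float out[4]) {
    for (size_t v = 0; v < 4; ++v) {
        float s = 0.0f;
        for (size_t i = 0; i < nlanes; ++i) {
            s += acc[v * nlanes + i];
        }
        out[v] = s;
    }
}
```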

Performance

// TODO

@chraac chraac marked this pull request as draft February 21, 2026 15:13
for (uint32_t j = 0; j < VLEN_FP32; j += 4) {
    HVX_Vector sums_x4 = hvx_dot_f16_f16_aa_rx4(y, x, stride_x, nvec, nloe);
    HVX_VectorPred pred = Q6_Q_vsetq_R(j * SIZEOF_FP32);
    sums = Q6_V_vmux_QVV(pred, sums, sums_x4);
}
@chraac (Contributor, Author) commented:
Key improvement: instead of writing the partial results out with hvx_vec_store_u, we now use vmux to insert the 4 floats directly into the result vector.
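The predicated insert in the snippet above can be modeled in scalar code as follows (an illustrative sketch of the standard HVX semantics, not code from the PR): `Q6_Q_vsetq_R` sets the predicate for byte positions below its argument, and `Q6_V_vmux_QVV` keeps lanes of the first source where the predicate is true and takes the second source elsewhere. Each iteration therefore lands 4 fresh floats at lanes j..j+3 without an unaligned store.

```c
#include <stddef.h>

/* Scalar model of the vsetq + vmux pattern: the predicate is true for
 * lanes < j, so those already-filled lanes of `sums` are kept and the
 * remaining lanes are taken from `sums_x4`. Lanes beyond j+3 are
 * overwritten again on the next iteration. */
static void vmux_insert_ref(float *sums, const float *sums_x4,
                            size_t j, size_t nlanes) {
    for (size_t lane = 0; lane < nlanes; ++lane) {
        if (lane >= j) {              /* predicate false: take new value */
            sums[lane] = sums_x4[lane];
        }                             /* predicate true: keep old lane */
    }
}
```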

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 21, 2026
@max-krasnyansky (Member) commented:

I got some more updates on top for better DMA pipelining and some optimizations for Hexagon v79 and up.
Need to test a bit more, will try to push and share tomorrow.

@max-krasnyansky (Member) commented:

@chraac
Sorry for the delay. I was going to just send you a couple of small changes to add to this PR but ended up doing a bunch of things on top to further clean up and optimize. Might as well do a new PR that includes all your changes as well.
#20118

@max-krasnyansky (Member) commented:

Included in #20118
