[WIP] ggml-hexagon: convert f32 to f16 - fa opt part4 #19780
Closed
chraac wants to merge 11 commits into ggml-org:master from
Conversation
Commits:
- … by expanding vector handling and optimizing accumulation
- …uce_sum_f32x4 for improved performance and reduced complexity
- …ing in flash attention
- …ng unused scale parameter and improving vector accumulation
- …y and return HVX_Vector for better integration
- … counts as parameters for improved clarity and flexibility
chraac commented Feb 21, 2026
```c
for (uint32_t j = 0; j < VLEN_FP32; j += 4) {
    HVX_Vector     sums_x4 = hvx_dot_f16_f16_aa_rx4(y, x, stride_x, nvec, nloe);
    HVX_VectorPred pred    = Q6_Q_vsetq_R(j * SIZEOF_FP32);
    sums = Q6_V_vmux_QVV(pred, sums, sums_x4);
}
```
chraac (Contributor, Author):
Key Improvement: Instead of using hvx_vec_store_u, we now use vmux to directly insert the 4 floats into the result vector.
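A plain-C scalar model (not the actual HVX implementation) of what this trick does: `Q6_Q_vsetq_R(j * SIZEOF_FP32)` builds a predicate that is true for byte lanes below `j * 4`, and `Q6_V_vmux_QVV` keeps the existing accumulator in those lanes while taking the freshly computed sums elsewhere, so each iteration lands its 4 new floats in-register without a store/load round trip. Lane count and helper name below are illustrative.

```c
#include <assert.h>

#define NLANES 32 /* fp32 lanes in a 128-byte HVX vector */

/* Scalar model of: sums = Q6_V_vmux_QVV(Q6_Q_vsetq_R(j * 4), sums, sums_x4)
 * Lanes below j keep the previously inserted results (predicate true);
 * lanes at or above j take values from the new result vector. */
static void vmux_insert(float sums[NLANES], const float sums_x4[NLANES], int j) {
    for (int lane = 0; lane < NLANES; ++lane) {
        if (lane >= j) {
            sums[lane] = sums_x4[lane]; /* predicate false: take new value */
        }
    }
}
```

Iterating `j = 0, 4, 8, ...` as in the loop above, each pass only "commits" 4 more lanes, since later passes overwrite everything at or above the next `j`.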
Member:
I got some more updates on top for better DMA pipelining and some optimizations for Hexagon v79 and up.
Member:
Included in #20118
Key Changes
Dot Product Function Improvements
Replaced the hvx_dot_f16_f16_aa_rx2 function with the new, more parallelized hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 functions in flash-attn-ops.c, allowing computation of 4 and 32 dot products at a time, respectively. This increases throughput and simplifies the code by leveraging vectorization.
Updated the flash attention kernel (flash_attn_ext_f16_thread) to use the new hvx_dot_f16_f16_aa_rx32 function, replacing the looped calls to the old function and removing the need for temporary arrays.
Vector Reduction Utilities
Added the hvx_vec_reduce_sum_f32x4 utility in hvx-reduce.h for both HVX architectures, enabling efficient reduction of four HVX vector results into a single vector. This supports the new parallel dot product functions.
Performance
// TODO
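The batching described under Key Changes can be sketched in plain scalar C (a model, not the HVX implementation; the helper name and the use of float instead of f16 are illustrative): one call produces four dot products of a shared vector y against four rows of x, which is the shape of work that hvx_dot_f16_f16_aa_rx4 vectorizes and hvx_vec_reduce_sum_f32x4 then collapses into one result vector.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of an rx4-style batched dot product: four rows of x,
 * spaced stride_x floats apart, are each dotted with the same y.
 * The HVX version keeps the four partial sums in vector lanes and
 * reduces them with hvx_vec_reduce_sum_f32x4. */
static void dot_rx4_scalar(const float *y, const float *x, size_t stride_x,
                           size_t n, float out[4]) {
    for (int r = 0; r < 4; ++r) {
        const float *row = x + (size_t)r * stride_x;
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            sum += y[i] * row[i]; /* multiply-accumulate along the row */
        }
        out[r] = sum; /* one output lane per row */
    }
}
```

Computing four (or thirty-two) products per call amortizes the per-call reduction and removes the temporary arrays the old per-row rx2 path needed.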