Add scale_diag_mask_inf_softmax operation for transformer attention #15738
Arya-Hari wants to merge 1 commit into ggml-org:master
Conversation
This is just a combination of the 3 existing ops, right? Seems like we could handle this with fusion and not need a new op.
@jeffbolznv Operator fusion was what I was going for, really, but I got confused along the way. Any suggestions on how to implement this with operator fusion? I'm new to this kind of work and don't have much experience, so any advice would be helpful. Thanks!
Which backend(s) do you want to do it in? If you search for "can_fuse" you can find some examples of fusion in several backends. What kind of models have this pattern? I haven't seen it before.
@jeffbolznv I'm actually trying to do the same thing as was done here, and then replicate it for the Vulkan backend to run on an Adreno 750 GPU on an Android phone.
Does the Vulkan backend currently run on your phone? We've had mixed reports in the past (running into various driver bugs, it seems). The Vulkan backend does have some support for fusion. To implement this, you'd probably need to add code to the soft max shader to conditionally apply the scale+diag_mask, and add the host side logic to select that shader. Is this combination of ops still interesting now that flash attention is broadly supported and is the default?
@jeffbolznv The Vulkan backend does run on the phone, although the metrics aren't great: it has slower decode rates than the CPU alone for most LLMs, which is why I've been trying to improve it. OpenCL also works, better than Vulkan. As for flash attention, is it supported on Qualcomm architectures? Would something like this be possible to test on a mobile Adreno 750?
Flash attention ought to work on all devices. Please try