Add scale_diag_mask_inf_softmax operation for transformer attention #15738
Arya-Hari wants to merge 1 commit into ggml-org:master
Conversation
This is just a combination of the 3 existing ops, right? Seems like we could handle this with fusion and not need a new op.
@jeffbolznv Operator fusion was what I was going for, really, but I got confused along the way. Any suggestions on how to implement this with operator fusion? I'm new to this kind of work and don't have much experience, so any advice would be helpful. Thanks!
Which backend(s) do you want to do it in? If you search for "can_fuse" you can find some examples of fusion in several backends. What kind of models have this pattern? I haven't seen it before.
@jeffbolznv I'm actually trying to do the same thing as was done here, and then replicate it for the Vulkan backend to run on an Adreno 750 GPU on an Android phone.
Does the Vulkan backend currently run on your phone? We've had mixed reports in the past (running into various driver bugs, it seems). The Vulkan backend does have some support for fusion. To implement this, you'd probably need to add code to the soft max shader to conditionally apply the scale+diag_mask, and add the host side logic to select that shader. Is this combination of ops still interesting now that flash attention is broadly supported and is the default?
@jeffbolznv The Vulkan backend does run on the phone, although the metrics aren't great: it has slower decode rates than the CPU alone for most LLMs, which is why I've been trying to improve it. OpenCL also works, better than Vulkan. As for flash attention, is it supported on Qualcomm architectures? Would something like this be possible to test on a mobile Adreno 750?
Flash attention ought to work on all devices. Please try