[Question]: Potential chance for better attention kernel #140
Open
Labels: question (Further information is requested)
Description
Describe the issue
Thank you again for your work.
The per-head loop `for head in range(query_states.size(1)):` suggests the kernel for each head is executed sequentially:
- a single head is unlikely to saturate the GPU
- warps from different heads lose the chance to overlap, which matters for GPU SIMT execution
Have you considered improving the kernel to parallelize across heads? The head loop may be acceptable when the context length is extremely long, but the speedup is limited for sequences that are not super long (e.g., 60k tokens), since the baseline FlashAttention computes heads in parallel.
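To make the point concrete, here is a minimal NumPy sketch (shapes and names are illustrative assumptions, not the project's actual kernel interface) contrasting the per-head host-side loop with a single batched computation over the head axis. Both produce identical results; the difference is that in the batched form the head dimension becomes free parallelism for the hardware instead of a sequential loop:

```python
import numpy as np

# Hypothetical shapes (batch, heads, seq, dim) -- illustrative only.
rng = np.random.default_rng(0)
B, H, S, D = 1, 4, 8, 16
q = rng.standard_normal((B, H, S, D))
k = rng.standard_normal((B, H, S, D))
v = rng.standard_normal((B, H, S, D))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Per-head loop (mirrors the `for head in range(...)` pattern):
out_loop = np.empty_like(q)
for h in range(H):
    scores = q[:, h] @ k[:, h].transpose(0, 2, 1) / np.sqrt(D)
    out_loop[:, h] = softmax(scores) @ v[:, h]

# Batched over heads in one shot -- the head axis is just another
# parallel dimension for the underlying kernels:
scores = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(D)
out_batched = np.einsum("bhqk,bhkd->bhqd", softmax(scores), v)

print(np.allclose(out_loop, out_batched))
```

The same idea carries over to a GPU kernel: launching one grid that covers all heads (e.g., head index as a block dimension) lets warps from different heads overlap, rather than issuing H kernel launches back to back.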