[Question]: Potential chance for better attention kernel #140

@GuoYiFantastic

Description

Describe the issue

Thank you again for your work.

for head in range(query_states.size(1)):

I think this loop executes the kernel for each head sequentially, which has two drawbacks:

  1. a single head is unlikely to saturate the GPU's compute resources
  2. warps from different heads lose the chance to overlap, which is important for GPU SIMT execution

Have you thought about improving the kernel to process the heads in parallel? The head loop may be acceptable when the context length is extremely long, but the speedup is limited when the sequence is not that long (e.g., 60k tokens), since the baseline FlashAttention computes all heads in parallel.
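To illustrate the difference, here is a minimal sketch of the two strategies using PyTorch's `scaled_dot_product_attention` as a stand-in for the project's custom kernel (the shapes and the `(batch, heads, seq, dim)` layout are assumptions for the example, not taken from the repository's code):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes; the real kernel's tensors may differ.
batch, n_heads, seq_len, head_dim = 1, 4, 128, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

# Sequential variant: one attention call per head, as in the quoted loop.
out_seq = torch.empty_like(q)
for head in range(q.size(1)):
    out_seq[:, head] = F.scaled_dot_product_attention(
        q[:, head], k[:, head], v[:, head]
    )

# Parallel variant: a single call over all heads; the head dimension acts
# as an extra batch dimension, so work from different heads can overlap
# on the GPU instead of being serialized across launches.
out_par = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_seq, out_par, atol=1e-5))
```

Both variants compute the same result; the point is only that the parallel form exposes the head dimension to the kernel's launch grid instead of hiding it behind a Python loop.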

Metadata

Labels

question: Further information is requested
