[Question]: Potential chance for better attention kernel #140

@GuoYiFantastic

Description

Describe the issue

Thank you again for your work.

for head in range(query_states.size(1)):

I think this loop executes the kernel for each head sequentially, which has two drawbacks:

  1. a single head is unlikely to saturate the GPU's compute resources
  2. warps from different heads lose the chance to overlap, which is important for GPU SIMT execution

Have you thought about improving the kernel to process the heads in parallel? The head loop may be acceptable when the context length is extremely long, but the speedup is limited when the sequence is not that long (e.g., 60k tokens), since the baseline FlashAttention computes all heads in parallel.
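To illustrate the difference, here is a minimal sketch of the two strategies using PyTorch's `scaled_dot_product_attention` as a stand-in for the project's custom kernel (the shapes and the `(batch, heads, seq, dim)` layout are assumptions for the example, not taken from the repository's code):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes; the real kernel's tensors may differ.
batch, n_heads, seq_len, head_dim = 1, 4, 128, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

# Sequential variant: one attention call per head, as in the quoted loop.
out_seq = torch.empty_like(q)
for head in range(q.size(1)):
    out_seq[:, head] = F.scaled_dot_product_attention(
        q[:, head], k[:, head], v[:, head]
    )

# Parallel variant: a single call over all heads; the head dimension acts
# as an extra batch dimension, so work from different heads can overlap
# on the GPU instead of being serialized across launches.
out_par = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_seq, out_par, atol=1e-5))
```

Both variants compute the same result; the point is only that the parallel form exposes the head dimension to the kernel's launch grid instead of hiding it behind a Python loop.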

Metadata

Labels

question: Further information is requested
