Support for 256 head dim #67

@Sanger2000

Description

Really love this repo, I've been using it to finetune CodeGen models with >2k context windows.

It's way faster than Hugging Face (about 3x) and slightly faster than Megatron for the 350M and 2.7B parameter CodeGen models, but it doesn't work for the 6.1B and 16B parameter models because they have a head dimension of 256.

[Screenshot: Screen Shot 2022-11-01 at 5 32 47 PM]
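
In the meantime, here's roughly the stopgap I've been considering on my end: compute the per-head dimension from the model config and route models whose head dim exceeds the supported limit (128 is my assumption about what the kernel currently handles) through a plain PyTorch attention instead of the flash kernel. This is just a minimal sketch; the CodeGen hidden-size/head counts in the comments are my best understanding, not taken from this repo.

```python
import math
import torch

# Assumed maximum head dimension the flash kernel supports today;
# 128 is my understanding, not something pulled from this repo.
MAX_FLASH_HEAD_DIM = 128

def reference_attention(q, k, v, causal=True):
    """Plain PyTorch attention fallback. q, k, v: (batch, heads, seqlen, head_dim)."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    if causal:
        seqlen = q.shape[-2]
        mask = torch.triu(
            torch.ones(seqlen, seqlen, dtype=torch.bool, device=q.device), diagonal=1
        )
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.matmul(torch.softmax(scores, dim=-1), v)

def use_flash(hidden_size, num_heads):
    """Decide per-model whether the flash kernel can be used at all."""
    return hidden_size // num_heads <= MAX_FLASH_HEAD_DIM

# Rough CodeGen configs (hidden size, heads) as I understand them; the 6.1B
# and 16B variants land at head dim 256, which is what breaks for me.
for name, (h, n) in {
    "350M": (1024, 16),   # head dim 64
    "2.7B": (2560, 32),   # head dim 80
    "6.1B": (4096, 16),   # head dim 256
    "16B":  (6144, 24),   # head dim 256
}.items():
    print(name, h // n, "flash ok" if use_flash(h, n) else "fallback")

# Smoke test: head dim 256 is fine in the plain fallback, just slow.
q = k = v = torch.randn(1, 16, 8, 256)
print(reference_attention(q, k, v).shape)  # torch.Size([1, 16, 8, 256])
```

Obviously the real ask is native support for head dim 256 so the 6.1B/16B models get the same speedup.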

I would imagine CodeGen finetuning will be a solid use case for flash attention, since coding models really benefit from long context windows, and CodeGen is basically SOTA for code generation (competitive with Codex).

Is supporting a head dimension of 256 even possible with flash attention?
