Really love this repo. I've been using it to fine-tune CodeGen models with >2k context windows.
It's much faster than Hugging Face (about 3x) and slightly faster than Megatron for the 350M and 2.7B parameter CodeGen models, but it doesn't work for the 6.1B and 16B parameter models because they have a head dimension of 256.
I would imagine CodeGen fine-tuning will be a solid use case for FlashAttention, since coding models really benefit from long context windows, and CodeGen is basically SOTA for code generation (competitive with Codex).
Is supporting a head dimension of 256 something that is even possible with FlashAttention?
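
For context, here is a minimal sketch of how I'm checking which CodeGen checkpoints hit the limit. It assumes the kernel's head-dimension cap is 128 (my understanding of the current constraint, not confirmed by the maintainers) and uses the public Hugging Face Hub model IDs; it just reads each config and computes the per-head dimension.

```python
# Sketch: check which CodeGen checkpoints exceed an assumed FlashAttention
# head-dimension limit of 128. Requires `transformers` and network access.
from transformers import AutoConfig

MAX_FLASH_HEAD_DIM = 128  # assumed kernel limit for this discussion

CHECKPOINTS = [
    "Salesforce/codegen-350M-mono",
    "Salesforce/codegen-2B-mono",
    "Salesforce/codegen-6B-mono",
    "Salesforce/codegen-16B-mono",
]

for name in CHECKPOINTS:
    cfg = AutoConfig.from_pretrained(name)
    # CodeGen configs use GPT-style names (n_embd / n_head); fall back to the
    # generic attribute names just in case.
    hidden = getattr(cfg, "n_embd", getattr(cfg, "hidden_size", None))
    heads = getattr(cfg, "n_head", getattr(cfg, "num_attention_heads", None))
    head_dim = hidden // heads
    status = "ok" if head_dim <= MAX_FLASH_HEAD_DIM else "head dim too large"
    print(f"{name}: head_dim={head_dim} -> {status}")
```

The 350M and 2.7B checkpoints come out under the limit, while the 6.1B and 16B ones report a head dimension of 256, which matches the failures I'm seeing.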
