Support for 256 head dim #67

@Sanger2000

Description

Really love this repo, I've been using it to finetune CodeGen models with >2k context windows.

It's way faster than Hugging Face (about 3x) and slightly faster than Megatron for the 350M and 2.7B parameter CodeGen models, but it doesn't work for the 6.1B and 16B parameter models because they have a head dimension of 256.

[Screenshot: Screen Shot 2022-11-01 at 5 32 47 PM]
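
In the meantime, here's roughly the stopgap I've been considering on my end: compute the per-head dimension from the model config and route models whose head dim exceeds the supported limit (128 is my assumption about what the kernel currently handles) through a plain PyTorch attention instead of the flash kernel. This is just a minimal sketch; the CodeGen hidden-size/head counts in the comments are my best understanding, not taken from this repo.

```python
import math
import torch

# Assumed maximum head dimension the flash kernel supports today;
# 128 is my understanding, not something pulled from this repo.
MAX_FLASH_HEAD_DIM = 128

def reference_attention(q, k, v, causal=True):
    """Plain PyTorch attention fallback. q, k, v: (batch, heads, seqlen, head_dim)."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    if causal:
        seqlen = q.shape[-2]
        mask = torch.triu(
            torch.ones(seqlen, seqlen, dtype=torch.bool, device=q.device), diagonal=1
        )
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.matmul(torch.softmax(scores, dim=-1), v)

def use_flash(hidden_size, num_heads):
    """Decide per-model whether the flash kernel can be used at all."""
    return hidden_size // num_heads <= MAX_FLASH_HEAD_DIM

# Rough CodeGen configs (hidden size, heads) as I understand them; the 6.1B
# and 16B variants land at head dim 256, which is what breaks for me.
for name, (h, n) in {
    "350M": (1024, 16),   # head dim 64
    "2.7B": (2560, 32),   # head dim 80
    "6.1B": (4096, 16),   # head dim 256
    "16B":  (6144, 24),   # head dim 256
}.items():
    print(name, h // n, "flash ok" if use_flash(h, n) else "fallback")

# Smoke test: head dim 256 is fine in the plain fallback, just slow.
q = k = v = torch.randn(1, 16, 8, 256)
print(reference_attention(q, k, v).shape)  # torch.Size([1, 16, 8, 256])
```

Obviously the real ask is native support for head dim 256 so the 6.1B/16B models get the same speedup.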

I would imagine CodeGen finetuning will be a solid use case for flash attention, since coding models really benefit from long context windows, and CodeGen is basically SOTA for code generation (competitive with Codex).

Is supporting a head dimension of 256 even possible with flash attention?
