[megatron] feat: use flash as default attention_backend #3578
Conversation
Code Review
This pull request updates the default attention backend to 'flash' for consistency and introduces a utility function to map string values for attention_backend to the corresponding Megatron enum. The changes are applied across several configuration files and in the worker initialization logic. My review focuses on improving the implementation of the new utility function for better readability and conciseness. Overall, the changes align with the stated goals of the PR.
Will switching to flash as the default backend potentially break some training recipes? Also, does it bring any performance/efficiency regressions or gains?
@ccclyu There is no performance difference compared with the previous default. I am still trying to find out the root cause and whether the crash is fixed with newer versions of Transformer Engine (TE 2.7).
### What does this PR do?

1. Add a mapping from string values to Megatron's enum for the `attention_backend` choice.
2. Use flash attention as the default `attention_backend`, for consistency with FSDP.
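The string-to-enum mapping described above could look roughly like the sketch below. The enum here is a self-contained stand-in for Megatron's attention-backend enum (the real one lives in Megatron-Core; the member names and the helper `get_attn_backend` are assumptions for illustration, not the PR's actual code):

```python
from enum import Enum


class AttnBackend(Enum):
    # Stand-in for Megatron's attention-backend enum; member names are
    # illustrative assumptions, not taken from the PR diff.
    flash = 1
    fused = 2
    unfused = 3
    local = 4
    auto = 5


def get_attn_backend(name: str = "flash") -> AttnBackend:
    """Map a config string (e.g. from a YAML file) to the enum member.

    Defaults to 'flash', mirroring the PR's new default.
    """
    try:
        return AttnBackend[name]
    except KeyError:
        valid = ", ".join(m.name for m in AttnBackend)
        raise ValueError(
            f"unknown attention_backend {name!r}; expected one of: {valid}"
        ) from None
```

Looking up by `AttnBackend[name]` keeps the config strings and enum member names in lockstep, so adding a new backend to the enum automatically makes it a valid config value.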