Conversation

@ISEEKYAN (Collaborator)

What does this PR do?

  1. Add a mapping from string to Megatron's enum for the attention_backend choice.
  2. Use flash attention as the default attention_backend, for consistency with FSDP.
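
For illustration, a minimal sketch of what the mapping in item 1 could look like, assuming Megatron-core's `AttnBackend` enum (`megatron.core.transformer.enums`); the helper name `get_attention_backend` is hypothetical and the exact placement in verl may differ:

```python
# Hypothetical helper: map a config string such as "flash" or "fused" to
# Megatron-core's AttnBackend enum. Name and placement are illustrative only.
from megatron.core.transformer.enums import AttnBackend


def get_attention_backend(value: str | AttnBackend) -> AttnBackend:
    """Resolve 'flash', 'fused', 'unfused', 'local', or 'auto' to AttnBackend."""
    if isinstance(value, AttnBackend):
        return value
    try:
        # Enum members are named after the backend strings, so an index lookup works.
        return AttnBackend[value]
    except KeyError as err:
        valid = ", ".join(member.name for member in AttnBackend)
        raise ValueError(f"Unknown attention_backend '{value}'; expected one of: {valid}") from err
```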

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request updates the default attention backend to 'flash' for consistency and introduces a utility function to map string values for attention_backend to the corresponding Megatron enum. The changes are applied across several configuration files and in the worker initialization logic. My review focuses on improving the implementation of the new utility function for better readability and conciseness. Overall, the changes align with the stated goals of the PR.

@ccclyu (Collaborator) left a comment

Will switching to flash as the default backend potentially break any existing training recipes? Also, does it bring any performance/efficiency regression or boost?

@ISEEKYAN (Collaborator, Author)

> Will switching to flash as the default backend potentially break any existing training recipes? Also, does it bring any performance/efficiency regression or boost?

@ccclyu There is no performance difference compared with the previous default, TE fused_attn. But in some cases (DAPO with 20k response length on Qwen3 14B/30B) I found that fused_attn made training crash after a long run (32×H20, 7 days), while no crash occurred when flash_attn was used, with TE 2.2.

I am still trying to find the root cause and to check whether the crash is fixed in newer versions of Transformer Engine (TE 2.7).
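
For anyone who wants to pin the backend explicitly while this is investigated (for example to reproduce the previous fused behavior), a hedged usage sketch; that verl forwards `attention_backend` to Megatron-core's `TransformerConfig` exactly this way is an assumption:

```python
# Hedged sketch: select the attention backend explicitly rather than relying
# on the default. TransformerConfig.attention_backend exists in recent
# Megatron-core releases; the minimal constructor arguments are illustrative.
from megatron.core.transformer.enums import AttnBackend
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=2,
    hidden_size=64,
    num_attention_heads=4,
    attention_backend=AttnBackend.flash,  # AttnBackend.fused reproduces the old default
)
```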

@vermouth1992 merged commit f1d212c into volcengine:main on Sep 24, 2025
65 of 69 checks passed
masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request Oct 19, 2025
techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
mtian8 pushed a commit to mtian8/verl that referenced this pull request Nov 1, 2025
wangboxiong320 pushed a commit to wangboxiong320/verl that referenced this pull request Nov 1, 2025