[megatron] feat: use flash as default attention_backend #3578
Conversation
Code Review
This pull request updates the default attention backend to 'flash' for consistency and introduces a utility function to map string values for attention_backend to the corresponding Megatron enum. The changes are applied across several configuration files and in the worker initialization logic. My review focuses on improving the implementation of the new utility function for better readability and conciseness. Overall, the changes align with the stated goals of the PR.
Will switching to flash as the default backend potentially break some training recipes? Also, does it bring any performance/efficiency regressions or gains?
@ccclyu There is no performance difference compared with the previous default. I am still trying to find out the root cause and whether the crash is fixed with newer versions of Transformer Engine (TE 2.7).
### What does this PR do?

1. Add a mapping from string values to Megatron's enum for the `attention_backend` choice.
2. Use flash attention as the default `attention_backend`, for consistency with FSDP.
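The string-to-enum mapping described above could look roughly like the sketch below. The enum here is a self-contained stand-in for Megatron's attention-backend enum (the real one lives in Megatron-Core; the member names and the helper `get_attn_backend` are assumptions for illustration, not the PR's actual code):

```python
from enum import Enum


class AttnBackend(Enum):
    # Stand-in for Megatron's attention-backend enum; member names are
    # illustrative assumptions, not taken from the PR diff.
    flash = 1
    fused = 2
    unfused = 3
    local = 4
    auto = 5


def get_attn_backend(name: str = "flash") -> AttnBackend:
    """Map a config string (e.g. from a YAML file) to the enum member.

    Defaults to 'flash', mirroring the PR's new default.
    """
    try:
        return AttnBackend[name]
    except KeyError:
        valid = ", ".join(m.name for m in AttnBackend)
        raise ValueError(
            f"unknown attention_backend {name!r}; expected one of: {valid}"
        ) from None
```

Looking up by `AttnBackend[name]` keeps the config strings and enum member names in lockstep, so adding a new backend to the enum automatically makes it a valid config value.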