
Conversation

@kandelak commented on May 16, 2025

There is a bug in the calculation of attention when fused_attn is set to false.

To reproduce it, set fused_attn to false: reconstruction quality becomes very poor, whereas with fused_attn set to true (the default) it works. With this change it works again (the non-fused path now basically reimplements what the fused attention does).

Possible reason: the model was trained using F.scaled_dot_product_attention, which is internally different from the "else branch", where the attention is computed in the non-efficient way.
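For illustration, here is a minimal sketch (not the exact diff from this PR) of a non-fused attention computation written to match F.scaled_dot_product_attention numerically; the tensor shapes and the manual_attention helper name are assumptions for the example:

```python
import math
import torch
import torch.nn.functional as F

def manual_attention(q, k, v):
    # Non-fused attention intended to mirror F.scaled_dot_product_attention:
    # scale scores by 1/sqrt(head_dim), softmax over the key dimension.
    scale = 1.0 / math.sqrt(q.size(-1))
    attn = (q @ k.transpose(-2, -1)) * scale
    attn = attn.softmax(dim=-1)
    return attn @ v

# Quick numerical check against the fused kernel (shapes are hypothetical).
q, k, v = (torch.randn(2, 8, 16, 64) for _ in range(3))  # (batch, heads, tokens, head_dim)
fused = F.scaled_dot_product_attention(q, k, v)
assert torch.allclose(manual_attention(q, k, v), fused, atol=1e-5)
```

A subtle mismatch in this path (e.g. a missing or misplaced scale factor) is enough to degrade reconstruction for a model trained with the fused kernel.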

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on May 16, 2025
@kandelak (Author) commented

As a reminder: this problem has also come up for other users (see #149, for instance), and this PR solves it.

