
Conversation

@vasqu (Contributor) commented Jan 5, 2026

As per the title, we often have something along the lines of `config._attn_implementation == "flash_attention_2"`. This is not ideal and quite error-prone, as other FA flavors (e.g. kernels-based FA and FA3) are silently ignored. This PR simply changes these checks to the generalized version everywhere.
Additionally, it fixes a few incorrect flags that still used `_supports_flash_attn_2` (including within tests); relevant commit: a231a2b.

Note that sam2 video, sam3 tracker video, and edgetam video do not have any FA tests, but they only use a mask for target-guided attention, which is not used by default. We therefore add a fallback (similar to sam3) so that FA can still be used for the other attention blocks.

Note that I changed pixtral slightly more, since it was doing unnecessary movements at each attention layer. I'm not sure why the mask was handled properly but the position ids were not.
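
To illustrate the idea (a rough sketch, not the exact diff; the two function names below are hypothetical), the generalization swaps exact string comparisons for a substring match so that FA3 and kernels-based FA are covered too:

```python
def uses_fa_old(config) -> bool:
    # Before: only the exact "flash_attention_2" string is recognized,
    # so FA3 and kernels-based FA flavors silently fall through.
    return config._attn_implementation == "flash_attention_2"


def uses_fa_generalized(config) -> bool:
    # After: any Flash Attention flavor is recognized, e.g. "flash_attention_2",
    # "flash_attention_3", or kernels-based variants.
    return "flash" in config._attn_implementation
```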

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu changed the title from "[FA] Generalize fa config checks" to "[FA] Generalize fa config checks and fix flags" (Jan 6, 2026)
@vasqu (Contributor, Author) commented Jan 6, 2026

run-slow: pixtral

@vasqu marked this pull request as ready for review on January 6, 2026 at 19:10
@vasqu requested review from Cyrilvallez and removed the request for Rocketknight1 on January 6, 2026 at 19:11
@github-actions bot commented Jan 6, 2026

This comment contains run-slow, running the specified jobs:

models: ["models/pixtral"]
quantizations: []

@github-actions bot commented Jan 6, 2026

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@huggingface deleted a comment from the github-actions bot on Jan 7, 2026
@Cyrilvallez (Member) left a comment

Very much needed indeed, now that we have many flavors!
Even if it's probably alright to keep a small check like this, what do you think about introducing a small helper such as `is_flash_implementation` (probably in `import_utils.py`, alongside `is_tracing` etc.)? It would probably make this clearer and more maintainable as we keep adding flavors!

@Cyrilvallez (Member) commented
Sounds to me like it's quite important to have a unique entry point for these cases; otherwise we risk very surprising bugs/issues in just a few models simply because they miss some checks.

@vasqu (Contributor, Author) commented Jan 12, 2026

Yes, let me change the check to a separate fn; I'll probably go over more models then (the ones which already do the proper `"flash" in ...` check).

@vasqu requested a review from Cyrilvallez on January 13, 2026 at 16:55
@vasqu (Contributor, Author) commented Jan 13, 2026

Moved it to `utils.generic` (`import_utils` doesn't fit IMO), and added this fn to all models now. Should make this way more maintainable.
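
For reference, a minimal sketch of what such a helper could look like, based on the name and docstring quoted in the review excerpts below; the actual implementation in `utils.generic` may differ:

```python
def is_flash_attention_requested(config, requested_attention_implementation=None):
    """Sketch: return True if any Flash Attention flavor (FA2, FA3, kernels-based) is requested.

    Checks an explicitly passed implementation string first, otherwise falls back
    to the value stored on the config.
    """
    attn_implementation = (
        requested_attention_implementation
        if requested_attention_implementation is not None
        else getattr(config, "_attn_implementation", None)
    )
    return attn_implementation is not None and "flash" in attn_implementation
```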

@Cyrilvallez (Member) left a comment

Very nice, thanks a lot! It will truly make it easier!

Comment on lines +177 to +185
if is_flash_attention_requested(self.config) and attention_similarity is not None:
    # Target guided masks are represented as float masks and are incompatible with Flash Attention
    # Fallback to SDPA for this call only so the rest of the model can still benefit from FA
    attention_interface = ALL_ATTENTION_FUNCTIONS["sdpa"]
    logger.warning_once(
        "Falling back to SDPA for target-guided attention because "
        "Flash Attention does not support additive bias masks."
    )

Member:

Was not there before, but trusting you on this one!

Contributor (Author):

Discussed this with @yonigozlan internally. This feature seems to be rarely used (it should be None in most cases), and these models often have several different attention mechanisms; we want to fall back here to support FA at least partially (similarly done in SAM3).

Comment on lines 215 to 217
Checks whether some flavor of flash attention is requested or not.
Priority order first goes for any explicitly passed value `requested_attention_implementation` and
then checks the config's saved value `config._attn_implementation`.
Member:

Maybe let's just raise if both are provided, no? It does not make sense to give both.

Contributor (Author):

Yeah, makes sense; I changed the logic and description to enforce passing just one, not both.
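
Updating the earlier sketch accordingly, the "exactly one, not both" guard could look like this (hypothetical signature and error message; the actual change in the PR may differ):

```python
def is_flash_attention_requested(config=None, requested_attention_implementation=None):
    """Sketch: accept either a config or an explicit implementation string, never both."""
    if (config is None) == (requested_attention_implementation is None):
        raise ValueError(
            "Pass exactly one of `config` or `requested_attention_implementation`."
        )
    attn_implementation = (
        requested_attention_implementation
        if requested_attention_implementation is not None
        else config._attn_implementation
    )
    return attn_implementation is not None and "flash" in attn_implementation
```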

@github-actions bot commented
[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, altclip, auto, autoformer, bamba, bark, bloom, clipseg, codegen, deepseek_v2, deepseek_v3, edgetam, edgetam_video, ernie4_5_vl_moe, falcon, falcon_h1

@github-actions bot commented
View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43121&sha=6c5129

@vasqu merged commit 9eea1b0 into huggingface:main on Jan 15, 2026
25 checks passed
@vasqu deleted the fix-general-fa-checks branch on January 15, 2026 at 19:38