[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. #14664

tjtanaa · 2025-03-12T08:10:44Z

Description

This PR add the logic to enable ENCODER_ONLY model support to ROCm Flash Attention. Thus, enabling language embedding models and some vision language embedding models.

FIX #14062
FIX #14583

File changes:

vllm/attention/backends/rocm_flash_attn.py: Add ENCODER_ONLY code path
tests/models/embedding/language/test_embedding.py: Fix the code logic to enable unit tests for ROCm support. Re-group the embedding model to their correct category.
tests/models/embedding/language/test_cls_models.py: Only test "half" datatype for ROCm support as ROCm Flash Attention does not support Float32 attention.
tests/models/embedding/language/test_gritlm.py: Fix the bug where the tests are not skipped even if xformers package is not installed.

Tests

Locally ran tests:
- tests/models/embedding/language [Passed]
- tests/models/embedding/vision_language/test_dse_qwen2_vl.py [Passed]
- tests/models/embedding/vision_language/test_llava_next.py. This test is skipped as this HF model only works on CUDA, failed on CPU and ROCm.
- tests/models/embedding/vision_language/test_phi3v.py failed because of this BUG [Bug]: Unit test tests/models/embedding/vision_language/test_phi3v.py failing #14677
Since there is a change in the ROCm Flash Attention module, I have also ensure all the unittests below passed
- tests/models/decoder_only/language/test_models.py [Passed]

Experimental Results:

The same embedding model is run on A100, L40 and MI300X, the embedding output is compared:

Model launch command: VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Alibaba-NLP/gte-Qwen2-7B-instruct --dtype float16 --max-num-seqs 512 --gpu-memory-utilization 0.95 --hf-overrides '{"is_causal": false}' --trust-remote-code

Comparison of Embedding Model Performance Across Different Hardware

Comparison	Mean Difference	Abs Mean	Std	Min	Max	Cosine Similarity	L2 Norm Avg	% Dims with Rel Diff >1%
A100 vs L40	4.42e-07	3.39e-05	4.51e-05	-4.58e-04	3.66e-04	0.999996	0.00270	16.95%
A100 vs MI300X	9.57e-08	3.22e-05	4.31e-05	-3.36e-04	3.05e-04	0.999997	0.00256	15.33%
L40 vs MI300X	-3.46e-07	3.28e-05	4.35e-05	-2.75e-04	3.66e-04	0.999997	0.00257	15.64%
MI300X Triton vs CK	-6.02e-07	2.78e-05	3.69e-05	-2.44e-04	3.05e-04	0.999998	0.00220	13.74%

Note: All p-values from paired t-tests were >0.05, indicating no statistically significant differences between platforms.

Things to future PR

In the unit tests tests/models/embedding/language/test_embedding.py, if I ran
pytest.param("ssmits/Qwen2-7B-Instruct-embed-base"), directly after
pytest.param("Alibaba-NLP/gte-Qwen2-7B-instruct"), the "ssmits/Qwen2-7B-Instruct-embed-base" tests will fail.
e.g.

       # [Encoder-only]
       pytest.param("BAAI/bge-base-en-v1.5",
                    marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
       pytest.param("sentence-transformers/all-MiniLM-L12-v2"),
       pytest.param("intfloat/multilingual-e5-small"),
       # [Decoder-only]
       pytest.param("BAAI/bge-multilingual-gemma2",
                    marks=[pytest.mark.core_model]),
       pytest.param("intfloat/e5-mistral-7b-instruct",
                    marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
       pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct"),
       pytest.param("Alibaba-NLP/gte-Qwen2-7B-instruct"),
       pytest.param("ssmits/Qwen2-7B-Instruct-embed-base"),
       # [Cross-Encoder]
       pytest.param("sentence-transformers/stsb-roberta-base-v2"),

Error message snippet:

>           assert sim >= 1 - tol, fail_msg                                                                                     
E           AssertionError: Test0:                                                                                              
E           hf: array([-0.3716, -0.7935, -1.015 ,  1.553 , -0.0258, -1.482 , -1.167 ,                                           
E                  -1.301 ,  1.342 , -5.906 ,  0.797 , -1.126 ,  1.018 , -0.726 ,                                               
E                   2.041 ,  0.316 ], dtype=float16)                                                                            
E           vllm:       [-0.0032596588134765625, -7.50422477722168e-05, -0.0025844573974609375, 0.00513458251953125, -0.00027632
713317871094, -0.012969970703125, -0.00785064697265625, -0.00406646728515625, -0.0018939971923828125, -0.043060302734375, 0.0105
8197021484375, 0.0023040771484375, 0.0010004043579101562, -0.00409698486328125, 0.0022144317626953125, -0.0022106170654296875]  
                                                                                                                                
tests/models/embedding/utils.py:32: AssertionError

It might be due to the fact that the environment or resource are not deallocated correctly between unit tests run.

Signed-off-by: tjtanaa <[email protected]>

github-actions · 2025-03-12T08:10:53Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

mergify · 2025-03-12T08:11:23Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tests/models/embedding/language/test_embedding.py

vllm/attention/backends/rocm_flash_attn.py

Signed-off-by: tjtanaa <[email protected]>

DarkLight1337

Thanks for implementing this!

…ash Attention to enable embedding models. (vllm-project#14664) Signed-off-by: tjtanaa <[email protected]> Signed-off-by: Richard Liu <[email protected]>

…ash Attention to enable embedding models. (vllm-project#14664) Signed-off-by: tjtanaa <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>

…ash Attention to enable embedding models. (vllm-project#14664) Signed-off-by: tjtanaa <[email protected]>

…ash Attention to enable embedding models. (vllm-project#14664) Signed-off-by: tjtanaa <[email protected]> Signed-off-by: Mu Huai <[email protected]>

tjtanaa added 3 commits March 11, 2025 17:06

add encoder only support

7014232

Signed-off-by: tjtanaa <[email protected]>

fix language embedding unittest for ROCm

a50971f

Signed-off-by: tjtanaa <[email protected]>

fix language embedding classification unittest for ROCm

a692621

Signed-off-by: tjtanaa <[email protected]>

mergify bot added the needs-rebase label Mar 12, 2025

tjtanaa mentioned this pull request Mar 12, 2025

[Bug]: Alibaba-NLP/gte-Qwen2-7B-instruct on AMD MI300X #14062

Closed

1 task

tjtanaa changed the title ~~[FEAT] [ROCm] Add encoder-only model support into ROCm Flash Attention to enable language embedding.~~ [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable language embedding. Mar 12, 2025

DarkLight1337 reviewed Mar 12, 2025

View reviewed changes

tests/models/embedding/language/test_embedding.py Show resolved Hide resolved

DarkLight1337 reviewed Mar 12, 2025

View reviewed changes

vllm/attention/backends/rocm_flash_attn.py Outdated Show resolved Hide resolved

merge changes from main; fix embedding/vision_language unittests

195ccd3

Signed-off-by: tjtanaa <[email protected]>

mergify bot added the ci/build label Mar 12, 2025

tjtanaa changed the title ~~[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable language embedding.~~ [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. Mar 12, 2025

merge changes from main

47156b5

Signed-off-by: tjtanaa <[email protected]>

tjtanaa marked this pull request as ready for review March 12, 2025 12:35

tjtanaa requested review from tlrmchlsmth and ywang96 as code owners March 12, 2025 12:35

mergify bot removed the needs-rebase label Mar 12, 2025

fix requirements/test/txt linting

4b5d587

Signed-off-by: tjtanaa <[email protected]>

DarkLight1337 approved these changes Mar 12, 2025

View reviewed changes

DarkLight1337 added the rocm Related to AMD ROCm label Mar 12, 2025

DarkLight1337 enabled auto-merge (squash) March 12, 2025 13:09

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 12, 2025

vllm-bot merged commit 916836b into vllm-project:main Mar 12, 2025
69 of 72 checks passed

tjtanaa deleted the add-encoder-only branch March 14, 2025 12:15

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Fl…

d23ab1e

…ash Attention to enable embedding models. (vllm-project#14664) Signed-off-by: tjtanaa <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. #14664

[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. #14664

Uh oh!

tjtanaa commented Mar 12, 2025 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Mar 12, 2025

Uh oh!

mergify bot commented Mar 12, 2025

Uh oh!

Uh oh!

Uh oh!

DarkLight1337 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. #14664

[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. #14664

Uh oh!

Conversation

tjtanaa commented Mar 12, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

File changes:

Tests

Experimental Results:

Comparison of Embedding Model Performance Across Different Hardware

Things to future PR

Uh oh!

github-actions bot commented Mar 12, 2025

Uh oh!

mergify bot commented Mar 12, 2025

Uh oh!

Uh oh!

Uh oh!

DarkLight1337 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tjtanaa commented Mar 12, 2025 •

edited by github-actions bot

Loading