
Conversation


@tjtanaa tjtanaa commented Mar 12, 2025

Description

This PR adds the logic to enable ENCODER_ONLY model support in ROCm Flash Attention, thus enabling language embedding models and some vision-language embedding models.

FIX #14062
FIX #14583

File changes:

  • vllm/attention/backends/rocm_flash_attn.py: Add ENCODER_ONLY code path
  • tests/models/embedding/language/test_embedding.py: Fix the test logic to enable the unit tests on ROCm, and re-group the embedding models into their correct categories.
  • tests/models/embedding/language/test_cls_models.py: Only test the "half" dtype on ROCm, as ROCm Flash Attention does not support float32 attention.
  • tests/models/embedding/language/test_gritlm.py: Fix a bug where the tests were not skipped when the xformers package is not installed.
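As a rough illustration (not the actual `vllm/attention/backends/rocm_flash_attn.py` code), the key difference for an ENCODER_ONLY path is that attention over the prompt runs without a causal mask, so every token can attend to every other token:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal):
    """Toy single-head attention; q, k, v have shape (seq, head_dim).

    causal=True models the decoder path (token i sees tokens <= i);
    causal=False models the encoder-only path (full bidirectional attention).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        seq = q.shape[0]
        # Mask out future positions (strict upper triangle).
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v
```

With `causal=True` the first token's output depends only on `v[0]`; with `causal=False` it mixes in later tokens, which is what embedding models such as the ones above (run with `"is_causal": false`) rely on.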

Tests

  • Ran the tests locally.

  • Since this PR changes the ROCm Flash Attention module, I have also ensured that all of the unit tests below pass:

    • tests/models/decoder_only/language/test_models.py [Passed]

Experimental Results:

The same embedding model was run on A100, L40, and MI300X, and the embedding outputs were compared:

Model launch command: VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Alibaba-NLP/gte-Qwen2-7B-instruct --dtype float16 --max-num-seqs 512 --gpu-memory-utilization 0.95 --hf-overrides '{"is_causal": false}' --trust-remote-code

Comparison of Embedding Model Performance Across Different Hardware

| Comparison | Mean Difference | Abs Mean | Std | Min | Max | Cosine Similarity | L2 Norm | Avg % Dims with Rel Diff >1% |
|---|---|---|---|---|---|---|---|---|
| A100 vs L40 | 4.42e-07 | 3.39e-05 | 4.51e-05 | -4.58e-04 | 3.66e-04 | 0.999996 | 0.00270 | 16.95% |
| A100 vs MI300X | 9.57e-08 | 3.22e-05 | 4.31e-05 | -3.36e-04 | 3.05e-04 | 0.999997 | 0.00256 | 15.33% |
| L40 vs MI300X | -3.46e-07 | 3.28e-05 | 4.35e-05 | -2.75e-04 | 3.66e-04 | 0.999997 | 0.00257 | 15.64% |
| MI300X Triton vs CK | -6.02e-07 | 2.78e-05 | 3.69e-05 | -2.44e-04 | 3.05e-04 | 0.999998 | 0.00220 | 13.74% |

Note: All p-values from paired t-tests were >0.05, indicating no statistically significant differences between platforms.
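For reference, the per-column statistics in the table can be computed along these lines (a sketch; the function name, the 1% threshold interpretation, and the epsilon guard are my assumptions, not the script actually used):

```python
import numpy as np

def compare_embeddings(a, b, rel_tol=0.01, eps=1e-12):
    """Compare two embedding vectors, mirroring the table columns above."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    diff = a - b
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
    return {
        "mean_diff": float(diff.mean()),
        "abs_mean": float(np.abs(diff).mean()),
        "std": float(diff.std()),
        "min": float(diff.min()),
        "max": float(diff.max()),
        "cosine_similarity": cos,
        "l2_norm": float(np.linalg.norm(diff)),
        # Percentage of dimensions whose relative difference exceeds 1%.
        "pct_dims_rel_diff_gt_1pct": float(
            (np.abs(diff) / (np.abs(b) + eps) > rel_tol).mean() * 100.0
        ),
    }
```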

Things for a future PR

In the unit tests tests/models/embedding/language/test_embedding.py, if I run
pytest.param("ssmits/Qwen2-7B-Instruct-embed-base") directly after
pytest.param("Alibaba-NLP/gte-Qwen2-7B-instruct"), the "ssmits/Qwen2-7B-Instruct-embed-base" tests fail.
e.g.

       # [Encoder-only]
       pytest.param("BAAI/bge-base-en-v1.5",
                    marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
       pytest.param("sentence-transformers/all-MiniLM-L12-v2"),
       pytest.param("intfloat/multilingual-e5-small"),
       # [Decoder-only]
       pytest.param("BAAI/bge-multilingual-gemma2",
                    marks=[pytest.mark.core_model]),
       pytest.param("intfloat/e5-mistral-7b-instruct",
                    marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
       pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct"),
       pytest.param("Alibaba-NLP/gte-Qwen2-7B-instruct"),
       pytest.param("ssmits/Qwen2-7B-Instruct-embed-base"),
       # [Cross-Encoder]
       pytest.param("sentence-transformers/stsb-roberta-base-v2"),

Error message snippet:

>           assert sim >= 1 - tol, fail_msg                                                                                     
E           AssertionError: Test0:                                                                                              
E           hf: array([-0.3716, -0.7935, -1.015 ,  1.553 , -0.0258, -1.482 , -1.167 ,                                           
E                  -1.301 ,  1.342 , -5.906 ,  0.797 , -1.126 ,  1.018 , -0.726 ,                                               
E                   2.041 ,  0.316 ], dtype=float16)                                                                            
E           vllm:       [-0.0032596588134765625, -7.50422477722168e-05, -0.0025844573974609375, 0.00513458251953125, -0.00027632713317871094, -0.012969970703125, -0.00785064697265625, -0.00406646728515625, -0.0018939971923828125, -0.043060302734375, 0.01058197021484375, 0.0023040771484375, 0.0010004043579101562, -0.00409698486328125, 0.0022144317626953125, -0.0022106170654296875]
                                                                                                                                
tests/models/embedding/utils.py:32: AssertionError

It might be due to the environment or resources not being deallocated correctly between unit test runs.
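One possible mitigation (a sketch only; this helper is hypothetical and has not been verified against this particular failure) is to force a best-effort resource cleanup between parametrized model runs, e.g. from an autouse pytest fixture:

```python
import gc
import importlib.util

def cleanup_between_tests():
    """Best-effort release of model memory between test runs.

    Hypothetical helper: in a real suite this would be called from an
    autouse pytest fixture after each parametrized model test.
    """
    # Drop Python-side references to the previous model.
    gc.collect()
    # Free cached GPU memory if torch is installed and a GPU is present.
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```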

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


mergify bot commented Mar 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 12, 2025
@tjtanaa tjtanaa changed the title [FEAT] [ROCm] Add encoder-only model support into ROCm Flash Attention to enable language embedding. [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable language embedding. Mar 12, 2025
@mergify mergify bot added the ci/build label Mar 12, 2025
@tjtanaa tjtanaa changed the title [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable language embedding. [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. Mar 12, 2025
@tjtanaa tjtanaa marked this pull request as ready for review March 12, 2025 12:35
@mergify mergify bot removed the needs-rebase label Mar 12, 2025

@DarkLight1337 DarkLight1337 left a comment


Thanks for implementing this!

@DarkLight1337 DarkLight1337 added the rocm Related to AMD ROCm label Mar 12, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 12, 2025 13:09
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 12, 2025
@vllm-bot vllm-bot merged commit 916836b into vllm-project:main Mar 12, 2025
69 of 72 checks passed
@tjtanaa tjtanaa deleted the add-encoder-only branch March 14, 2025 12:15
richardsliu pushed a commit to richardsliu/vllm that referenced this pull request Mar 14, 2025
…ash Attention to enable embedding models. (vllm-project#14664)

Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
…ash Attention to enable embedding models. (vllm-project#14664)

Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
…ash Attention to enable embedding models. (vllm-project#14664)

Signed-off-by: tjtanaa <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…ash Attention to enable embedding models. (vllm-project#14664)

Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: Mu Huai <[email protected]>