[ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms #13844
`flash_attn_varlen_func` in upstream `flash-attn` does not support the `return_softmax_lse` argument. This PR works around that issue by explicitly disabling chunked prefill and prefix caching on non-CUDA platforms.

Here are the results from running llmeval with `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` on an AMD system:

Here are the results from the same model on an H100 system without this PR:
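A minimal sketch of the workaround described above: when MLA is enabled on a non-CUDA platform, chunked prefill and prefix caching are forced off at config time. The names here (`SchedulerConfig` fields, the platform probe) are illustrative assumptions, not vLLM's exact internals.

```python
from dataclasses import dataclass


def is_cuda_platform() -> bool:
    # Hypothetical placeholder for a platform probe such as
    # vllm.platforms.current_platform.is_cuda(); hardcoded here for the sketch.
    return False


@dataclass
class SchedulerConfig:
    # Illustrative config fields, not vLLM's exact attribute names.
    enable_chunked_prefill: bool = True
    enable_prefix_caching: bool = True


def apply_mla_workaround(cfg: SchedulerConfig, use_mla: bool) -> SchedulerConfig:
    """Disable chunked prefill and prefix caching when MLA runs off CUDA,
    since the upstream flash-attn kernel lacks return_softmax_lse support."""
    if use_mla and not is_cuda_platform():
        cfg.enable_chunked_prefill = False
        cfg.enable_prefix_caching = False
    return cfg
```

On CUDA (or with MLA disabled) the config passes through unchanged, so the restriction only applies where the unsupported kernel path would be hit.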