[ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms #13844
`flash_attn_varlen_func` in upstream `flash-attn` does not support the `return_softmax_lse` argument. This PR works around that issue by explicitly disabling chunked prefill and prefix caching on non-CUDA platforms.

Here are the results from running llmeval with `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` on an AMD system:

Here are the results from the same model on an H100 system without this PR:
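A minimal sketch of the workaround described above: when MLA is enabled on a non-CUDA platform, chunked prefill and prefix caching are forced off at config time. The names here (`SchedulerConfig` fields, the platform probe) are illustrative assumptions, not vLLM's exact internals.

```python
from dataclasses import dataclass


def is_cuda_platform() -> bool:
    # Hypothetical placeholder for a platform probe such as
    # vllm.platforms.current_platform.is_cuda(); hardcoded here for the sketch.
    return False


@dataclass
class SchedulerConfig:
    # Illustrative config fields, not vLLM's exact attribute names.
    enable_chunked_prefill: bool = True
    enable_prefix_caching: bool = True


def apply_mla_workaround(cfg: SchedulerConfig, use_mla: bool) -> SchedulerConfig:
    """Disable chunked prefill and prefix caching when MLA runs off CUDA,
    since the upstream flash-attn kernel lacks return_softmax_lse support."""
    if use_mla and not is_cuda_platform():
        cfg.enable_chunked_prefill = False
        cfg.enable_prefix_caching = False
    return cfg
```

On CUDA (or with MLA disabled) the config passes through unchanged, so the restriction only applies where the unsupported kernel path would be hit.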