
Conversation

@ganyi1996ppo (Contributor) commented Oct 23, 2025

Purpose

This PR introduces a persistent MLA kernel implementation for AiterMLABackend, as well as fp8 support.

Test Plan

Verify the accuracy on gsm8k
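
One possible invocation (a sketch assuming lm-evaluation-harness; the model name and tensor_parallel_size are placeholders, not the exact command used here):

```bash
# Illustrative only: run the 5-shot GSM8K check against the vLLM backend.
lm_eval --model vllm \
  --model_args pretrained=deepseek-ai/DeepSeek-R1,tensor_parallel_size=8 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto
```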

Test Result

# bf16 acc result on gsm8k
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9522|±  |0.0059|
|     |       |strict-match    |     5|exact_match|↑  |0.9507|±  |0.0060|
# fp8 acc result on gsm8k
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.953|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.953|±  |0.0058|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

# Snippet from the diff under review: query the number of compute units on the current GPU.
gpu = torch.cuda.current_device()
device_properties = torch.cuda.get_device_properties(gpu)
cu_num = device_properties.multi_processor_count

Contributor

There is no need for lines 100-102; the metadata tensors can be created using aiter.get_mla_metadata_info_v1, which returns the size and dtype needed for each metadata tensor.
Also, persistent mode supports different head sizes, which lets us lift the restriction of num_heads to 16 or 128 by reusing the condition aiter applies at this line.
So instead of always using persistent mode we can make it conditional on num_heads, which means we can remove these lines:

assert num_heads == 16 or num_heads == 128, (
    f"Aiter MLA only supports 16 or 128 number of heads.\n"
    f"Provided {num_heads} number of heads.\n"
    "Try adjusting tensor_parallel_size value."
)

This way we support both persistent and non-persistent mode, and we can run the deepseek-R1 model at tp4, since persistent mode supports a flexible num_heads.

So we can use logic similar to the following:

        self.persistent_mode = False
        num_heads = self.num_heads

        if num_heads == 16 or num_heads in range(32, 513, 16):
            self.persistent_mode = True

        if self.persistent_mode:
            import aiter

            (
                (work_meta_data_size, w_dtype),
                (work_indptr_size, w_indptr_dtype),
                (work_info_set_size, w_info_set_dtype),
                (reduce_indptr_size, r_indptr_dtype),
                (reduce_final_map_size, r_final_map_dtype),
                (reduce_partial_map_size, r_partial_map_dtype),
            ) = aiter.get_mla_metadata_info_v1(
                max_num_reqs,
                1,  # mtp=1 
                self.num_heads,
                vllm_config.model_config.dtype,
                kv_cache_spec.dtype,
                is_sparse=False,
                fast_mode=True,
            )
            self.work_meta_data = torch.empty(
                work_meta_data_size, dtype=w_dtype, device=device
            )
            self.work_indptr = torch.empty(
                work_indptr_size, dtype=w_indptr_dtype, device=device
            )
            self.work_info_set = torch.empty(
                work_info_set_size, dtype=w_info_set_dtype, device=device
            )
            self.reduce_indptr = torch.empty(
                reduce_indptr_size, dtype=r_indptr_dtype, device=device
            )
            self.reduce_final_map = torch.empty(
                reduce_final_map_size, dtype=r_final_map_dtype, device=device
            )
            self.reduce_partial_map = torch.empty(
                reduce_partial_map_size, dtype=r_partial_map_dtype, device=device
            )

Contributor Author

Thanks for the suggestion, and sorry for the delay. From what I learned from aiter, the persistent kernel is mainly aimed at handling requests with widely varying lengths, while the non-persistent kernel is mainly for requests with more similar lengths. So instead of deciding by the number of heads, maybe we can leave it as an environment variable for the user?
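
One way that toggle could look (a minimal sketch; VLLM_ROCM_USE_AITER_MLA_PERSISTENT is an illustrative name, not an existing vLLM environment variable):

```python
import os

# Illustrative only: let the user opt in or out of the persistent MLA kernel.
# The variable name and default are placeholders, not an existing vLLM setting.
self.persistent_mode = (
    os.environ.get("VLLM_ROCM_USE_AITER_MLA_PERSISTENT", "1") == "1"
)
```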

max_seqlen_qo = torch.max(query_lens).item()

import aiter

Contributor

As suggested above, in the case where we want to use self.persistent_mode we would gate it with the code block below, so that non-persistent mode is still supported.
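
A minimal sketch of what that gating could look like in the metadata build step (branch bodies are placeholders, not code from this PR):

```python
# Sketch only: choose between the persistent and non-persistent paths.
max_seqlen_qo = torch.max(query_lens).item()

if self.persistent_mode:
    import aiter
    # Persistent path: populate the pre-allocated work_*/reduce_* metadata
    # tensors for the persistent MLA kernel.
    ...
else:
    # Non-persistent path: keep the existing AiterMLA metadata build unchanged.
    ...
```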

@wuhuikx (Contributor) commented Nov 18, 2025

Blocked by ROCm/aiter#1420.

@ganyi1996ppo force-pushed the ganyi/rocm_fp8_mla_and_persistent_kernel branch from f23e28b to bdb40c3 on November 27, 2025 07:42

Labels

rocm (Related to AMD ROCm), v1
