
Conversation

Contributor

@pisceskkk pisceskkk commented Nov 14, 2025

Purpose

This PR, split out from the full PR #26864, adds support for Prefill Context Parallelism (PCP) with GQA on the FlashInfer backend, following PR #28718. For implementation details, please refer to RFC #25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage.

The current implementation primarily includes the following changes:

  • Modified ModelRunner.py to add the PCP token-partitioning logic (see the sketch below);
  • Modified flashinfer.py to adapt the FlashInfer GQA backend to PCP;
  • Added PrefillContextParallelMetadata, shared across attention backends;
  • Renamed variables and functions shared by both PCP and DCP.
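
As a rough illustration of the token-partitioning idea, here is a minimal sketch inferred from the head/tail naming used in the backend changes; the function name and the exact scheme are illustrative, not the code added in ModelRunner.py:

import torch

def split_prefill_tokens_head_tail(token_ids: torch.Tensor, pcp_rank: int, pcp_world_size: int) -> torch.Tensor:
    # Split one prefill sequence into 2 * world_size chunks and give rank r
    # chunk r (the "head") plus chunk 2 * world_size - 1 - r (the "tail"),
    # so every rank gets a balanced share of the causal-attention work.
    # Illustrative only; the real partitioning also handles batching and padding.
    chunks = token_ids.chunk(2 * pcp_world_size)
    head = chunks[pcp_rank]
    tail = chunks[2 * pcp_world_size - 1 - pcp_rank]
    return torch.cat([head, tail])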

Test Plan

Qwen/Qwen2.5-3B

export VLLM_ATTENTION_BACKEND='FLASHINFER'
vllm serve Qwen/Qwen2.5-3B --tensor-parallel-size 4 --decode-context-parallel-size 2 --prefill-context-parallel-size 2 --dcp-kv-cache-interleave-size 8

Test Result

gsm8k eval

All runs evaluate gsm8kdataset with metric avg@5 (mode: gen) via vllm-api-general-stream.

configuration                 avg@5
tp4 (17c540a)                 72.78
tp4 dcp2 interleave 8         72.43
tp4 pcp2 interleave 8         72.51
tp4 dcp2 pcp2 interleave 8    72.98

CC @LookAround0301 @FENP @gjc0824 @LucasWilkinson


mergify bot commented Nov 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @pisceskkk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for Prefill Context Parallelism (PCP) for GQA with flashinfer, which is a significant feature for enhancing long-sequence inference. The changes are extensive, touching configuration, parallel state management, attention backends, and the model runner. Overall, the implementation looks solid, but I've identified a few critical issues that need to be addressed. These include a duplicated command-line argument, a syntax error, a typo in a variable name, and incorrect tensor indexing, all of which could lead to runtime errors or prevent the code from running.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@pisceskkk pisceskkk force-pushed the pcp+flashinfer branch 2 times, most recently from c0f45f9 to 489b6c5 Compare November 18, 2025 09:12
@mergify mergify bot removed the needs-rebase label Nov 18, 2025
@pisceskkk pisceskkk force-pushed the pcp+flashinfer branch 2 times, most recently from 8bc261d to 58cbd8f Compare November 18, 2025 09:54

mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @pisceskkk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 19, 2025
pisceskkk and others added 2 commits November 20, 2025 09:04
Co-authored-by: QiuChunshuo <[email protected]>
Co-authored-by: FENP <[email protected]>
Co-authored-by: LookAround <[email protected]>
Co-authored-by: Jingchun Gao <[email protected]>
Co-authored-by: zhenwenqi2024 <[email protected]>
Signed-off-by: QiuChunshuo <[email protected]>
Signed-off-by: FENP <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Jingchun Gao <[email protected]>
Signed-off-by: zhenwenqi2024 <[email protected]>
Jingchun Gao added 2 commits November 23, 2025 23:04
Signed-off-by: Jingchun Gao <[email protected]>
Signed-off-by: Jingchun Gao <[email protected]>
@gjc0824 gjc0824 force-pushed the pcp+flashinfer branch 2 times, most recently from b9ed205 to d6bbe6d Compare November 24, 2025 01:59
Signed-off-by: Jingchun Gao <[email protected]>
Collaborator

@LucasWilkinson LucasWilkinson left a comment

Left some comments on #28988 which I think similarly apply here

self.is_mm_embed = self._make_buffer(max_num_tokens, dtype=torch.bool)

# Persistent buffers for Prefill Context Parallelism
if self.pcp_world_size > 1:
Collaborator

can we please separate all of this into a PCPManager or a utils file to make it more modular and easier to migrate to model runner v2?

Contributor

OK. We will make similar changes as in #28988.

Collaborator

@LucasWilkinson LucasWilkinson left a comment

Thanks for the contribution! A few more comments

dcp_local_seq_lens=common_attn_metadata.dcp_local_seq_lens,
cp_local_seq_lens=common_attn_metadata.cp_local_seq_lens,
cp_local_seq_lens_cpu=common_attn_metadata.cp_local_seq_lens_cpu,
pcp_metadata=common_attn_metadata.pcp_metadata,
Collaborator

if PCP + spec-decode is not yet supported should we be passing this blindly?

Contributor

Sure. We have removed these params to avoid misunderstanding.

]
)
else:
wrappers_to_check.append((prefill_wrapper._new_tokens, True))
Collaborator

this is kinda messy imo; can you just do something like:

class BatchCPPrefillWrapper:
    @property
    def _window_left(self):
        assert self._context._window_left == self._new_tokens._window_left
        return self._context._window_left
    ...

Let's also do:

class FlashInferImpl:
    def __init__(self, ...):
        self.is_debugging_mode = envs.VLLM_LOGGING_LEVEL == "DEBUG"

and then only run the asserts if self.is_debugging_mode to avoid excessive CPU overhead on each forward pass.
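
As a rough sketch of how that guard might look in the forward path (illustrative, reusing the attribute names from the snippet above; not the code in this PR):

# Only run cross-wrapper consistency checks in debug mode so the hot path
# pays no extra CPU cost per forward pass.
if self.is_debugging_mode:
    assert prefill_wrapper._context._window_left == prefill_wrapper._new_tokens._window_left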

Contributor

Thanks for your comments. We have changed it.


num_actual_tokens = attn_metadata.num_actual_tokens

if self.pcp_world_size > 1:
Collaborator

can this be moved into BatchCPPrefillWrapper.run(...)?

Contributor

@gjc0824 gjc0824 Nov 25, 2025

No. All-gathering and restoring the KV is required in both the prefill and decode phases, so it cannot be moved into the PrefillWrapper. But we have extracted it into vllm/v1/attention/backends/utils.py as a general function that other backends can reuse.
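
For reference, such a helper could look roughly like the sketch below; the function name, the restore_indices argument, and passing the group explicitly are assumptions for illustration, not the actual signature in utils.py:

import torch

def cp_all_gather_restore_kv(key: torch.Tensor, value: torch.Tensor, cp_group, restore_indices: torch.Tensor):
    # cp_group is assumed to expose .all_gather(tensor, dim=...) like the DCP
    # group used elsewhere in this backend (e.g. get_dcp_group()).
    key_full = cp_group.all_gather(key.contiguous(), dim=0)
    value_full = cp_group.all_gather(value.contiguous(), dim=0)
    # restore_indices maps gathered positions back to the original token order;
    # in the real helper this would come from the PCP metadata.
    return key_full[restore_indices], value_full[restore_indices]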

out,
lse,
get_dcp_group(),
return_lse=True,
Collaborator

should this be return_lse=self.pcp_world_size > 1

Contributor

Sure. We have changed it to avoid unnecessary computation.

kv_for_head_indices: torch.Tensor | None = None
kv_for_tail_indices: torch.Tensor | None = None
kv_for_head_indptr: torch.Tensor | None = None
kv_for_tail_indptr: torch.Tensor | None = None
Collaborator

this seems FlashInfer specific? we shouldn't have backend specific things (or styles) in CommonAttentionMetadata

Contributor

Sure. The *_indptr params are FlashInfer specific, so we removed them from the general PrefillContextParallelMetadata and moved the computation of PrefillContextParallelMetadata to the attention-wrapper build stage.
When building the attention wrapper, we compute the *_indptr params in flashinfer.py and then use general functions such as get_q_indices to construct the PrefillContextParallelMetadata. We think this refactoring not only keeps backend-specific parameters where they belong but also reduces the dependency on the model runner.
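
As a rough illustration of the indptr-to-indices relationship used here (names are illustrative; the real get_q_indices and metadata construction may differ):

import torch

def indptr_to_indices(indptr: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    # indptr is CSR-style ([0, n0, n0 + n1, ...]) delimiting each request's
    # slots; offsets[i] is where request i's tokens start in the flattened
    # buffer. Returns one flat index per slot.
    lengths = indptr[1:] - indptr[:-1]
    starts = torch.repeat_interleave(offsets, lengths)
    local = torch.arange(int(indptr[-1])) - torch.repeat_interleave(indptr[:-1], lengths)
    return starts + local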

k_head = torch.index_select(key, 0, kv_for_head_indices)
v_head = torch.index_select(value, 0, kv_for_head_indices)
k_tail = torch.index_select(key, 0, kv_for_tail_indices)
v_tail = torch.index_select(value, 0, kv_for_tail_indices)
Collaborator

so many index_select seems very expensive; have you profiled this?

Contributor

Yes, we profiled it. Because PCP is only meaningful for long input sequences, we ran performance tests with an input sequence length of 32k. The cost of the index_select calls is negligible compared to the benefits brought by PCP.

**common_kwargs,
)

def _attention_with_head_and_tail(
Collaborator

can this be moved into the BatchCPPrefillWrapper?

Contributor

Sure. We have moved it into BatchCPPrefillWrapper.run().

Comment on lines 310 to 330
if return_lse:
    output_head, lse_head = output_head
    output_tail, lse_tail = output_tail
    output = torch.index_select(
        torch.cat([output_head, output_tail], dim=0),
        0,
        q_full_indices,
    )
    lse = torch.index_select(
        torch.cat([lse_head, lse_tail], dim=0),
        0,
        q_full_indices,
    )
    return output, lse
else:
    output = torch.index_select(
        torch.cat([output_head, output_tail], dim=0),
        0,
        q_full_indices,
    )
    return output
Collaborator

Suggested change (replacing the block quoted above):

if return_lse:
    output_head, lse_head = output_head
    output_tail, lse_tail = output_tail
    lse = torch.index_select(
        torch.cat([lse_head, lse_tail], dim=0),
        0,
        q_full_indices,
    )
output = torch.index_select(
    torch.cat([output_head, output_tail], dim=0),
    0,
    q_full_indices,
)
return output if not return_lse else (output, lse)

Contributor

Thanks. We have changed it.

output_context_tmp, lse_context_tmp, get_dcp_group(), return_lse=True
)
lse_context = lse_context.transpose(0, 1).contiguous()
if self.pcp_world_size > 1:
Contributor

I wonder whether we forgot to add the computation between kv_cache_permute and prefill_query here? Then something like cp_lse_ag_out_rs, cp_lse_ag_out_ar and merge_attn_states would correct the output?

Contributor

Thanks for your comments. The current PCP implementation on FlashInfer does not yet support chunked prefill or prefix caching, so the KV context can be ignored here. Subsequent PRs will add further support.


mergify bot commented Nov 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @pisceskkk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 25, 2025
prefill_query_across_dcp = get_dcp_group().all_gather(
prefill_query.contiguous(), dim=1
)
output_context_tmp, lse_context_tmp = self._context.run(
Contributor

@Livinfly Livinfly Nov 25, 2025

As we store the input keys and values in the cache via reshape_and_cache_flash in FlashInferImpl.forward, won't this _context computation between kv_cache_permute and prefill_query_across_dcp leak / recompute the input keys and values?

Contributor

Thanks. Same reasoning as in the previous comment.

Jingchun Gao added 2 commits November 25, 2025 21:07
Signed-off-by: Jingchun Gao <[email protected]>
workspace_buffer, get_kv_cache_layout()
)
pin_memory = is_pin_memory_available()
self.pcp_q_indptr_cpu = torch.zeros(
Contributor

Just wondering — is this wrapper expected to be instantiated only once? I'm concerned about potential repeated pinned memory allocation across instances.

Contributor Author

Yes, to my knowledge, these wrappers are initialized only once. Subsequent inference steps only invoke their plan and run functions.
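
For context, the allocate-once pattern under discussion looks roughly like this (illustrative sketch, not the exact wrapper code):

import torch

class WrapperSketch:
    def __init__(self, max_num_reqs: int, pin_memory: bool):
        # Pinned CPU buffer allocated a single time at construction.
        self.pcp_q_indptr_cpu = torch.zeros(
            max_num_reqs + 1, dtype=torch.int32, pin_memory=pin_memory
        )

    def plan(self, q_lens: torch.Tensor) -> None:
        # Each step only fills the persistent buffer in place; no new
        # pinned allocation happens here.
        n = q_lens.numel()
        self.pcp_q_indptr_cpu[1 : n + 1] = torch.cumsum(q_lens, dim=0)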
