
Conversation

@heheda12345 (Collaborator) commented Nov 7, 2025

Purpose

Some steps in model initialization only depend on layer attributes and do not depend on the KV cache config. Move them to load_model so that initialize_kv_cache can focus on KV cache and attention backend initialization.

This PR also starts to change some functions in the GPU model runner into pure functions so that they can be reused by model runner v2 in the future.
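
As a rough illustration of that pattern (everything below is a simplified stand-in, not the PR's actual code; only the helper name get_runner_only_attn_layers comes from the PR), the layer-derived state becomes a free function that depends only on the config, and load_model just assigns the result:

from dataclasses import dataclass, field


@dataclass
class FakeVllmConfig:
    # Stand-in for vllm.config.VllmConfig: just the per-layer info the
    # helper needs, keyed by layer name.
    kv_sharing_target: dict[str, str] = field(default_factory=dict)
    attn_layer_names: list[str] = field(default_factory=list)


def get_runner_only_attn_layers(vllm_config: FakeVllmConfig) -> set[str]:
    # Pure function: reads only its argument and mutates no runner state,
    # so the GPU model runner and a future model runner v2 can share it.
    # (Illustrative rule: layers that reuse another layer's KV cache.)
    return {
        name
        for name in vllm_config.attn_layer_names
        if name in vllm_config.kv_sharing_target
    }


# In load_model(), the runner simply assigns the result:
# self.runner_only_attn_layers = get_runner_only_attn_layers(self.vllm_config)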

Note that test_kv_sharing_fast_prefill fails on main and is marked as optional. This PR hits the same error in that test.

Split from #27935

Test Plan

kv sharing:
Run basic.py with:

llm = LLM(
    model="google/gemma-3n-E2B-it",
    enforce_eager=True,
    kv_sharing_fast_prefill=True,
)
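
For completeness, a self-contained approximation of that run (this assumes the stock prompts and sampling params from the basic.py example, so the script below is a sketch rather than the exact file):

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="google/gemma-3n-E2B-it",
    enforce_eager=True,
    kv_sharing_fast_prefill=True,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt:    {output.prompt!r}")
    print(f"Output:    {output.outputs[0].text!r}")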

encoder only:
pytest -vs tests/entrypoints/pooling/llm/test_embedding.py::test_pooling_params

Test Result

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " Alex, and I'm a freelance graphic designer. I'm passionate about"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' currently taking a trip to Europe. He is visiting several countries, including France,'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' Paris.\n\nThis is a true statement.\n'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    " a complex and fascinating topic. Here's a breakdown of key trends, potential"
------------------------------------------------------------

pytest -vs tests/entrypoints/pooling/llm/test_embedding.py::test_pooling_params

test passed


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Chen Zhang <[email protected]>
@mergify mergify bot added the v1 label Nov 7, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request is a good step towards cleaning up the GPUModelRunner by moving initialization logic that doesn't depend on KVCacheConfig from initialize_kv_cache to load_model. The introduction of pure functions in utils.py is also a positive change for future code reuse. The refactoring appears to be correct and well-executed. I have one suggestion regarding code duplication to further improve the cleanup.

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@LucasWilkinson (Collaborator) left a comment


It looks like get_attn_backend_cls is only used by _check_and_update_cudagraph_mode

        attention_backends = set(
            get_attn_backend_cls(
                self.vllm_config, self.kv_sharing_fast_prefill_eligible_layers
            ).values()
        )

Maybe we should use it in initialize_attn_backend too; i.e.

    def initialize_attn_backend(self, kv_cache_config: KVCacheConfig) -> None:
        """
        Initialize the attention backends and attention metadata builders.
        """
        assert len(self.attn_groups) == 0, "Attention backends are already initialized"

        class AttentionGroupKey(NamedTuple):
            attn_backend: type[AttentionBackend]
            kv_cache_spec: KVCacheSpec

        attn_backends_dict = get_attn_backend_cls(
                self.vllm_config, self.kv_sharing_fast_prefill_eligible_layers
        )

        def get_attn_backends_for_group(
            kv_cache_group_spec: KVCacheGroupSpec,
        ) -> tuple[dict[AttentionGroupKey, list[str]], set[type[AttentionBackend]]]:
            ...
            for layer_name in kv_cache_group_spec.layer_names:
                attn_backend = attn_backends_dict[layer_name]

then we can get rid of the duplicate

if layer_name in self.kv_sharing_fast_prefill_eligible_layers:
    attn_backend = create_fast_prefill_custom_backend(
        "FastPrefill",
        attn_backend,
    )

(or we should make it just return a set if it's only used by _check_and_update_cudagraph_mode)

def get_attn_backend_cls(

nit: can we name this get_attn_backend_clss or get_attn_backend_cls_dict? Currently the name kinda implies it only returns one backend class.

@heheda12345 (Collaborator, Author) replied:

good suggestion. Updated.

@heheda12345 (Collaborator, Author) commented Nov 7, 2025

@LucasWilkinson you can check #27935 for the final version of my plan. I'll use get_attn_backend_cls in more places (including initialize_attn_backend). It will be done in a future PR.

@markmc (Member) left a comment


Moving these steps into load_model() lgtm, but I definitely can't guarantee I haven't missed something subtle!

(
    self.shared_kv_cache_layers,
    self.kv_sharing_fast_prefill_eligible_layers,
    self.kv_sharing_fast_prefill_logits_indices,

This is already initialized in the constructor; did you forget to remove it from there?

But leaving it in the constructor seems fine? That would also mean the device arg can be removed from utils.kv_sharing() so that it only returns layers, which is a nice simplification.
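
A rough sketch of what that simplification could look like (the helper name and signature below are hypothetical, not the PR's actual code): the helper takes only the config, returns the derived layer info, and the constructor keeps allocating the logits-indices buffer itself, so no device argument is needed:

def get_kv_sharing_layers(vllm_config) -> tuple[dict[str, str], set[str]]:
    # Hypothetical pure helper: returns (shared_kv_cache_layers,
    # kv_sharing_fast_prefill_eligible_layers) derived from the config;
    # no device argument and no tensor allocation inside.
    shared_kv_cache_layers: dict[str, str] = {}
    fast_prefill_eligible_layers: set[str] = set()
    # ...populate both purely from vllm_config's layer attributes...
    return shared_kv_cache_layers, fast_prefill_eligible_layers

# The runner's constructor keeps owning
# self.kv_sharing_fast_prefill_logits_indices, as it does today.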

for layer_name, attn_module in attn_layers.items():
    attn_backend = attn_module.get_attn_backend()
    if layer_name in kv_sharing_fast_prefill_eligible_layers:
        attn_backend = create_fast_prefill_custom_backend(

Hmm, this gets created again later in initialize_attn_backend()?

Can this be avoided? E.g., can it be created somewhere earlier so that get_layers_from_vllm_config() returns it?


Sorry, this is probably a duplicate of @LucasWilkinson's comment.

)
self.runner_only_attn_layers.update(
    get_runner_only_attn_layers(self.vllm_config)
)

Can we get rid of the initialization in the constructor and the assertion here, and just do:

self.runner_only_attn_layers = get_runner_only_attn_layers(self.vllm_config)

check_ubatch_thresholds,
)
from vllm.v1.worker.utils import is_residual_scattered_for_sp
from vllm.v1.worker.utils import (

Just a thought ... what belongs in utils, and what belongs in the model runner? Are you putting these in utils so they can be re-used by other model runners? Is that a good refactoring goal in general - move as much code as possible into utils?


Oh, I see

This PR also starts to change some functions in gpu model runner into pure functions so that they can be reused by model runner v2 in the future.

I guess my preference would be to keep the purpose of the PR clean: move some steps into load_model() in this PR, and do the more complete "prepare for model runner v2" refactoring in a separate PR. It's hard to judge whether these functions are a positive refactoring move in the context of this PR.

Not a strong objection though 🤷

mergify bot commented Nov 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 13, 2025
