[GPUModelRunner] Refactor initialize_kv_cache #27935

heheda12345 · 2025-11-02T07:13:07Z

Purpose

Clean up initialize_kv_cache after several months of rapid development of attention-related features, simplifying the code and the dependencies between its steps. It mainly touches the logic for:

KV sharing
encoder-only attention
hybrid Mamba

Key modifications:

Some steps can be done right after model loading. Move them to the end of load_models. [may need some discussion]
Many operations (e.g., computing kernel_block_size) only need the AttentionBackend type of each layer, so they don't have to wait until the builder is initialized.
The kernel_block_size computation now includes EncoderOnlyAttention.
Simplify the attn_group split logic.
Start moving some logic into utils so that it can be reused by ModelRunnerV2 in the future.
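
To illustrate the second and fourth items, here is a minimal sketch of how layers could be grouped and a kernel block size chosen from backend types alone, before any metadata builder exists. Everything in it is an assumption for illustration: the `FakeBackendA`/`FakeBackendB` classes, the `supported_kernel_block_sizes` attribute, both helper functions, and the selection rule (largest size supported by all backends that divides the KV cache block size) are hypothetical and are not taken from vLLM or from this PR's diff.

```python
from collections import defaultdict


class FakeBackendA:
    """Hypothetical stand-in for an AttentionBackend subclass."""
    supported_kernel_block_sizes = (16, 32, 64)


class FakeBackendB:
    """Second hypothetical backend with a different set of supported sizes."""
    supported_kernel_block_sizes = (32, 64, 128)


def group_layers_by_backend(layer_backends):
    """Group layer names by backend *type*; no builder instance is needed."""
    groups = defaultdict(list)
    for layer_name, backend_cls in layer_backends.items():
        groups[backend_cls].append(layer_name)
    return dict(groups)


def select_kernel_block_size(kv_block_size, backend_classes):
    """Pick the largest block size that every backend supports and that
    evenly divides the KV cache block size (assumed rule, for illustration)."""
    backend_classes = list(backend_classes)
    if not backend_classes:
        raise ValueError("at least one backend is required")
    common = set.intersection(
        *(set(cls.supported_kernel_block_sizes) for cls in backend_classes)
    )
    candidates = [size for size in common if kv_block_size % size == 0]
    if not candidates:
        raise ValueError(
            f"no common kernel block size divides kv_block_size={kv_block_size}"
        )
    return max(candidates)


# Hypothetical layer-to-backend mapping, known as soon as the model is loaded.
layer_backends = {
    "layers.0.attn": FakeBackendA,
    "layers.1.attn": FakeBackendB,
    "layers.2.attn": FakeBackendA,
}
groups = group_layers_by_backend(layer_backends)
kernel_block_size = select_kernel_block_size(128, groups.keys())
```

Because both helpers take only classes and plain integers, they could live in a shared utils module and run at the end of model loading, which is the kind of decoupling the list above describes.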

I can split the above items into separate PRs if needed.

depends on
#27929
#27753

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify · 2025-11-06T15:33:44Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-11-11T12:53:56Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the v1 label Nov 2, 2025

mergify bot added the needs-rebase label Nov 6, 2025

init

bd85259

Signed-off-by: Chen Zhang <[email protected]>

heheda12345 force-pushed the refactor_attn_init branch from 9689fba to bd85259 Compare November 7, 2025 00:03

Merge branch 'main' of github.com:vllm-project/vllm into refactor_attn_init

c505265

Signed-off-by: Chen Zhang <[email protected]>

mergify bot removed the needs-rebase label Nov 7, 2025

clean up

dc15b35

Signed-off-by: Chen Zhang <[email protected]>

heheda12345 mentioned this pull request Nov 7, 2025

[GPUModelRunner] initialize_kv_cache cleanup (1/N): move initialization that doesn't depend on kv cache config to load_model #28258

Open

5 tasks

mergify bot added the needs-rebase label Nov 11, 2025
