[Feat] support HMA using full attention store (FA) & window attention store (WA) #942
Purpose
Introduce a dedicated HMA/FAWA connector implementation for mixed full-attention and window-attention KV cache groups, and keep ucm_connector.py focused as the vLLM-facing connector entrypoint.
The new connector supports generic KV cache group layouts reported by the vLLM upper-layer kv_cache_config, rather than relying on a model-specific group mapping. It stores full-attention groups and window-attention groups separately so that prefix-cache load/save behavior matches their different reuse patterns. A rough sketch of the group-splitting idea is shown below.
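The following is a minimal illustrative sketch, not the actual PR code: it shows how a FAWA-style connector might split the KV cache groups it receives from the upper layer into a full-attention (FA) store and a window-attention (WA) store. All names here (KVCacheGroupInfo, FAWAGroupRouter, is_sliding_window, etc.) are hypothetical placeholders, not the real UCM or vLLM API.

```python
# Illustrative sketch only: splitting generic KV cache groups into FA/WA stores.
# Class and field names are assumptions, not the actual connector implementation.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KVCacheGroupInfo:
    """Stand-in for one KV cache group entry from the upper-layer kv_cache_config."""
    group_id: int
    layer_names: List[str]
    is_sliding_window: bool           # True for window-attention layers
    sliding_window: Optional[int] = None


@dataclass
class FAWAGroupRouter:
    """Routes each KV cache group to either the FA store or the WA store."""
    fa_groups: Dict[int, KVCacheGroupInfo] = field(default_factory=dict)
    wa_groups: Dict[int, KVCacheGroupInfo] = field(default_factory=dict)

    def register(self, groups: List[KVCacheGroupInfo]) -> None:
        # Generic layout: iterate whatever groups the upper layer reports,
        # with no hard-coded per-model group mapping.
        for group in groups:
            target = self.wa_groups if group.is_sliding_window else self.fa_groups
            target[group.group_id] = group

    def store_for(self, group_id: int) -> str:
        # FA groups keep long prefixes reusable, while WA groups only ever need
        # the last `sliding_window` tokens, so each kind goes to its own store
        # with a load/save policy that matches its reuse pattern.
        return "WA" if group_id in self.wa_groups else "FA"


if __name__ == "__main__":
    router = FAWAGroupRouter()
    router.register([
        KVCacheGroupInfo(0, ["layers.0", "layers.2"], is_sliding_window=False),
        KVCacheGroupInfo(1, ["layers.1", "layers.3"], is_sliding_window=True,
                         sliding_window=4096),
    ])
    assert router.store_for(0) == "FA" and router.store_for(1) == "WA"
```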
Modifications
Test