
[Feat] support HMA using full attention store(FA) & window attention store(WA) #942

Draft

wuhuxiao wants to merge 1 commit into develop from codex/deepseek-v4-ucm-hit-latency-modelengine

Conversation


wuhuxiao (Contributor) commented Apr 29, 2026

Purpose

Introduce a dedicated HMA/FAWA connector implementation for mixed full-attention and window-attention KV cache groups, and keep ucm_connector.py focused as the vLLM-facing connector entrypoint.

The new connector supports generic KV cache group layouts from vLLM's upper-layer kv_cache_config instead of relying on model-specific group mapping. It stores full-attention groups and window-attention groups separately, so prefix-cache load/save behavior matches their different reuse patterns.

Modifications

  • Added ucm/integration/vllm/hma_connector.py.
    • Moved HMA/FAWA-specific logic out of ucm_connector.py.
    • Added KVCacheGroupLayout for flattened per-group KV tensor pointer layout.
    • Added FAWA request metadata and load/dump task dataclasses.
    • Added UCMFAWAConnector.
  • Updated UCMConnector to support HMA.
    • UCMConnector now inherits SupportsHMA.
    • It accepts and forwards kv_cache_config.
    • It auto-selects UCMFAWAConnector when the vLLM KV cache config contains both full-attention and window-attention groups, or when fawa_store is explicitly enabled (see the selection sketch after this list).
  • Implemented generic FA/WA store design.
    • fa_store stores full-attention KV cache groups.
    • wa_store stores window-attention KV cache groups.
    • FA blocks are loaded for every external prefix hit.
    • WA blocks are saved at each canonical block boundary and only loaded for the final matched prefix boundary (see the load-planning sketch after this list).
  • Implemented generic KV cache group handling.
    • KV cache groups are derived directly from vLLM kv_cache_config.kv_cache_groups.
    • Full-attention groups and window-attention groups are partitioned using sliding_window / attention_chunk_size (see the selection sketch after this list).
    • No DeepSeekV4-specific group naming or hardcoded group mapping remains in the FAWA connector logic.
  • Added compressor-state tail block handling.
    • Tail block count is calculated from window size, storage block size, and compressor ratio (see the tail-block sketch after this list).
    • Compressor ratio is read from the group spec or from the model config's compress_ratios.
    • A tail count of 0 falls out naturally when window <= ratio.
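
For reference, a minimal sketch of how the partitioning and auto-selection described above could look. The kv_cache_groups / kv_cache_spec shape follows vLLM's kv_cache_config, while the helper names and the fawa_store flag handling are taken from this PR's description; treat it as an illustration, not the actual connector code:

```python
def partition_groups(kv_cache_config):
    """Split vLLM KV cache groups into full-attention vs. window-attention.

    Illustrative: a group counts as window attention when its spec carries
    a sliding_window or attention_chunk_size; otherwise it is full attention.
    """
    fa_groups, window_groups = [], []
    for idx, group in enumerate(kv_cache_config.kv_cache_groups):
        spec = group.kv_cache_spec
        if getattr(spec, "sliding_window", None) or getattr(
            spec, "attention_chunk_size", None
        ):
            window_groups.append(idx)
        else:
            fa_groups.append(idx)
    return tuple(fa_groups), tuple(window_groups)


def needs_fawa_connector(kv_cache_config, fawa_store_enabled=False):
    """UCMFAWAConnector is chosen when both group kinds are present,
    or when fawa_store is explicitly enabled."""
    fa_groups, window_groups = partition_groups(kv_cache_config)
    return (bool(fa_groups) and bool(window_groups)) or fawa_store_enabled
```

With the layout verified in the test below, this would yield fa_groups=(0,) and window_groups=(1, 2, 3, 4).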
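
The reuse asymmetry between the two stores can be sketched the same way; block identifiers and the boundary argument are illustrative:

```python
def plan_external_load(hit_blocks, final_boundary_block):
    """Illustrative: what each store contributes on an external prefix hit."""
    # fa_store: every externally hit block is loaded, since full-attention
    # KV entries stay valid for any prefix extension.
    fa_loads = list(hit_blocks)
    # wa_store: only the block at the final matched prefix boundary is
    # loaded, because window-attention state is only reusable at the exact
    # position where it was saved.
    wa_loads = [] if final_boundary_block is None else [final_boundary_block]
    return fa_loads, wa_loads
```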
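
Finally, one plausible reading of the tail-block arithmetic; the real formula may differ in boundary handling, but the inputs (window size, storage block size, compressor ratio) and the zero case follow the description above:

```python
import math


def tail_block_count(window_size, block_size, compress_ratio):
    """Illustrative tail-block estimate for a compressed window group.

    Roughly window_size // compress_ratio tokens of window state survive
    compression and must be kept as uncompressed tail blocks.
    """
    if compress_ratio is None:
        return None  # full-attention group: tail handling does not apply
    remaining_tokens = window_size // compress_ratio
    # Collapses to 0 when the window fits inside one compression ratio,
    # matching the "window <= ratio" case above.
    return math.ceil(remaining_tokens / block_size)
```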

Test

  • Static checks:
    • python3 -m py_compile ucm/integration/vllm/ucm_connector.py ucm/integration/vllm/hma_connector.py
    • git diff --check -- ucm/integration/vllm/ucm_connector.py ucm/integration/vllm/hma_connector.py
  • End-to-end validation:
    • Ran /vllm-workspace/offline_inference.py with UCM enabled on DeepSeek-V4-Flash using 4 GPUs.
    • Verified FAWA connector was loaded from hma_connector.py.
    • Verified group config:
      • fa_groups=(0,)
      • window_groups=(1, 2, 3, 4)
      • block_sizes=(256, 64, 64, 4, 8)
      • tail_blocks=(None, 2, 2, 1, 0)
    • Verified first request had no external hit.
    • Verified second request hit 2 external prefix blocks.
    • Inference completed and shut down successfully.

wuhuxiao force-pushed the codex/deepseek-v4-ucm-hit-latency-modelengine branch 2 times, most recently from 40c7703 to b77d328 on May 6, 2026 at 14:21
wuhuxiao changed the title from [codex] Optimize DeepSeek V4 UCM hit path to [Feat] support HMA using full attention store(FA) & window attention store(WA) on May 6, 2026
wuhuxiao force-pushed the codex/deepseek-v4-ucm-hit-latency-modelengine branch from b77d328 to 99184e4 on May 6, 2026 at 14:32