
[Feat] support HMA using full attention store(FA) & window attention store(WA) #942

Draft

wuhuxiao wants to merge 1 commit into develop from codex/deepseek-v4-ucm-hit-latency-modelengine

Conversation


wuhuxiao (Contributor) commented Apr 29, 2026

Purpose

Introduce a dedicated HMA/FAWA connector implementation for mixed full-attention and window-attention KV cache groups, and keep ucm_connector.py focused as the vLLM-facing connector entrypoint.

The new connector supports generic KV cache group layouts from vLLM's upper-layer kv_cache_config instead of relying on model-specific group mapping. It stores full-attention groups and window-attention groups separately, so prefix-cache load/save behavior matches their different reuse patterns.

Modifications

  • Added ucm/integration/vllm/hma_connector.py.
    • Moved HMA/FAWA-specific logic out of ucm_connector.py.
    • Added KVCacheGroupLayout for flattened per-group KV tensor pointer layout.
    • Added FAWA request metadata and load/dump task dataclasses.
    • Added UCMFAWAConnector.
  • Updated UCMConnector to support HMA.
    • UCMConnector now inherits SupportsHMA.
    • It accepts and forwards kv_cache_config.
    • It auto-selects UCMFAWAConnector when the vLLM KV cache config contains both full-attention and window-attention groups, or when fawa_store is explicitly enabled (see the selection sketch after this list).
  • Implemented generic FA/WA store design.
    • fa_store stores full-attention KV cache groups.
    • wa_store stores window-attention KV cache groups.
    • FA blocks are loaded for every external prefix hit.
    • WA blocks are saved at each canonical block boundary and only loaded for the final matched prefix boundary (see the load-planning sketch after this list).
  • Implemented generic KV cache group handling.
    • KV cache groups are derived directly from vLLM kv_cache_config.kv_cache_groups.
    • Full-attention groups and window-attention groups are partitioned using sliding_window / attention_chunk_size (see the selection sketch after this list).
    • No DeepSeekV4-specific group naming or hardcoded group mapping remains in the FAWA connector logic.
  • Added compressor-state tail block handling.
    • Tail block count is calculated from window size, storage block size, and compressor ratio (see the tail-block sketch after this list).
    • Compressor ratio is read from the group spec or from the model config's compress_ratios.
    • A tail count of 0 falls out naturally when window <= ratio.
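
For reference, a minimal sketch of how the partitioning and auto-selection described above could look. The kv_cache_groups / kv_cache_spec shape follows vLLM's kv_cache_config, while the helper names and the fawa_store flag handling are taken from this PR's description; treat it as an illustration, not the actual connector code:

```python
def partition_groups(kv_cache_config):
    """Split vLLM KV cache groups into full-attention vs. window-attention.

    Illustrative: a group counts as window attention when its spec carries
    a sliding_window or attention_chunk_size; otherwise it is full attention.
    """
    fa_groups, window_groups = [], []
    for idx, group in enumerate(kv_cache_config.kv_cache_groups):
        spec = group.kv_cache_spec
        if getattr(spec, "sliding_window", None) or getattr(
            spec, "attention_chunk_size", None
        ):
            window_groups.append(idx)
        else:
            fa_groups.append(idx)
    return tuple(fa_groups), tuple(window_groups)


def needs_fawa_connector(kv_cache_config, fawa_store_enabled=False):
    """UCMFAWAConnector is chosen when both group kinds are present,
    or when fawa_store is explicitly enabled."""
    fa_groups, window_groups = partition_groups(kv_cache_config)
    return (bool(fa_groups) and bool(window_groups)) or fawa_store_enabled
```

With the layout verified in the test below, this would yield fa_groups=(0,) and window_groups=(1, 2, 3, 4).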
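
The reuse asymmetry between the two stores can be sketched the same way; block identifiers and the boundary argument are illustrative:

```python
def plan_external_load(hit_blocks, final_boundary_block):
    """Illustrative: what each store contributes on an external prefix hit."""
    # fa_store: every externally hit block is loaded, since full-attention
    # KV entries stay valid for any prefix extension.
    fa_loads = list(hit_blocks)
    # wa_store: only the block at the final matched prefix boundary is
    # loaded, because window-attention state is only reusable at the exact
    # position where it was saved.
    wa_loads = [] if final_boundary_block is None else [final_boundary_block]
    return fa_loads, wa_loads
```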
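
Finally, one plausible reading of the tail-block arithmetic; the real formula may differ in boundary handling, but the inputs (window size, storage block size, compressor ratio) and the zero case follow the description above:

```python
import math


def tail_block_count(window_size, block_size, compress_ratio):
    """Illustrative tail-block estimate for a compressed window group.

    Roughly window_size // compress_ratio tokens of window state survive
    compression and must be kept as uncompressed tail blocks.
    """
    if compress_ratio is None:
        return None  # full-attention group: tail handling does not apply
    remaining_tokens = window_size // compress_ratio
    # Collapses to 0 when the window fits inside one compression ratio,
    # matching the "window <= ratio" case above.
    return math.ceil(remaining_tokens / block_size)
```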

Test

  • Static checks:
    • python3 -m py_compile ucm/integration/vllm/ucm_connector.py ucm/integration/vllm/hma_connector.py
    • git diff --check -- ucm/integration/vllm/ucm_connector.py ucm/integration/vllm/hma_connector.py
  • End-to-end validation:
    • Ran /vllm-workspace/offline_inference.py with UCM enabled on DeepSeek-V4-Flash using 4 GPUs.
    • Verified FAWA connector was loaded from hma_connector.py.
    • Verified group config:
      • fa_groups=(0,)
      • window_groups=(1, 2, 3, 4)
      • block_sizes=(256, 64, 64, 4, 8)
      • tail_blocks=(None, 2, 2, 1, 0)
    • Verified first request had no external hit.
    • Verified second request hit 2 external prefix blocks.
    • Inference completed and shut down successfully.

wuhuxiao force-pushed the codex/deepseek-v4-ucm-hit-latency-modelengine branch 2 times, most recently from 40c7703 to b77d328 on May 6, 2026 at 14:21
wuhuxiao changed the title from [codex] Optimize DeepSeek V4 UCM hit path to [Feat] support HMA using full attention store(FA) & window attention store(WA) on May 6, 2026
wuhuxiao force-pushed the codex/deepseek-v4-ucm-hit-latency-modelengine branch from b77d328 to 99184e4 on May 6, 2026 at 14:32