
Fix paged-attention KV cache dtype + size accounting (issue #119)#125

Merged
ericcurtin merged 7 commits into vllm-project:main from LxYuan0420:fix/issue-119-paged-parity
Mar 4, 2026

Conversation

@LxYuan0420
Collaborator

This PR is:

Notes:

  • tests/test_metal_kernel_paged.py::test_batched_decode_matches now passes.
  • tests/test_metal_kernel_paged.py::test_greedy_output_matches remains xfailed (tracked in #119: Metal paged-attention parity mismatch vs standard path). This is a remaining single-request greedy parity mismatch between the paged-kernel path and the standard path; fixing it likely requires deeper kernel/offset-semantics work, so I'm keeping it out of this PR to keep scope tight.
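For reference, the remaining parity gap can be kept visible without failing CI via a non-strict xfail marker. This is an illustrative sketch only: the test body and reason string are placeholders, not the repo's actual test code.

```python
import pytest

# Placeholder sketch of tracking a known mismatch with a non-strict xfail;
# the real test compares greedy decode tokens from both attention paths.
@pytest.mark.xfail(
    reason="paged vs standard greedy parity mismatch (#119)",
    strict=False,  # an unexpected pass will not fail the suite
)
def test_greedy_output_matches():
    raise AssertionError("paths diverge")  # stand-in for the real comparison
```

With `strict=False`, the test reports XFAIL while the mismatch persists and XPASS once it is fixed, which is when the marker should be removed.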

Quick manual smoke test:

Terminal 1:

vllm serve Qwen/Qwen3-0.6B --host 127.0.0.1 --port 8000 --max-model-len 2048

Terminal 2 (single request):

curl -fsS http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Write a 2-sentence apple story."}],"max_tokens":512,"temperature":0.8}' \
| jq -r '.choices[0].message.content'

Terminal 2 (concurrent 4 requests):

for i in 1 2 3 4; do
  (
    echo "===== req $i ====="
    curl -fsS http://127.0.0.1:8000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"Write a 2-sentence apple story (${i}).\"}],\"max_tokens\":256,\"temperature\":0.8}" \
    | jq -r '.choices[0].message.content'
    echo
  ) &
done
wait

Related: #119

@LxYuan0420 LxYuan0420 requested a review from ericcurtin March 1, 2026 08:06
@LxYuan0420 LxYuan0420 self-assigned this Mar 1, 2026
@WindChimeRan
Collaborator

@LxYuan0420
Thanks for tracking this down and fixing it!

The dtype fix looks good to me. The root cause is clear (hardcoded float16 when the model's actual dtype is bfloat16). The kernel already supports all three float types natively, so this was purely a Python-side plumbing issue.

One suggestion: we now have a diagnostic tool in #127 (tools/avg_gen_length.py) that runs offline inference on ShareGPT prompts and reports response length statistics (mean/std). It would be great to run it before and after this fix to quantify the improvement, specifically comparing the paged path (VLLM_METAL_USE_PAGED_ATTENTION=1) against the non-paged baseline. If the distributions align more closely after the fix, that's a strong quantitative signal beyond the existing test assertions.

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
Keeps the fallback path backward-compatible with the historical paged-attention default and avoids silently changing behavior when dtype inference fails (e.g., unexpected model structure or quantized weights).

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
Why: avoids allocating temporary tensors just to compute element size; clearer and cheaper.
Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
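The element-size change described in the commit note above can be sketched from dtype itemsize alone, with no throwaway tensor allocation. The function name, shape parameters, and itemsize table here are illustrative assumptions, not the repo's actual code.

```python
# Bytes per dtype element; a plain lookup replaces allocating a temporary
# tensor just to read its element size.
ITEMSIZE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4}

def kv_cache_bytes_per_block(block_size, num_kv_heads, head_dim, dtype):
    """Hypothetical size-accounting helper for one paged KV cache block."""
    # Factor of 2 covers the separate K and V halves of the cache.
    return 2 * block_size * num_kv_heads * head_dim * ITEMSIZE_BYTES[dtype]
```

Getting the dtype right matters here too: with a bfloat16 model accounted as float32, the block budget would be off by a factor of two.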
@LxYuan0420 LxYuan0420 force-pushed the fix/issue-119-paged-parity branch from a75ea44 to 8b578cb Compare March 2, 2026 16:40
@LxYuan0420
Collaborator Author

Good point. I ran a small end-to-end sanity check with tools/avg_gen_length.py (ShareGPT, fixed seed) and it completes without crashing on my side. For meaningful numbers, could you paste the exact command + summary table you’re using so we can standardize on that as the reference?
@WindChimeRan

@WindChimeRan
Collaborator

WindChimeRan commented Mar 3, 2026

Findings

  • --max-num-seqs appears not to take effect on the mlx_lm path. In my runs, the script seems to ignore this flag and proceeds with its own batching behavior.
  • The response-length distributions for main-branch mlx_lm and paged KV are currently very similar under this setup.
  • With the current experimental setting, the results are inconclusive as a signal for quality/regression detection.

This may indicate we need a more discriminative setup (e.g., a different dataset, longer decoding lengths, or settings that amplify divergence if it exists).

As a simple short-term check, we could also manually inspect a small sample of responses side by side. If the response quality improves after the fix, or if the paged and non-paged paths look qualitatively similar, that would still be useful supporting evidence.

e.g.,

what's the capital of France?
write a quicksort in python
...

Experiment

Without this patch, on the main branch:

# mlx_lm path
python tools/avg_gen_length.py
============================================================
  max_num_seqs      N   Mean tokens        Std
------------------------------------------------------------
             1    100         244.8       33.4
             8    100         243.6       36.5
============================================================
# paged kv  path
# batch-size-1 run omitted: it takes too long
VLLM_METAL_USE_PAGED_ATTENTION=1 VLLM_METAL_MEMORY_FRACTION=0.4 python tools/avg_gen_length.py --max-num-seqs 8
============================================================
  max_num_seqs      N   Mean tokens        Std
------------------------------------------------------------
             8    100         243.5       36.6
============================================================

@LxYuan0420
Collaborator Author

@WindChimeRan Was the "before" run using the same settings (e.g., VLLM_METAL_MEMORY_FRACTION=0.4)?

@LxYuan0420
Collaborator Author

I think the current result is good enough to show stability under pressure. This PR is a bug fix, not a performance improvement: it fixes the block-exhaustion failure (RuntimeError: Not enough free blocks) by aligning allocation/preemption behavior with scheduler-driven recompute.

@ericcurtin PTAL

@ericcurtin
Collaborator

ericcurtin commented Mar 4, 2026

LGTM, the dtype fix is clean and the root cause (hardcoded float16 vs actual model dtype) is clear. The KvCacheDtypeInference abstraction is well-scoped and the fallback behavior is sensible. Merging.
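A hedged sketch of the fallback behavior praised above, in the spirit of the KvCacheDtypeInference abstraction: the function name and exact inference rule here are illustrative assumptions, not the repo's actual API.

```python
FALLBACK_KV_DTYPE = "float16"  # historical paged-attention default

def infer_kv_cache_dtype(param_dtypes):
    """Infer the KV cache dtype from the model's parameter dtype names.

    Hypothetical helper: falls back to float16 when the parameters are
    missing, mixed, or not a supported float type (e.g., quantized
    weights), so failure cases keep the historical default instead of
    silently changing behavior.
    """
    supported = {"float16", "bfloat16", "float32"}
    seen = {d for d in param_dtypes if d in supported}
    return seen.pop() if len(seen) == 1 else FALLBACK_KV_DTYPE
```

Under this rule a bfloat16 model gets a bfloat16 KV cache (the bug fixed here), while anything ambiguous degrades to the old hardcoded behavior.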

@ericcurtin ericcurtin merged commit 59b9be4 into vllm-project:main Mar 4, 2026
5 checks passed
LxYuan0420 added a commit that referenced this pull request Mar 11, 2026
This PR is:
- To remove a stale `xfail` on `test_greedy_output_matches` that was
originally added for issue #119.
- To align test expectation with current `main` behavior after
paged-path fixes already merged.
- To keep parity tracking accurate while leaving batched behavior to its
own tracking path.

## Context

Issue #119 reported token mismatch parity failures between:
- standard MLX KV cache path, and
- Metal paged-attention path.

Since then, two key fixes landed:
- #125 corrected paged KV cache dtype inference/fallback behavior and KV
cache size accounting used by paged memory/block calculations.
- #136 replaced the HF/PyTorch kernel-bridge path with native MLX +
inline Metal JIT dispatch (`get_ops`/nanobind), removing cross-framework
bridge behavior from paged execution.

With those changes, the old greedy mismatch from #119 no longer
reproduces on `main`, so the greedy `xfail` is stale.

## Verification

```bash
pytest -q tests/test_metal_kernel_paged.py::TestMetalKernelPagedVsStandard::test_greedy_output_matches -s
pytest -m slow -q tests/test_metal_kernel_paged.py
```

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>