
feat: add KV cache quantization args to server#1073

Open
deceptech-packet-ninja wants to merge 2 commits into ml-explore:main from deceptech-packet-ninja:feat/server-kv-bits

Conversation

@deceptech-packet-ninja

Summary

  • Adds --kv-bits, --kv-group-size, and --quantized-kv-start CLI arguments to mlx_lm.server
  • Wires these through to stream_generate and generate_step, which already support KV cache quantization
  • Disables batching when kv_bits is set since BatchQuantizedKVCache does not exist yet

Closes #1043.

Motivation

KV cache quantization has been available via mlx_lm.generate CLI (--kv-bits) since v0.22, but the server has no way to enable it. Users running long-context inference through the OpenAI-compatible API cannot reduce KV cache memory usage.

Changes

mlx_lm/server.py (26 lines added):

  • Three new CLI arguments in main(): --kv-bits (int, default None), --kv-group-size (int, default 64), --quantized-kv-start (int, default 0)
  • In _serve_single: constructs kv_kwargs dict and passes to stream_generate via **kv_kwargs
  • In _is_batchable: returns False when kv_bits is set (batched quantized cache NYI)
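A minimal sketch of the wiring described above. The flag names and defaults come from the PR text; the helper function names here are illustrative (in the actual change, the arguments are registered in `main()` and the dict is built in `_serve_single`):

```python
import argparse


def add_kv_cache_args(parser: argparse.ArgumentParser) -> None:
    # The three new server flags; defaults mirror the PR description.
    parser.add_argument("--kv-bits", type=int, default=None,
                        help="Bits for KV cache quantization (None disables it)")
    parser.add_argument("--kv-group-size", type=int, default=64,
                        help="Group size for KV cache quantization")
    parser.add_argument("--quantized-kv-start", type=int, default=0,
                        help="Step at which to begin quantizing the KV cache")


def build_kv_kwargs(args: argparse.Namespace) -> dict:
    # Collected into a dict and splatted into stream_generate as **kv_kwargs.
    return {
        "kv_bits": args.kv_bits,
        "kv_group_size": args.kv_group_size,
        "quantized_kv_start": args.quantized_kv_start,
    }
```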

Test plan

  • python -m mlx_lm.server --model mlx-community/Qwen2.5-0.5B-Instruct-4bit --kv-bits 4 starts without error
  • Sending a request generates correct output
  • Without --kv-bits, behavior is unchanged
  • With --kv-bits, batching is correctly disabled (sequential path used)
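The batching guard in the last bullet can be sketched as follows (a standalone illustration, not the actual `_is_batchable` body; `other_checks_pass` stands in for whatever other conditions the server already evaluates):

```python
def is_batchable(kv_bits=None, other_checks_pass=True):
    # Any request with kv_bits set is routed to the sequential generation
    # path, since BatchQuantizedKVCache does not exist yet.
    return kv_bits is None and other_checks_pass
```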

🤖 Generated with Claude Code

Enables KV cache quantization in mlx_lm.server, closing ml-explore#1043.
Batching disabled when kv_bits is set (BatchQuantizedKVCache NYI).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Thump604

Clean server plumbing for #1074. The --kv-bits flag is needed for anyone running long-context inference through the API. I would use this immediately for my 122B production server.

One minor note: disabling batching when kv_bits is set makes sense for now, but BatchedEngine + quantized KV cache is a natural follow-up since continuous batching is where long-context memory pressure hits hardest.

Adopt upstream's simplified _is_batchable one-liner while keeping
the kv_bits guard that disables batching when KV cache quantization
is active.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Add KV cache quantization support to server
