
feat: add KV cache quantization args to server#1073

Open
deceptech-packet-ninja wants to merge 2 commits into ml-explore:main from deceptech-packet-ninja:feat/server-kv-bits

Conversation

@deceptech-packet-ninja

Summary

  • Adds --kv-bits, --kv-group-size, and --quantized-kv-start CLI arguments to mlx_lm.server
  • Wires these through to stream_generate and generate_step, which already support KV cache quantization
  • Disables batching when kv_bits is set since BatchQuantizedKVCache does not exist yet

Closes #1043.

Motivation

KV cache quantization has been available via mlx_lm.generate CLI (--kv-bits) since v0.22, but the server has no way to enable it. Users running long-context inference through the OpenAI-compatible API cannot reduce KV cache memory usage.

Changes

mlx_lm/server.py (26 lines added):

  • Three new CLI arguments in main(): --kv-bits (int, default None), --kv-group-size (int, default 64), --quantized-kv-start (int, default 0)
  • In _serve_single: constructs kv_kwargs dict and passes to stream_generate via **kv_kwargs
  • In _is_batchable: returns False when kv_bits is set (batched quantized cache NYI)
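A minimal sketch of the wiring described above. The flag names and defaults come from the PR text; the helper function names here are illustrative (in the actual change, the arguments are registered in `main()` and the dict is built in `_serve_single`):

```python
import argparse


def add_kv_cache_args(parser: argparse.ArgumentParser) -> None:
    # The three new server flags; defaults mirror the PR description.
    parser.add_argument("--kv-bits", type=int, default=None,
                        help="Bits for KV cache quantization (None disables it)")
    parser.add_argument("--kv-group-size", type=int, default=64,
                        help="Group size for KV cache quantization")
    parser.add_argument("--quantized-kv-start", type=int, default=0,
                        help="Step at which to begin quantizing the KV cache")


def build_kv_kwargs(args: argparse.Namespace) -> dict:
    # Collected into a dict and splatted into stream_generate as **kv_kwargs.
    return {
        "kv_bits": args.kv_bits,
        "kv_group_size": args.kv_group_size,
        "quantized_kv_start": args.quantized_kv_start,
    }
```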

Test plan

  • python -m mlx_lm.server --model mlx-community/Qwen2.5-0.5B-Instruct-4bit --kv-bits 4 starts without error
  • Sending a request generates correct output
  • Without --kv-bits, behavior is unchanged
  • With --kv-bits, batching is correctly disabled (sequential path used)
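The batching guard in the last bullet can be sketched as follows (a standalone illustration, not the actual `_is_batchable` body; `other_checks_pass` stands in for whatever other conditions the server already evaluates):

```python
def is_batchable(kv_bits=None, other_checks_pass=True):
    # Any request with kv_bits set is routed to the sequential generation
    # path, since BatchQuantizedKVCache does not exist yet.
    return kv_bits is None and other_checks_pass
```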

🤖 Generated with Claude Code

Enables KV cache quantization in mlx_lm.server, closing ml-explore#1043.
Batching disabled when kv_bits is set (BatchQuantizedKVCache NYI).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Thump604

Clean server plumbing for #1074. The --kv-bits flag is needed for anyone running long-context inference through the API. I would use this immediately for my 122B production server.

One minor note: disabling batching when kv_bits is set makes sense for now, but BatchedEngine + quantized KV cache is a natural follow-up since continuous batching is where long-context memory pressure hits hardest.

Adopt upstream's simplified _is_batchable one-liner while keeping
the kv_bits guard that disables batching when KV cache quantization
is active.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Add KV cache quantization support to server
