feat: add KV cache quantization args to server#1073
Open
deceptech-packet-ninja wants to merge 2 commits into ml-explore:main from
Conversation
Enables KV cache quantization in mlx_lm.server, closing ml-explore#1043. Batching is disabled when kv_bits is set, since BatchQuantizedKVCache is not yet implemented. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean server plumbing for #1074. The --kv-bits flag is needed for anyone running long-context inference through the API. I would use this immediately for my 122B production server. One minor note: disabling batching when kv_bits is set makes sense for now, but BatchedEngine + quantized KV cache is a natural follow-up, since continuous batching is where long-context memory pressure hits hardest.
Adopt upstream's simplified _is_batchable one-liner while keeping the kv_bits guard that disables batching when KV cache quantization is active. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- Adds --kv-bits, --kv-group-size, and --quantized-kv-start CLI arguments to mlx_lm.server
- Passes them through stream_generate → generate_step, which already supports KV cache quantization
- Disables batching when kv_bits is set, since BatchQuantizedKVCache does not exist yet

Closes #1043.
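For illustration, the new flags and the kwargs dict forwarded to generation could be sketched as below. The flag names and defaults come from this PR; the helper functions themselves are hypothetical stand-ins, not the actual server code.

```python
# Sketch of the CLI plumbing described in the summary (helper names assumed).
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="KV cache quantization flags for mlx_lm.server (sketch)"
    )
    parser.add_argument("--kv-bits", type=int, default=None,
                        help="Bits per element for KV cache quantization "
                             "(None disables quantization)")
    parser.add_argument("--kv-group-size", type=int, default=64,
                        help="Group size for KV cache quantization")
    parser.add_argument("--quantized-kv-start", type=int, default=0,
                        help="Token index at which to start quantizing the cache")
    return parser

def build_kv_kwargs(args):
    # These kwargs are forwarded to stream_generate (and on to generate_step),
    # which already accepts them in mlx_lm.
    return {
        "kv_bits": args.kv_bits,
        "kv_group_size": args.kv_group_size,
        "quantized_kv_start": args.quantized_kv_start,
    }
```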
Motivation
KV cache quantization has been available via the mlx_lm.generate CLI (--kv-bits) since v0.22, but the server has no way to enable it. Users running long-context inference through the OpenAI-compatible API cannot reduce KV cache memory usage.

Changes

mlx_lm/server.py (26 lines added):
- main(): adds --kv-bits (int, default None), --kv-group-size (int, default 64), and --quantized-kv-start (int, default 0)
- _serve_single: constructs a kv_kwargs dict and passes it to stream_generate via **kv_kwargs
- _is_batchable: returns False when kv_bits is set (batched quantized cache is not yet implemented)

Test plan
- python -m mlx_lm.server --model mlx-community/Qwen2.5-0.5B-Instruct-4bit --kv-bits 4 starts without error
- Without --kv-bits, behavior is unchanged
- With --kv-bits, batching is correctly disabled (the sequential path is used)

🤖 Generated with Claude Code
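The forwarding described under Changes might look like the following self-contained sketch. Here stream_generate is a stub that echoes its kwargs so the example runs without mlx_lm; the real function in mlx_lm accepts these kwargs and passes them through to generate_step.

```python
# Sketch of how _serve_single might forward the quantization kwargs.
# `stream_generate` is stubbed so the sketch is self-contained; in mlx_lm it
# accepts kv_bits / kv_group_size / quantized_kv_start as keyword arguments.
def stream_generate(model, prompt, **kwargs):
    # Stand-in: yield a single record echoing the kwargs it received.
    yield {"prompt": prompt, "kv_kwargs": kwargs}

def serve_single(model, prompt, kv_bits=None, kv_group_size=64,
                 quantized_kv_start=0):
    kv_kwargs = {}
    if kv_bits is not None:
        # Only pass quantization settings when quantization is requested,
        # so the default (unquantized) path is untouched.
        kv_kwargs = {
            "kv_bits": kv_bits,
            "kv_group_size": kv_group_size,
            "quantized_kv_start": quantized_kv_start,
        }
    return list(stream_generate(model, prompt, **kv_kwargs))
```

Gating the dict on kv_bits matches the test plan above: without --kv-bits nothing extra is passed, so existing behavior is unchanged.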