[Bug] Prefill Failure of Qwen3.5 Model Using KV Cache in the mlx‑lm Framework #37
Description
Hello everyone,
I encountered a crash that reliably occurs during the prefill stage when using the Qwen3.5‑27B model in the mlx‑lm framework. Even the simplest multi‑turn conversation crashes when reusing the KV cache.
Key error information
When loading the model, mlx‑lm prints the following warning:
You are using a model of type qwen3_5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
During the second round of conversation generation, the program crashes and throws the following TypeError:
Traceback (most recent call last):
File "/Users/tom/code/dev/ask2.3/tests/demo/snowball_kv_cache_manager.py", line 415, in <module>
main()
File "/Users/tom/code/dev/ask2.3/tests/demo/snowball_kv_cache_manager.py", line 383, in main
run_conversation_round(
...
File "/Users/tom/venvs/mlx_env_311/lib/python3.11/site-packages/mlx_lm/models/qwen3_5.py", line 251, in __call__
ssm_mask = create_ssm_mask(hidden_states, cache[self.ssm_idx])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/tom/venvs/mlx_env_311/lib/python3.11/site-packages/mlx_lm/models/base.py", line 60, in create_ssm_mask
return cache.make_mask(h.shape[1])
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/tom/code/dev/ask2.3/tests/demo/make_prompt_cache.py", line 402, in make_mask
return create_attention_mask(
^^^^^^^^^^^^^^^^^^^^^^
TypeError: create_attention_mask() missing 2 required positional arguments: 'return_array' and 'window_size'
Minimal reproducible code
Below is a simplified script that reliably reproduces the issue. It simulates a simple two‑turn conversation that crashes when reusing the KV Cache in the second turn.
import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache, KVCache, QuantizedKVCache
from mlx_lm.sample_utils import make_sampler
# Model path
model_path = "/Users/tom/models/qwen3.5-27b-7bit"
# Load model
print(f"📦 Loading model from: {model_path}...")
model, tokenizer = load(model_path)
print("✅ Model loaded\n")
# Create KV Cache
# Both standard KVCache and QuantizedKVCache trigger the error
# kv_cache = make_prompt_cache(model)
num_layers = len(model.layers)
kv_cache = [KVCache() for _ in range(num_layers)]
sampler = make_sampler(temp=0.3)
messages = [{"role": "system", "content": "You are a helpful assistant."}]
# --- Round 1 ---
print("=" * 70)
print("🔵 Round 1")
print("=" * 70)
messages.append({"role": "user", "content": "Hello"})
prompt_r1 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response_r1 = generate(model, tokenizer, prompt=prompt_r1, prompt_cache=kv_cache, verbose=True)
messages.append({"role": "assistant", "content": response_r1})
# --- Round 2 (reuse kv_cache) ---
print("\n" + "=" * 70)
print("🟢 Round 2 (This will crash)")
print("=" * 70)
messages.append({"role": "user", "content": "Who are you?"})
prompt_r2 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Error occurs here
response_r2 = generate(model, tokenizer, prompt=prompt_r2, prompt_cache=kv_cache, verbose=True)
print("This line will not be reached.")
Problem analysis
From the stack trace, the root cause appears to be that the model implementation in qwen3_5.py builds its SSM (State‑Space Model) mask by calling cache.make_mask() with only one argument:
# mlx_lm/models/qwen3_5.py
ssm_mask = create_ssm_mask(hidden_states, cache[self.ssm_idx])
# mlx_lm/models/base.py
def create_ssm_mask(h, cache):
return cache.make_mask(h.shape[1]) # <--- only one argument passed
However, the make_mask method of the standard KVCache (and QuantizedKVCache) in this setup requires two additional positional arguments, return_array and window_size, which have no default values. This mismatch causes the TypeError.
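The mismatch can be illustrated with small stand-in classes (the class names below are hypothetical placeholders, not the real mlx‑lm API): a caller that passes one argument works only with a cache whose make_mask accepts one argument.

```python
class SSMCache:
    # The model-side caller assumes make_mask takes a single length argument.
    def make_mask(self, length):
        return f"ssm-mask({length})"


class TwoArgKVCache:
    # A make_mask that, like the signature reported in the traceback,
    # requires two extra positional arguments with no defaults.
    def make_mask(self, length, return_array, window_size):
        return f"attn-mask({length}, {return_array}, {window_size})"


def create_ssm_mask(seq_len, cache):
    # Mirrors the shape of the call in mlx_lm.models.base.create_ssm_mask:
    # only the sequence length is passed through.
    return cache.make_mask(seq_len)


# Compatible signature: succeeds.
print(create_ssm_mask(8, SSMCache()))

# Incompatible signature: raises the same class of error as the report.
try:
    create_ssm_mask(8, TwoArgKVCache())
except TypeError as e:
    print(e)  # mentions 2 missing required positional arguments
```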
Although I tried modifying make_prompt_cache.py to give make_mask default values for these arguments, this appears to be a deeper compatibility issue that should be fixed at the model or framework level.
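For reference, the default-value workaround I attempted looks roughly like the sketch below (a minimal stand-in, not the real mlx‑lm cache class; the correct default values would depend on mlx‑lm internals):

```python
class KVCacheLike:
    # Hypothetical shim: giving the two extra parameters defaults lets a
    # one-argument caller succeed. The chosen defaults (False, None) are
    # assumptions for illustration only.
    def make_mask(self, length, return_array=False, window_size=None):
        return {
            "length": length,
            "return_array": return_array,
            "window_size": window_size,
        }


def create_ssm_mask(seq_len, cache):
    # The model-side caller still passes only the sequence length,
    # and the defaults fill in the rest.
    return cache.make_mask(seq_len)


mask = create_ssm_mask(8, KVCacheLike())
print(mask)
```

This silences the TypeError in my local copy, but since I cannot verify what values the model actually expects for return_array and window_size, it may produce an incorrect mask rather than a correct fix.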
I hope this information helps you quickly locate the problem. Qwen3.5 is a very impressive model, and I look forward to using it smoothly on mlx‑lm!
Thank you!
Reproduction
(empty)
Logs
Environment Information
- Operating System: macOS (Apple Silicon)
- Python version: 3.11
- mlx‑lm version: latest (installed from source via pip install .)
- Model path: /Users/tom/models/qwen3.5-27b-7bit
Known Issue
- The issue has not already been addressed in the Documentation, Issues, or Discussions.