[Bug] Prefill Failure of Qwen3.5 Model Using KV Cache in the mlx‑lm Framework #37
Description
Hello everyone,
I encountered a crash that reliably occurs during the prefill stage when using the Qwen3.5‑27B model in the mlx‑lm framework. Even the simplest multi‑turn conversation crashes when reusing the KV cache.
Key error information
When loading the model, mlx‑lm prints the following warning:
You are using a model of type qwen3_5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
During the second round of conversation generation, the program crashes and throws the following TypeError:
Traceback (most recent call last):
File "/Users/tom/code/dev/ask2.3/tests/demo/snowball_kv_cache_manager.py", line 415, in <module>
main()
File "/Users/tom/code/dev/ask2.3/tests/demo/snowball_kv_cache_manager.py", line 383, in main
run_conversation_round(
...
File "/Users/tom/venvs/mlx_env_311/lib/python3.11/site-packages/mlx_lm/models/qwen3_5.py", line 251, in __call__
ssm_mask = create_ssm_mask(hidden_states, cache[self.ssm_idx])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/tom/venvs/mlx_env_311/lib/python3.11/site-packages/mlx_lm/models/base.py", line 60, in create_ssm_mask
return cache.make_mask(h.shape[1])
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/tom/code/dev/ask2.3/tests/demo/make_prompt_cache.py", line 402, in make_mask
return create_attention_mask(
^^^^^^^^^^^^^^^^^^^^^^
TypeError: create_attention_mask() missing 2 required positional arguments: 'return_array' and 'window_size'
Minimal reproducible code
Below is a simplified script that reliably reproduces the issue. It simulates a simple two‑turn conversation that crashes when reusing the KV Cache in the second turn.
import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache, KVCache, QuantizedKVCache
from mlx_lm.sample_utils import make_sampler
# Model path
model_path = "/Users/tom/models/qwen3.5-27b-7bit"
# Load model
print(f"📦 Loading model from: {model_path}...")
model, tokenizer = load(model_path)
print("✅ Model loaded\n")
# Create KV Cache
# Both standard KVCache and QuantizedKVCache trigger the error
# kv_cache = make_prompt_cache(model)
num_layers = len(model.layers)
kv_cache = [KVCache() for _ in range(num_layers)]
sampler = make_sampler(temp=0.3)
messages = [{"role": "system", "content": "You are a helpful assistant."}]
# --- Round 1 ---
print("=" * 70)
print("🔵 Round 1")
print("=" * 70)
messages.append({"role": "user", "content": "Hello"})
prompt_r1 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response_r1 = generate(model, tokenizer, prompt=prompt_r1, prompt_cache=kv_cache, verbose=True)
messages.append({"role": "assistant", "content": response_r1})
# --- Round 2 (reuse kv_cache) ---
print("\n" + "=" * 70)
print("🟢 Round 2 (This will crash)")
print("=" * 70)
messages.append({"role": "user", "content": "Who are you?"})
prompt_r2 = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Error occurs here
response_r2 = generate(model, tokenizer, prompt=prompt_r2, prompt_cache=kv_cache, verbose=True)
print("This line will not be reached.")
Problem analysis
From the stack trace, the root cause appears to be that the model implementation in qwen3_5.py builds its SSM (State‑Space Model) mask by calling cache.make_mask() with only one argument:
# mlx_lm/models/qwen3_5.py
ssm_mask = create_ssm_mask(hidden_states, cache[self.ssm_idx])
# mlx_lm/models/base.py
def create_ssm_mask(h, cache):
return cache.make_mask(h.shape[1]) # <--- only one argument passed
However, the make_mask method of the standard KVCache (and QuantizedKVCache) in this setup requires two additional positional arguments, return_array and window_size, which have no default values. This mismatch causes the TypeError.
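The mismatch can be illustrated with small stand-in classes (the class names below are hypothetical placeholders, not the real mlx‑lm API): a caller that passes one argument works only with a cache whose make_mask accepts one argument.

```python
class SSMCache:
    # The model-side caller assumes make_mask takes a single length argument.
    def make_mask(self, length):
        return f"ssm-mask({length})"


class TwoArgKVCache:
    # A make_mask that, like the signature reported in the traceback,
    # requires two extra positional arguments with no defaults.
    def make_mask(self, length, return_array, window_size):
        return f"attn-mask({length}, {return_array}, {window_size})"


def create_ssm_mask(seq_len, cache):
    # Mirrors the shape of the call in mlx_lm.models.base.create_ssm_mask:
    # only the sequence length is passed through.
    return cache.make_mask(seq_len)


# Compatible signature: succeeds.
print(create_ssm_mask(8, SSMCache()))

# Incompatible signature: raises the same class of error as the report.
try:
    create_ssm_mask(8, TwoArgKVCache())
except TypeError as e:
    print(e)  # mentions 2 missing required positional arguments
```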
Although I tried modifying make_prompt_cache.py to give make_mask default values for these arguments, this appears to be a deeper compatibility issue that should be fixed at the model or framework level.
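For reference, the default-value workaround I attempted looks roughly like the sketch below (a minimal stand-in, not the real mlx‑lm cache class; the correct default values would depend on mlx‑lm internals):

```python
class KVCacheLike:
    # Hypothetical shim: giving the two extra parameters defaults lets a
    # one-argument caller succeed. The chosen defaults (False, None) are
    # assumptions for illustration only.
    def make_mask(self, length, return_array=False, window_size=None):
        return {
            "length": length,
            "return_array": return_array,
            "window_size": window_size,
        }


def create_ssm_mask(seq_len, cache):
    # The model-side caller still passes only the sequence length,
    # and the defaults fill in the rest.
    return cache.make_mask(seq_len)


mask = create_ssm_mask(8, KVCacheLike())
print(mask)
```

This silences the TypeError in my local copy, but since I cannot verify what values the model actually expects for return_array and window_size, it may produce an incorrect mask rather than a correct fix.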
I hope this information helps you quickly locate the problem. Qwen3.5 is a very impressive model, and I look forward to using it smoothly on mlx‑lm!
Thank you!
Reproduction
(empty)
Logs
Environment Information
- Operating System: macOS (Apple Silicon)
- Python version: 3.11
- mlx‑lm version: latest (installed from source via pip install .)
- Model path: /Users/tom/models/qwen3.5-27b-7bit
Known Issue
- The issue has not already been addressed in the Documentation, Issues, or Discussions.