[Bug]: mllama AssertionError during kv cache profiling #13929

### Your current environment

Repro command below.

### 🐛 Describe the bug

Attempting to serve `meta-llama/Llama-3.2-11B-Vision-Instruct` with recent vLLM (>=v0.7.3) fails with the error below while `determine_num_available_blocks()` runs during startup:

```
$ vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --max-num-seqs 8
```

```
Traceback (most recent call last):
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 400, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 125, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 77, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 277, in __init__
    self._initialize_kv_caches()
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 426, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
    results = self.collective_rpc("determine_num_available_blocks")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 316, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/utils.py", line 2196, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 341, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 182, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/model_executor/models/mllama.py", line 1392, in forward
    assert actual_len >= last_group_len
```

I have done some investigation but do not have a fix yet. Here is what I have found:
- The error occurs because the dummy encoder sequences constructed for profiling are longer than the actual encoder length computed in mllama; for single-image requests, this means longer than 6404 tokens.
- Serving the model works as long as `max_seq_len / max_num_seqs <= 6404`; with the full sequence length, `--max-num-seqs 21` works.
- I think this bug was introduced in https://github.com/vllm-project/vllm/pull/11427
    - Before this PR, there was a `dummy_encoder_data_for_mllama` function responsible for constructing the dummy data.
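
To make the threshold in the second bullet concrete, here is a rough sketch of the arithmetic. It is not vLLM code: the helper names are made up for illustration, the 6404 limit is the single-image value observed above, and the full sequence length of 131072 is my assumption (it is not stated elsewhere in this report, but it is consistent with 21 being the smallest working value).

```python
import math

# Encoder length observed in this issue for single-image requests.
MAX_ENCODER_TOKENS = 6404

# ASSUMPTION: full context length of the model; not stated in this report.
ASSUMED_FULL_SEQ_LEN = 131072


def profiling_should_pass(max_seq_len: int, max_num_seqs: int,
                          limit: int = MAX_ENCODER_TOKENS) -> bool:
    """Hypothetical check mirroring `max_seq_len / max_num_seqs <= 6404`."""
    return max_seq_len / max_num_seqs <= limit


def min_working_max_num_seqs(max_seq_len: int,
                             limit: int = MAX_ENCODER_TOKENS) -> int:
    """Smallest max_num_seqs that keeps the per-sequence length within the limit."""
    return math.ceil(max_seq_len / limit)


if __name__ == "__main__":
    # The repro command uses --max-num-seqs 8: 131072 / 8 = 16384 > 6404.
    print(profiling_should_pass(ASSUMED_FULL_SEQ_LEN, 8))
    # Smallest value expected to get past profiling under these assumptions.
    print(min_working_max_num_seqs(ASSUMED_FULL_SEQ_LEN))
```

Under these assumptions, `--max-num-seqs 8` from the repro command falls well outside the working range, while 21 is the first value that fits.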


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
