Repro command below.
$ vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --max-num-seqs 8
Traceback (most recent call last):
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 400, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 125, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 77, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 277, in __init__
    self._initialize_kv_caches()
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 426, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
    results = self.collective_rpc("determine_num_available_blocks")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 316, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/utils.py", line 2196, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 341, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 182, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/model_executor/models/mllama.py", line 1392, in forward
    assert actual_len >= last_group_len
Your current environment
Repro command below.
🐛 Describe the bug
Attempting to serve meta-llama/Llama-3.2-11B-Vision-Instruct with recent vLLM (>=v0.7.3) results in the error shown in the traceback, raised during the execution of determine_num_available_blocks() at bootup.
I have done some investigations, but do not have a fix yet. Here is what I have found:
- max_seq_len / max_num_seqs <= 6404; with the full seq length, --max-num-seqs=21 works
- dummy_encoder_data_for_mllama function responsible for constructing the dummy data
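To make the bound above easier to check against a given setup, here is a minimal arithmetic sketch. It assumes the model's full context length is 131072 tokens, which is my assumption and is not stated in this report; the helper names are hypothetical and are not part of vLLM. Under that assumption, --max-num-seqs=21 falls within the quoted bound while the repro value of 8 does not, which matches the observations above.

```python
# Assumption (not from the report): full context length of the model.
ASSUMED_MAX_SEQ_LEN = 131072
# Bound quoted in the report: max_seq_len / max_num_seqs <= 6404.
REPORTED_BOUND = 6404


def tokens_per_seq(max_seq_len: int, max_num_seqs: int) -> float:
    """Ratio compared against the bound in the report."""
    return max_seq_len / max_num_seqs


def within_reported_bound(max_seq_len: int, max_num_seqs: int,
                          bound: int = REPORTED_BOUND) -> bool:
    """True when max_seq_len / max_num_seqs <= bound."""
    return tokens_per_seq(max_seq_len, max_num_seqs) <= bound


if __name__ == "__main__":
    # 8 is the value from the repro command; 21 is the value reported to work.
    for n in (8, 20, 21):
        ratio = tokens_per_seq(ASSUMED_MAX_SEQ_LEN, n)
        ok = within_reported_bound(ASSUMED_MAX_SEQ_LEN, n)
        print(f"max_num_seqs={n}: ratio={ratio:.1f}, within bound={ok}")
```

If your max_seq_len differs (for example because --max-model-len is set), substitute it for ASSUMED_MAX_SEQ_LEN before comparing.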