[Bug]: [V1] qwen2-vl broken for video inputs #14528

@robertgshaw2-redhat

Your current environment

As per the comment. Other video models (e.g. llava-next-video) are fine.

🐛 Describe the bug

VLLM_USE_V1=1 pytest -v -s models/decoder_only/vision_language/test_models.py::test_video_models -k qwen2_vl
models/decoder_only/vision_language/test_models.py::test_video_models[qwen2_vl-test_case1] WARNING 03-10 01:17:17 [config.py:2571] Casting torch.bfloat16 to torch.float16.
INFO 03-10 01:17:17 [config.py:576] This model supports multiple tasks: {'score', 'embed', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 03-10 01:17:17 [config.py:1670] Chunked prefill is enabled with max_num_batched_tokens=16384.
WARNING 03-10 01:17:17 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 03-10 01:17:21 [__init__.py:256] Automatically detected platform cuda.
INFO 03-10 01:17:24 [core.py:51] Initializing a V1 LLM engine (v0.7.2) with config: model='Qwen/Qwen2-VL-2B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-VL-2B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2-VL-2B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
INFO 03-10 01:17:24 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 03-10 01:17:24 [__init__.py:32] name=register_dummy_model, value=vllm_add_dummy_model:register
INFO 03-10 01:17:24 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 03-10 01:17:24 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-10 01:17:24 [__init__.py:44] plugin register_dummy_model loaded.
WARNING 03-10 01:17:27 [utils.py:2304] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7715f4e6c0e0>
INFO 03-10 01:17:29 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-10 01:17:29 [cuda.py:215] Using Flash Attention backend on V1 engine.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 03-10 01:17:30 [gpu_model_runner.py:1114] Starting to load model Qwen/Qwen2-VL-2B-Instruct...
WARNING 03-10 01:17:30 [vision.py:94] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
INFO 03-10 01:17:30 [config.py:3173] cudagraph sizes specified by model runner [] is overridden by config []
INFO 03-10 01:17:32 [topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
INFO 03-10 01:17:32 [weight_utils.py:257] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  2.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  3.97it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  3.52it/s]

INFO 03-10 01:17:33 [loader.py:429] Loading weights took 0.58 seconds
INFO 03-10 01:17:33 [gpu_model_runner.py:1126] Model loading took 4.1512 GB and 2.876038 seconds
INFO 03-10 01:17:33 [gpu_model_runner.py:1278] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
INFO 03-10 01:17:42 [kv_cache_utils.py:537] GPU KV cache size: 2,313,664 tokens
INFO 03-10 01:17:42 [kv_cache_utils.py:540] Maximum concurrency for 4,096 tokens per request: 564.86x
INFO 03-10 01:17:42 [core.py:120] init engine (profile, create kv cache, warmup model) took 8.90 seconds
Processed prompts:   0%|                                               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]ERROR 03-10 01:17:44 [core.py:324] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 317, in run_engine_core
ERROR 03-10 01:17:44 [core.py:324]     engine_core.run_busy_loop()
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 351, in run_busy_loop
ERROR 03-10 01:17:44 [core.py:324]     outputs = step_fn()
ERROR 03-10 01:17:44 [core.py:324]               ^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 174, in step
ERROR 03-10 01:17:44 [core.py:324]     output = self.model_executor.execute_model(scheduler_output)
ERROR 03-10 01:17:44 [core.py:324]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/executor/abstract.py", line 80, in execute_model
ERROR 03-10 01:17:44 [core.py:324]     output = self.collective_rpc("execute_model",
ERROR 03-10 01:17:44 [core.py:324]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 03-10 01:17:44 [core.py:324]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 03-10 01:17:44 [core.py:324]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/utils.py", line 2238, in run_method
ERROR 03-10 01:17:44 [core.py:324]     return func(*args, **kwargs)
ERROR 03-10 01:17:44 [core.py:324]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-10 01:17:44 [core.py:324]     return func(*args, **kwargs)
ERROR 03-10 01:17:44 [core.py:324]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/worker/gpu_worker.py", line 226, in execute_model
ERROR 03-10 01:17:44 [core.py:324]     output = self.model_runner.execute_model(scheduler_output)
ERROR 03-10 01:17:44 [core.py:324]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-10 01:17:44 [core.py:324]     return func(*args, **kwargs)
ERROR 03-10 01:17:44 [core.py:324]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/worker/gpu_model_runner.py", line 922, in execute_model
ERROR 03-10 01:17:44 [core.py:324]     self._update_states(scheduler_output)
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/worker/gpu_model_runner.py", line 360, in _update_states
ERROR 03-10 01:17:44 [core.py:324]     MRotaryEmbedding.get_input_positions_tensor(
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/model_executor/layers/rotary_embedding.py", line 1010, in get_input_positions_tensor
ERROR 03-10 01:17:44 [core.py:324]     video_second_per_grid_t = second_per_grid_ts[video_index]
ERROR 03-10 01:17:44 [core.py:324]                               ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324] IndexError: list index out of range
ERROR 03-10 01:17:44 [core.py:324] 
CRITICAL 03-10 01:17:44 [core_client.py:260] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
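
From the traceback, the failure is in `MRotaryEmbedding.get_input_positions_tensor` (rotary_embedding.py:1010), where `second_per_grid_ts[video_index]` is indexed past the end of the list. That suggests `second_per_grid_ts` arrives empty (or with fewer entries than there are videos) when the V1 model runner recomputes M-RoPE positions in `_update_states`. The snippet below is a simplified illustration of the mismatch, not vLLM source and not a proposed patch; the `1.0` default is only there to make the sketch run.

```python
# Simplified illustration of the failing pattern (not vLLM source).
# Assumption: on V1, second_per_grid_ts is empty while video_grid_thw has one entry per video.
video_grid_thw = [[4, 16, 16]]          # one video: (t, h, w) patch grid
second_per_grid_ts: list[float] = []    # expected one float per video, but empty

for video_index, _ in enumerate(video_grid_thw):
    # The real code indexes unconditionally, which raises IndexError here.
    if video_index < len(second_per_grid_ts):
        video_second_per_grid_t = second_per_grid_ts[video_index]
    else:
        video_second_per_grid_t = 1.0  # placeholder default; the actual fix would be to plumb the value through
    print(f"video {video_index}: second_per_grid_t={video_second_per_grid_t}")
```

For reference, a rough standalone reproduction outside the test harness might look like the sketch below. It is untested and makes a few assumptions: `VLLM_USE_V1=1` selects the V1 engine, a dummy numpy clip is accepted as `multi_modal_data["video"]`, and the placeholder tokens match Qwen2-VL's chat template.

```python
# Hypothetical standalone reproduction sketch (untested; mirrors the failing pytest case).
import os
os.environ["VLLM_USE_V1"] = "1"  # force the V1 engine, as in the failing test

import numpy as np
from vllm import LLM, SamplingParams

# Dummy video: 16 RGB frames of 224x224 (any small clip should exercise the video path).
video = np.random.randint(0, 255, size=(16, 224, 224, 3), dtype=np.uint8)

llm = LLM(
    model="Qwen/Qwen2-VL-2B-Instruct",
    max_model_len=4096,
    enforce_eager=True,
    trust_remote_code=True,
)

# Placeholder tokens assumed from the Qwen2-VL chat template.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|video_pad|><|vision_end|>Describe the video.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```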
