[Bug]: [V1] qwen2-vl broken for video inputs #14528

@robertgshaw2-redhat

Your current environment

As per the comment. Other video models (e.g. llava-next-video) are fine.

🐛 Describe the bug

VLLM_USE_V1=1 pytest -v -s models/decoder_only/vision_language/test_models.py::test_video_models -k qwen2_vl
models/decoder_only/vision_language/test_models.py::test_video_models[qwen2_vl-test_case1] WARNING 03-10 01:17:17 [config.py:2571] Casting torch.bfloat16 to torch.float16.
INFO 03-10 01:17:17 [config.py:576] This model supports multiple tasks: {'score', 'embed', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 03-10 01:17:17 [config.py:1670] Chunked prefill is enabled with max_num_batched_tokens=16384.
WARNING 03-10 01:17:17 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 03-10 01:17:21 [__init__.py:256] Automatically detected platform cuda.
INFO 03-10 01:17:24 [core.py:51] Initializing a V1 LLM engine (v0.7.2) with config: model='Qwen/Qwen2-VL-2B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-VL-2B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2-VL-2B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
INFO 03-10 01:17:24 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 03-10 01:17:24 [__init__.py:32] name=register_dummy_model, value=vllm_add_dummy_model:register
INFO 03-10 01:17:24 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 03-10 01:17:24 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-10 01:17:24 [__init__.py:44] plugin register_dummy_model loaded.
WARNING 03-10 01:17:27 [utils.py:2304] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7715f4e6c0e0>
INFO 03-10 01:17:29 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-10 01:17:29 [cuda.py:215] Using Flash Attention backend on V1 engine.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 03-10 01:17:30 [gpu_model_runner.py:1114] Starting to load model Qwen/Qwen2-VL-2B-Instruct...
WARNING 03-10 01:17:30 [vision.py:94] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
INFO 03-10 01:17:30 [config.py:3173] cudagraph sizes specified by model runner [] is overridden by config []
INFO 03-10 01:17:32 [topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
INFO 03-10 01:17:32 [weight_utils.py:257] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  2.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  3.97it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  3.52it/s]

INFO 03-10 01:17:33 [loader.py:429] Loading weights took 0.58 seconds
INFO 03-10 01:17:33 [gpu_model_runner.py:1126] Model loading took 4.1512 GB and 2.876038 seconds
INFO 03-10 01:17:33 [gpu_model_runner.py:1278] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
INFO 03-10 01:17:42 [kv_cache_utils.py:537] GPU KV cache size: 2,313,664 tokens
INFO 03-10 01:17:42 [kv_cache_utils.py:540] Maximum concurrency for 4,096 tokens per request: 564.86x
INFO 03-10 01:17:42 [core.py:120] init engine (profile, create kv cache, warmup model) took 8.90 seconds
Processed prompts:   0%|                                               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]ERROR 03-10 01:17:44 [core.py:324] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 317, in run_engine_core
ERROR 03-10 01:17:44 [core.py:324]     engine_core.run_busy_loop()
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 351, in run_busy_loop
ERROR 03-10 01:17:44 [core.py:324]     outputs = step_fn()
ERROR 03-10 01:17:44 [core.py:324]               ^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 174, in step
ERROR 03-10 01:17:44 [core.py:324]     output = self.model_executor.execute_model(scheduler_output)
ERROR 03-10 01:17:44 [core.py:324]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/executor/abstract.py", line 80, in execute_model
ERROR 03-10 01:17:44 [core.py:324]     output = self.collective_rpc("execute_model",
ERROR 03-10 01:17:44 [core.py:324]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 03-10 01:17:44 [core.py:324]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 03-10 01:17:44 [core.py:324]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/utils.py", line 2238, in run_method
ERROR 03-10 01:17:44 [core.py:324]     return func(*args, **kwargs)
ERROR 03-10 01:17:44 [core.py:324]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-10 01:17:44 [core.py:324]     return func(*args, **kwargs)
ERROR 03-10 01:17:44 [core.py:324]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/worker/gpu_worker.py", line 226, in execute_model
ERROR 03-10 01:17:44 [core.py:324]     output = self.model_runner.execute_model(scheduler_output)
ERROR 03-10 01:17:44 [core.py:324]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-10 01:17:44 [core.py:324]     return func(*args, **kwargs)
ERROR 03-10 01:17:44 [core.py:324]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/worker/gpu_model_runner.py", line 922, in execute_model
ERROR 03-10 01:17:44 [core.py:324]     self._update_states(scheduler_output)
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/v1/worker/gpu_model_runner.py", line 360, in _update_states
ERROR 03-10 01:17:44 [core.py:324]     MRotaryEmbedding.get_input_positions_tensor(
ERROR 03-10 01:17:44 [core.py:324]   File "/home/rshaw/vllm/vllm/model_executor/layers/rotary_embedding.py", line 1010, in get_input_positions_tensor
ERROR 03-10 01:17:44 [core.py:324]     video_second_per_grid_t = second_per_grid_ts[video_index]
ERROR 03-10 01:17:44 [core.py:324]                               ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
ERROR 03-10 01:17:44 [core.py:324] IndexError: list index out of range
ERROR 03-10 01:17:44 [core.py:324] 
CRITICAL 03-10 01:17:44 [core_client.py:260] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
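
From the traceback, the failure is in `MRotaryEmbedding.get_input_positions_tensor` (rotary_embedding.py:1010), where `second_per_grid_ts[video_index]` is indexed past the end of the list. That suggests `second_per_grid_ts` arrives empty (or with fewer entries than there are videos) when the V1 model runner recomputes M-RoPE positions in `_update_states`. The snippet below is a simplified illustration of the mismatch, not vLLM source and not a proposed patch; the `1.0` default is only there to make the sketch run.

```python
# Simplified illustration of the failing pattern (not vLLM source).
# Assumption: on V1, second_per_grid_ts is empty while video_grid_thw has one entry per video.
video_grid_thw = [[4, 16, 16]]          # one video: (t, h, w) patch grid
second_per_grid_ts: list[float] = []    # expected one float per video, but empty

for video_index, _ in enumerate(video_grid_thw):
    # The real code indexes unconditionally, which raises IndexError here.
    if video_index < len(second_per_grid_ts):
        video_second_per_grid_t = second_per_grid_ts[video_index]
    else:
        video_second_per_grid_t = 1.0  # placeholder default; the actual fix would be to plumb the value through
    print(f"video {video_index}: second_per_grid_t={video_second_per_grid_t}")
```

For reference, a rough standalone reproduction outside the test harness might look like the sketch below. It is untested and makes a few assumptions: `VLLM_USE_V1=1` selects the V1 engine, a dummy numpy clip is accepted as `multi_modal_data["video"]`, and the placeholder tokens match Qwen2-VL's chat template.

```python
# Hypothetical standalone reproduction sketch (untested; mirrors the failing pytest case).
import os
os.environ["VLLM_USE_V1"] = "1"  # force the V1 engine, as in the failing test

import numpy as np
from vllm import LLM, SamplingParams

# Dummy video: 16 RGB frames of 224x224 (any small clip should exercise the video path).
video = np.random.randint(0, 255, size=(16, 224, 224, 3), dtype=np.uint8)

llm = LLM(
    model="Qwen/Qwen2-VL-2B-Instruct",
    max_model_len=4096,
    enforce_eager=True,
    trust_remote_code=True,
)

# Placeholder tokens assumed from the Qwen2-VL chat template.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|video_pad|><|vision_end|>Describe the video.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```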
