
support videochat #1637

Open
xufang-lisa wants to merge 42 commits into huggingface:main from xufang-lisa:xufang/support_videochat

Conversation

@xufang-lisa
Contributor

@xufang-lisa xufang-lisa commented Mar 13, 2026

Conversion command line for OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B:
optimum-cli export openvino -m OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B ./VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B

Inference of OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B using OpenVINO backend:

from transformers import AutoTokenizer, AutoProcessor
from transformers.video_utils import load_video
from optimum.intel.openvino import OVModelForVisualCausalLM

model_dir = "./VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B"

model = OVModelForVisualCausalLM.from_pretrained(model_dir, trust_remote_code=True, device="cpu")

# Prepare video input
video_path = "./253997_tiny.mp4"
input_video, _ = load_video(video_path, num_frames=8, backend="opencv")
question = "Describe this video in detail."

# preprocess inputs
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
inputs = model.preprocess_inputs(processor=None, tokenizer=tokenizer, text=question, video=input_video, config=model.config)

# Run inference
output_ids = model.generate(**inputs, max_new_tokens=10)
input_prompt_len = inputs["input_ids"].shape[-1]
generated = output_ids[:, input_prompt_len:]
output_text = tokenizer.decode(generated[0])

print(output_text)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@xufang-lisa xufang-lisa marked this pull request as ready for review March 17, 2026 09:21
Contributor

Copilot AI left a comment

Pull request overview

This PR adds OpenVINO export/runtime support for the videochat_flash_qwen visual-language model type, including model-specific export configs/patchers and test coverage updates.

Changes:

  • Register videochat_flash_qwen in the OpenVINO tasks/config system, including multi-part export behaviors (language, vision embeddings, vision projection, text embeddings).
  • Introduce model patchers to make the model’s forwards export-friendly for OpenVINO conversion.
  • Update OpenVINO tests/export harnesses and CLI checks to include the new architecture.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.

Summary per file:

  • optimum/intel/openvino/modeling_visual_language.py: Adds an OV runtime wrapper subclass for videochat_flash_qwen and hooks it into the model-type mapping.
  • optimum/exporters/openvino/model_configs.py: Adds OpenVINO export configs, dummy input generators, and TasksManager registration for videochat_flash_qwen and its sub-behaviors.
  • optimum/exporters/openvino/model_patcher.py: Adds patchers for the language, vision embedding, and vision projection submodules for export.
  • optimum/exporters/openvino/utils.py: Marks videochat_flash_qwen as a VLM model type for exporter utilities.
  • optimum/intel/openvino/configuration.py: Adds a default quantization preset for the upstream VideoChat-Flash model id.
  • optimum/exporters/openvino/convert.py: Adjusts config override application when exporting models without .config.
  • optimum/commands/export/openvino.py: Adds a CLI guard for a known transformers-version incompatibility for the upstream model repo.
  • tests/openvino/utils_tests.py: Adds a tiny internal test model mapping and a remote-code allowlist entry.
  • tests/openvino/test_seq2seq.py: Expands VLM integration tests to include videochat_flash_qwen with model-specific skips/paths.
  • tests/openvino/test_export.py: Includes videochat_flash_qwen in export test coverage and the loading flow.
  • setup.py: Adds additional test dependencies related to video IO.


xufang-lisa and others added 3 commits March 24, 2026 09:13
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@xufang-lisa
Contributor Author

Update description by taking the template from #1551

Done

@rkazants
Collaborator

rkazants commented Apr 2, 2026

Hi @echarlaix, @IlyasMoutawwakil, could you please review this PR?

Thanks,
Roman

Comment on lines +203 to +205
VideochatFlashQwenLanguageModelPatcher,
VideochatFlashQwenVisionEmbeddingModelPatcher,
VideochatFlashQwenVisionProjectionModelPatcher,
Collaborator

Suggested change
VideochatFlashQwenLanguageModelPatcher,
VideochatFlashQwenVisionEmbeddingModelPatcher,
VideochatFlashQwenVisionProjectionModelPatcher,
VideoChatFlashQwenLanguageModelPatcher,
VideoChatFlashQwenVisionEmbeddingModelPatcher,
VideoChatFlashQwenVisionProjectionModelPatcher,

for consistency

Contributor Author

Done.

Comment on lines +5326 to +5327
self.height = 224
self.width = 224
Collaborator

please make a comment why they are fixed

Contributor Author

Done.

self.task = task
self.batch_size = batch_size
self.hidden_size = normalized_config.config.mm_hidden_size
self.num_patches = 64
Collaborator

please make a comment why it is fixed.

Contributor Author

Done.


@register_in_tasks_manager("videochat_flash_qwen", *["image-text-to-text"], library_name="transformers")
class VideoChatFlashQwenOpenVINOConfig(BaseVLMOpenVINOConfig):
MIN_TRANSFORMERS_VERSION = "4.45.0"
Collaborator

Suggested change
MIN_TRANSFORMERS_VERSION = "4.45.0"
MIN_TRANSFORMERS_VERSION = "4.49.0"

let use this limitation and remove below check with exception

Contributor Author

Done.
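For context on this exchange: a `MIN_TRANSFORMERS_VERSION` declaration lets the exporter reject unsupported environments up front, making a separate CLI check-with-exception redundant. A minimal sketch of how such a guard can be expressed (the helper name `transformers_version_ok` is hypothetical, for illustration only; optimum's real check lives in its export-config machinery):

```python
from packaging import version

# Minimum suggested in the review above
MIN_TRANSFORMERS_VERSION = "4.49.0"

def transformers_version_ok(installed: str) -> bool:
    """Hypothetical helper: True if the installed transformers version
    meets the export config's declared minimum."""
    return version.parse(installed) >= version.parse(MIN_TRANSFORMERS_VERSION)
```

With the minimum declared on the config, the version gate fires once at export time instead of being duplicated as an ad-hoc exception in the CLI.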

if not self._behavior == VideoChatFlashQwenConfigBehavior.VISION_EMBEDDINGS:
return {}
return {
"hidden_states": {0: "batch_size", 1: "num_channels", 2: "num_frames", 3: "height", 4: "width"},
Collaborator

strange that num_channels is dynamic, please double-check

Contributor Author

num_channels has been removed.


if isinstance(hidden_states, tuple) and len(hidden_states) == 2:
hidden_states, residual = hidden_states
if residual is not None:
Collaborator

why do we need this check? Is there any model where residual is None?

Contributor Author

residual is always None and has been removed. The type of hidden_states is a tensor, not a tuple, so this check has also been removed.

Comment on lines +7662 to +7663
if self.sep_pos_embed:
raise NotImplementedError
Collaborator

Not sure that this is needed. Please clean code in this patcher.

Contributor Author

Removed.

model: "PreTrainedModel",
model_kwargs: Dict[str, Any] = None,
):
model.__orig_forward = model.forward
Collaborator

why do you need this patching? You just need to use the original names; patching is not needed

Collaborator

please remove this patch

Contributor Author

Done.

@@ -1,14 +1,16 @@
import ast
Collaborator

this module is quite old and was last released in 2017. Do we really need it?

Contributor Author

@xufang-lisa xufang-lisa Apr 3, 2026

Replaced it with a simple string-to-list conversion.
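For illustration, a string-to-list conversion of this kind might look as follows. This is a sketch only: the name `parse_grid_pinpoints` is hypothetical, and it assumes grid_pinpoints strings shaped like "[[336, 672], [672, 336]]" (per the later comment, configs may already store a list, which is passed through unchanged):

```python
import re

def parse_grid_pinpoints(value):
    """Hypothetical sketch: turn a grid_pinpoints string such as
    "[[336, 672], [672, 336]]" into a list of [height, width] pairs
    without ast.literal_eval. Lists are returned as-is."""
    if isinstance(value, list):
        return value
    # extract all integers, then group consecutive ones into pairs
    numbers = [int(n) for n in re.findall(r"\d+", value)]
    return [numbers[i : i + 2] for i in range(0, len(numbers), 2)]
```

Since the values are known to be flat integer pairs, a regex scan avoids evaluating arbitrary literals entirely.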

Comment on lines +4828 to +4831
grid_h = np.arange(grid_size, dtype=np.float32)
grid_w = np.arange(grid_size, dtype=np.float32)
grid = np.meshgrid(grid_w, grid_h) # here w goes first
grid = np.stack(grid, axis=0)
Collaborator

can you please use torch? I think we mostly use torch in the processing parts of optimum-intel. Let's stay aligned.

Collaborator

The same comment is for code below

Contributor Author

Done.
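A torch equivalent of the numpy grid construction quoted above could look like the sketch below (the function name `build_2d_grid` is hypothetical; `indexing="xy"` reproduces np.meshgrid's default ordering, where w goes first):

```python
import torch

def build_2d_grid(grid_size: int) -> torch.Tensor:
    """Sketch of the quoted numpy meshgrid code in torch: returns a
    (2, grid_size, grid_size) tensor stacking the w-grid then the h-grid,
    matching np.stack(np.meshgrid(grid_w, grid_h), axis=0)."""
    grid_h = torch.arange(grid_size, dtype=torch.float32)
    grid_w = torch.arange(grid_size, dtype=torch.float32)
    # indexing="xy" matches np.meshgrid's default, so w varies along columns
    gw, gh = torch.meshgrid(grid_w, grid_h, indexing="xy")
    return torch.stack([gw, gh], dim=0)
```

Keeping the same w-first stacking order matters here, since the grid typically feeds a 2D sin-cos positional embedding whose halves are assigned per axis.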

Comment on lines +5462 to +5474
(
num_patch_width,
num_patch_height,
) = _OVVideoChatFlashQwenForCausalLM.get_anyres_image_grid_shape(
image_sizes[image_idx],
self.config.image_grid_pinpoints,
vision_tower_image_size,
max_resolutions=None,
)
except Exception:
logger.exception("Error while computing anyres image grid shape")
raise
# num_patch_width, num_patch_height = 2, 2
Collaborator

why do we need try-catch block?

Contributor Author

ast.literal_eval could potentially raise an exception inside get_anyres_image_grid_shape, which is why an exception-handling block was here. Both ast.literal_eval and the try-catch have now been replaced. By the way, grid_pinpoints in the configuration file is a list, not a string.

Collaborator

@rkazants rkazants left a comment

Please clean up the code to remove the extra checks with NotImplementedError exceptions and the try-catch blocks. I don't see the point of these checks, because currently we don't expect any model to fall into them.

self.img_pos_embed.data.copy_(torch.from_numpy(img_pos_embed).float().unsqueeze(0))

# Adopted from https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B/blob/main/mm_projector_builder.py#L6
def bipartite_soft_matching(
Collaborator

please leave a comment what is this function doing? Describe this stage in the pipeline.

Contributor Author

Done.
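For readers unfamiliar with this stage of the pipeline: bipartite soft matching (as in ToMe-style token merging) reduces the visual token count by splitting tokens into two alternating sets, matching every token in one set to its most similar counterpart in the other, and merging the r most similar pairs. A simplified, hypothetical sketch of the matching step (not the PR's actual implementation):

```python
import torch

def bipartite_soft_matching_sketch(tokens: torch.Tensor, r: int):
    """Simplified illustration of bipartite soft matching: split tokens
    into alternating sets A/B, score A-B pairs by cosine similarity,
    and pick the r most similar pairs as merge candidates."""
    metric = tokens / tokens.norm(dim=-1, keepdim=True)  # cosine-normalized
    set_a, set_b = metric[:, ::2], metric[:, 1::2]       # alternating split
    scores = set_a @ set_b.transpose(-1, -2)             # (batch, n_a, n_b)
    best_score, best_b = scores.max(dim=-1)              # best partner in B per A token
    order = best_score.argsort(dim=-1, descending=True)  # most similar pairs first
    merge_a = order[:, :r]                               # indices of A-tokens to merge away
    return merge_a, best_b
```

Each pass merges at most r tokens; the r_merge_list seen elsewhere in this review chains several such passes to reach the target token count.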

Comment on lines +5014 to +5016
if isinstance(images, list):
raise NotImplementedError
else:
Collaborator

please remove

Contributor Author

Removed.

Comment on lines +5017 to +5018
# input: B T C H W
# output: B T*L C
Collaborator

please leave a comment why it is needed

Contributor Author

Done.

head = self.num_attention_heads

dim = c // head
for r in r_merge_list:
Collaborator

what is r_merge_list?

Contributor Author

Comment is added.

"""
size = None
b, p, c = x.shape
tmp_p = p
Collaborator

please use a proper name for this variable. tmp_p?

Contributor Author

It has been renamed to current_num_tokens.

@xufang-lisa
Contributor Author

@rkazants A default image_preprocess implementation has been added to _OVVideoChatFlashQwenForCausalLM to provide image preprocessing when the processor is unavailable. Please take a look at e1dba19.
