
support videochat #1637

Open
xufang-lisa wants to merge 42 commits into huggingface:main from xufang-lisa:xufang/support_videochat

Conversation

@xufang-lisa
Contributor

@xufang-lisa xufang-lisa commented Mar 13, 2026

Conversion command line for OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B:
optimum-cli export openvino -m OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B ./VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B

Inference of OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B using OpenVINO backend:

from transformers import AutoTokenizer, AutoProcessor
from transformers.video_utils import load_video
from optimum.intel.openvino import OVModelForVisualCausalLM

model_dir = "./VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B"

model = OVModelForVisualCausalLM.from_pretrained(model_dir, trust_remote_code=True, device="cpu")

# Prepare video input
video_path = "./253997_tiny.mp4"
input_video, _ = load_video(video_path, num_frames=8, backend="opencv")
question = "Describe this video in detail."

# preprocess inputs
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
inputs = model.preprocess_inputs(processor=None, tokenizer=tokenizer, text=question, video=input_video, config=model.config)

# Run inference
output_ids = model.generate(**inputs, max_new_tokens=10)
input_prompt_len = inputs["input_ids"].shape[-1]
generated = output_ids[:, input_prompt_len:]
output_text = tokenizer.decode(generated[0])

print(output_text)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@xufang-lisa xufang-lisa marked this pull request as ready for review March 17, 2026 09:21
Contributor

Copilot AI left a comment

Pull request overview

This PR adds OpenVINO export/runtime support for the videochat_flash_qwen visual-language model type, including model-specific export configs/patchers and test coverage updates.

Changes:

  • Register videochat_flash_qwen in the OpenVINO tasks/config system, including multi-part export behaviors (language, vision embeddings, vision projection, text embeddings).
  • Introduce model patchers to make the model’s forwards export-friendly for OpenVINO conversion.
  • Update OpenVINO tests/export harnesses and CLI checks to include the new architecture.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.

Summary per file:

  • optimum/intel/openvino/modeling_visual_language.py: Adds an OV runtime wrapper subclass for videochat_flash_qwen and hooks it into the model-type mapping.
  • optimum/exporters/openvino/model_configs.py: Adds OpenVINO export configs, dummy input generators, and TasksManager registration for videochat_flash_qwen and its sub-behaviors.
  • optimum/exporters/openvino/model_patcher.py: Adds patchers for the language, vision embedding, and vision projection submodules for export.
  • optimum/exporters/openvino/utils.py: Marks videochat_flash_qwen as a VLM model type for exporter utilities.
  • optimum/intel/openvino/configuration.py: Adds a default quantization preset for the upstream VideoChat-Flash model id.
  • optimum/exporters/openvino/convert.py: Adjusts config override application when exporting models without .config.
  • optimum/commands/export/openvino.py: Adds a CLI guard for a known transformers-version incompatibility for the upstream model repo.
  • tests/openvino/utils_tests.py: Adds a tiny internal test model mapping and a remote-code allowlist entry.
  • tests/openvino/test_seq2seq.py: Expands VLM integration tests to include videochat_flash_qwen with model-specific skips/paths.
  • tests/openvino/test_export.py: Includes videochat_flash_qwen in export test coverage and the loading flow.
  • setup.py: Adds additional test dependencies related to video IO.


xufang-lisa and others added 3 commits March 24, 2026 09:13
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@xufang-lisa
Contributor Author

Update description by taking the template from #1551

Done

@rkazants
Collaborator

rkazants commented Apr 2, 2026

Hi @echarlaix, @IlyasMoutawwakil, could you please review this PR?

Thanks,
Roman

Comment on lines +203 to +205
VideochatFlashQwenLanguageModelPatcher,
VideochatFlashQwenVisionEmbeddingModelPatcher,
VideochatFlashQwenVisionProjectionModelPatcher,
Collaborator

Suggested change
VideochatFlashQwenLanguageModelPatcher,
VideochatFlashQwenVisionEmbeddingModelPatcher,
VideochatFlashQwenVisionProjectionModelPatcher,
VideoChatFlashQwenLanguageModelPatcher,
VideoChatFlashQwenVisionEmbeddingModelPatcher,
VideoChatFlashQwenVisionProjectionModelPatcher,

for consistency

Contributor Author

Done.

Comment on lines +5326 to +5327
self.height = 224
self.width = 224
Collaborator

please make a comment why they are fixed

Contributor Author

Done.

self.task = task
self.batch_size = batch_size
self.hidden_size = normalized_config.config.mm_hidden_size
self.num_patches = 64
Collaborator

please make a comment why it is fixed.

Contributor Author

Done.


@register_in_tasks_manager("videochat_flash_qwen", *["image-text-to-text"], library_name="transformers")
class VideoChatFlashQwenOpenVINOConfig(BaseVLMOpenVINOConfig):
MIN_TRANSFORMERS_VERSION = "4.45.0"
Collaborator

Suggested change
MIN_TRANSFORMERS_VERSION = "4.45.0"
MIN_TRANSFORMERS_VERSION = "4.49.0"

let use this limitation and remove below check with exception

Contributor Author

Done.
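For context on this exchange: a `MIN_TRANSFORMERS_VERSION` declaration lets the exporter reject unsupported environments up front, making a separate CLI check-with-exception redundant. A minimal sketch of how such a guard can be expressed (the helper name `transformers_version_ok` is hypothetical, for illustration only; optimum's real check lives in its export-config machinery):

```python
from packaging import version

# Minimum suggested in the review above
MIN_TRANSFORMERS_VERSION = "4.49.0"

def transformers_version_ok(installed: str) -> bool:
    """Hypothetical helper: True if the installed transformers version
    meets the export config's declared minimum."""
    return version.parse(installed) >= version.parse(MIN_TRANSFORMERS_VERSION)
```

With the minimum declared on the config, the version gate fires once at export time instead of being duplicated as an ad-hoc exception in the CLI.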

if not self._behavior == VideoChatFlashQwenConfigBehavior.VISION_EMBEDDINGS:
return {}
return {
"hidden_states": {0: "batch_size", 1: "num_channels", 2: "num_frames", 3: "height", 4: "width"},
Collaborator

strange that num_channels is dynamic, please double-check

Contributor Author

num_channels has been removed.


if isinstance(hidden_states, tuple) and len(hidden_states) == 2:
hidden_states, residual = hidden_states
if residual is not None:
Collaborator

why do we need this check? Is there any model where residual is None?

Contributor Author

residual is always None and has been removed. The type of hidden_states is a tensor, not a tuple, so this check has also been removed.

Comment on lines +7662 to +7663
if self.sep_pos_embed:
raise NotImplementedError
Collaborator

Not sure that this is needed. Please clean code in this patcher.

Contributor Author

Removed.

model: "PreTrainedModel",
model_kwargs: Dict[str, Any] = None,
):
model.__orig_forward = model.forward
Collaborator

why do you need this patching? You just need to use the original names; patching is not needed

Collaborator

please remove this patch

Contributor Author

Done.

@@ -1,14 +1,16 @@
import ast
Collaborator

this module is quite old and was last released in 2017. Do we really need it?

Contributor Author

@xufang-lisa xufang-lisa Apr 3, 2026

Replaced it with a simple string-to-list conversion.
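For illustration, a string-to-list conversion of this kind might look as follows. This is a sketch only: the name `parse_grid_pinpoints` is hypothetical, and it assumes grid_pinpoints strings shaped like "[[336, 672], [672, 336]]" (per the later comment, configs may already store a list, which is passed through unchanged):

```python
import re

def parse_grid_pinpoints(value):
    """Hypothetical sketch: turn a grid_pinpoints string such as
    "[[336, 672], [672, 336]]" into a list of [height, width] pairs
    without ast.literal_eval. Lists are returned as-is."""
    if isinstance(value, list):
        return value
    # extract all integers, then group consecutive ones into pairs
    numbers = [int(n) for n in re.findall(r"\d+", value)]
    return [numbers[i : i + 2] for i in range(0, len(numbers), 2)]
```

Since the values are known to be flat integer pairs, a regex scan avoids evaluating arbitrary literals entirely.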

Comment on lines +4828 to +4831
grid_h = np.arange(grid_size, dtype=np.float32)
grid_w = np.arange(grid_size, dtype=np.float32)
grid = np.meshgrid(grid_w, grid_h) # here w goes first
grid = np.stack(grid, axis=0)
Collaborator

can you please use torch? I think we mostly use torch in the processing parts of optimum-intel. Let's stay aligned.

Collaborator

The same comment is for code below

Contributor Author

Done.
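A torch equivalent of the numpy grid construction quoted above could look like the sketch below (the function name `build_2d_grid` is hypothetical; `indexing="xy"` reproduces np.meshgrid's default ordering, where w goes first):

```python
import torch

def build_2d_grid(grid_size: int) -> torch.Tensor:
    """Sketch of the quoted numpy meshgrid code in torch: returns a
    (2, grid_size, grid_size) tensor stacking the w-grid then the h-grid,
    matching np.stack(np.meshgrid(grid_w, grid_h), axis=0)."""
    grid_h = torch.arange(grid_size, dtype=torch.float32)
    grid_w = torch.arange(grid_size, dtype=torch.float32)
    # indexing="xy" matches np.meshgrid's default, so w varies along columns
    gw, gh = torch.meshgrid(grid_w, grid_h, indexing="xy")
    return torch.stack([gw, gh], dim=0)
```

Keeping the same w-first stacking order matters here, since the grid typically feeds a 2D sin-cos positional embedding whose halves are assigned per axis.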

Comment on lines +5462 to +5474
(
num_patch_width,
num_patch_height,
) = _OVVideoChatFlashQwenForCausalLM.get_anyres_image_grid_shape(
image_sizes[image_idx],
self.config.image_grid_pinpoints,
vision_tower_image_size,
max_resolutions=None,
)
except Exception:
logger.exception("Error while computing anyres image grid shape")
raise
# num_patch_width, num_patch_height = 2, 2
Collaborator

why do we need try-catch block?

Contributor Author

ast.literal_eval could potentially raise an exception inside get_anyres_image_grid_shape, which is why an exception-handling block was here. Both ast.literal_eval and the try-catch have now been replaced. By the way, grid_pinpoints in the configuration file is a list, not a string.

Collaborator

@rkazants rkazants left a comment

Please clean up the code to remove the extra checks with NotImplementedError exceptions and the try-catch blocks. I don't see the point of these checks, because currently we don't expect any model to fall into them.

self.img_pos_embed.data.copy_(torch.from_numpy(img_pos_embed).float().unsqueeze(0))

# Adopted from https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B/blob/main/mm_projector_builder.py#L6
def bipartite_soft_matching(
Collaborator

please leave a comment what is this function doing? Describe this stage in the pipeline.

Contributor Author

Done.
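For readers unfamiliar with this stage of the pipeline: bipartite soft matching (as in ToMe-style token merging) reduces the visual token count by splitting tokens into two alternating sets, matching every token in one set to its most similar counterpart in the other, and merging the r most similar pairs. A simplified, hypothetical sketch of the matching step (not the PR's actual implementation):

```python
import torch

def bipartite_soft_matching_sketch(tokens: torch.Tensor, r: int):
    """Simplified illustration of bipartite soft matching: split tokens
    into alternating sets A/B, score A-B pairs by cosine similarity,
    and pick the r most similar pairs as merge candidates."""
    metric = tokens / tokens.norm(dim=-1, keepdim=True)  # cosine-normalized
    set_a, set_b = metric[:, ::2], metric[:, 1::2]       # alternating split
    scores = set_a @ set_b.transpose(-1, -2)             # (batch, n_a, n_b)
    best_score, best_b = scores.max(dim=-1)              # best partner in B per A token
    order = best_score.argsort(dim=-1, descending=True)  # most similar pairs first
    merge_a = order[:, :r]                               # indices of A-tokens to merge away
    return merge_a, best_b
```

Each pass merges at most r tokens; the r_merge_list seen elsewhere in this review chains several such passes to reach the target token count.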

Comment on lines +5014 to +5016
if isinstance(images, list):
raise NotImplementedError
else:
Collaborator

please remove

Contributor Author

Removed.

Comment on lines +5017 to +5018
# input: B T C H W
# output: B T*L C
Collaborator

please leave a comment why it is needed

Contributor Author

Done.

head = self.num_attention_heads

dim = c // head
for r in r_merge_list:
Collaborator

what is r_merge_list?

Contributor Author

Comment is added.

"""
size = None
b, p, c = x.shape
tmp_p = p
Collaborator

please use a proper name for this variable. tmp_p?

Contributor Author

It has been renamed to current_num_tokens.

@xufang-lisa
Contributor Author

@rkazants A default image_preprocess implementation has been added to _OVVideoChatFlashQwenForCausalLM to provide image preprocessing when the processor is unavailable. Please take a look at e1dba19.
