Conversation
Pull request overview
This PR adds OpenVINO export/runtime support for the videochat_flash_qwen visual-language model type, including model-specific export configs/patchers and test coverage updates.
Changes:
- Register videochat_flash_qwen in the OpenVINO tasks/config system, including multi-part export behaviors (language, vision embeddings, vision projection, text embeddings).
- Introduce model patchers to make the model's forwards export-friendly for OpenVINO conversion.
- Update OpenVINO tests/export harnesses and CLI checks to include the new architecture.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| optimum/intel/openvino/modeling_visual_language.py | Adds an OV runtime wrapper subclass for videochat_flash_qwen and hooks it into the model-type mapping. |
| optimum/exporters/openvino/model_configs.py | Adds OpenVINO export configs, dummy input generators, and TasksManager registration for videochat_flash_qwen and its sub-behaviors. |
| optimum/exporters/openvino/model_patcher.py | Adds patchers for the language, vision embedding, and vision projection submodules for export. |
| optimum/exporters/openvino/utils.py | Marks videochat_flash_qwen as a VLM model type for exporter utilities. |
| optimum/intel/openvino/configuration.py | Adds a default quantization preset for the upstream VideoChat-Flash model id. |
| optimum/exporters/openvino/convert.py | Adjusts config override application when exporting models without .config. |
| optimum/commands/export/openvino.py | Adds a CLI guard for a known transformers-version incompatibility for the upstream model repo. |
| tests/openvino/utils_tests.py | Adds a tiny internal test model mapping and a remote-code allowlist entry. |
| tests/openvino/test_seq2seq.py | Expands VLM integration tests to include videochat_flash_qwen with model-specific skips/paths. |
| tests/openvino/test_export.py | Includes videochat_flash_qwen in export test coverage and the loading flow. |
| setup.py | Adds additional test dependencies related to video IO. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Hi @echarlaix, @IlyasMoutawwakil, could you please review this PR? Thanks!
Quoted code:

    VideochatFlashQwenLanguageModelPatcher,
    VideochatFlashQwenVisionEmbeddingModelPatcher,
    VideochatFlashQwenVisionProjectionModelPatcher,

Suggested change:

    VideoChatFlashQwenLanguageModelPatcher,
    VideoChatFlashQwenVisionEmbeddingModelPatcher,
    VideoChatFlashQwenVisionProjectionModelPatcher,

for consistency
Quoted code:

    self.height = 224
    self.width = 224

Please add a comment explaining why these values are fixed.
Quoted code:

    self.task = task
    self.batch_size = batch_size
    self.hidden_size = normalized_config.config.mm_hidden_size
    self.num_patches = 64

Please add a comment explaining why it is fixed.
Quoted code:

    @register_in_tasks_manager("videochat_flash_qwen", *["image-text-to-text"], library_name="transformers")
    class VideoChatFlashQwenOpenVINOConfig(BaseVLMOpenVINOConfig):
        MIN_TRANSFORMERS_VERSION = "4.45.0"

Suggested change:

    MIN_TRANSFORMERS_VERSION = "4.49.0"

Let's use this version constraint and remove the check with the exception below.
Quoted code:

    if not self._behavior == VideoChatFlashQwenConfigBehavior.VISION_EMBEDDINGS:
        return {}
    return {
        "hidden_states": {0: "batch_size", 1: "num_channels", 2: "num_frames", 3: "height", 4: "width"},

It is strange that num_channels is dynamic, please double-check.

Reply: num_channels has been removed.
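For illustration, here is a minimal, self-contained sketch of what the `inputs` property could look like once num_channels is kept static; this is not the exact PR code, and the class names are simplified stand-ins:

```python
from enum import Enum


class VideoChatFlashQwenConfigBehavior(str, Enum):
    LANGUAGE = "language"
    VISION_EMBEDDINGS = "vision_embeddings"


class VisionEmbeddingsConfigSketch:
    """Illustrative stand-in for the export config's dynamic-axes logic."""

    def __init__(self, behavior: VideoChatFlashQwenConfigBehavior):
        self._behavior = behavior

    @property
    def inputs(self):
        if self._behavior != VideoChatFlashQwenConfigBehavior.VISION_EMBEDDINGS:
            return {}
        # num_channels (axis 1) is always 3 for RGB input, so it stays static;
        # only batch, frame count, and spatial dims remain dynamic.
        return {
            "hidden_states": {0: "batch_size", 2: "num_frames", 3: "height", 4: "width"},
        }
```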
Quoted code:

    if isinstance(hidden_states, tuple) and len(hidden_states) == 2:
        hidden_states, residual = hidden_states
        if residual is not None:

Why do we need this check? Is there any model where residual is None?

Reply: residual is always None and has been removed. hidden_states is a tensor, not a tuple, so that check has also been removed.
Quoted code:

    if self.sep_pos_embed:
        raise NotImplementedError

Not sure this is needed. Please clean up the code in this patcher.
Quoted code:

    model: "PreTrainedModel",
    model_kwargs: Dict[str, Any] = None,
):
    model.__orig_forward = model.forward

Why do you need this patching? If you just use the original names, patching is not needed.
Quoted code:

    @@ -1,14 +1,16 @@
    import ast

This module is quite old and was last released in 2017. Do we really need it?

Reply: Replaced it with a simple string-to-list conversion.
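One way such a conversion could look without `ast.literal_eval` is plain regex-based number extraction; this is a hypothetical sketch, and `parse_grid_pinpoints` is an illustrative name, not a function from this PR:

```python
import re


def parse_grid_pinpoints(value):
    """Convert a grid_pinpoints string like "[(336, 672), (672, 336)]"
    into a list of [h, w] pairs; pass lists through unchanged."""
    if isinstance(value, list):
        return value
    numbers = [int(n) for n in re.findall(r"-?\d+", value)]
    # Group the flat number sequence back into pairs.
    return [numbers[i : i + 2] for i in range(0, len(numbers), 2)]
```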
Quoted code:

    grid_h = np.arange(grid_size, dtype=np.float32)
    grid_w = np.arange(grid_size, dtype=np.float32)
    grid = np.meshgrid(grid_w, grid_h)  # here w goes first
    grid = np.stack(grid, axis=0)

Can you please use torch? I think we mostly use torch in the processing parts of optimum-intel. Let's stay aligned.

The same comment applies to the code below.
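For reference, a direct torch equivalent of the numpy snippet quoted above could look like this (a sketch; `make_grid` is an illustrative name):

```python
import torch


def make_grid(grid_size: int) -> torch.Tensor:
    """torch equivalent of the numpy grid construction:
    returns a (2, grid_size, grid_size) tensor with w varying first."""
    grid_h = torch.arange(grid_size, dtype=torch.float32)
    grid_w = torch.arange(grid_size, dtype=torch.float32)
    # indexing="xy" matches np.meshgrid's default, so w goes first
    grid = torch.meshgrid(grid_w, grid_h, indexing="xy")
    return torch.stack(grid, dim=0)
```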
Quoted code:

    (
        num_patch_width,
        num_patch_height,
    ) = _OVVideoChatFlashQwenForCausalLM.get_anyres_image_grid_shape(
        image_sizes[image_idx],
        self.config.image_grid_pinpoints,
        vision_tower_image_size,
        max_resolutions=None,
    )
    except Exception:
        logger.exception("Error while computing anyres image grid shape")
        raise
    # num_patch_width, num_patch_height = 2, 2

Why do we need the try-catch block?

Reply: ast.literal_eval could raise an exception inside get_anyres_image_grid_shape, so an exception-handling block was added here. Both ast.literal_eval and the try-catch have now been replaced. By the way, grid_pinpoints in the configuration file is a list, not a string.
rkazants left a comment:

Please clean up the code to remove the extra checks with NotImplementedError exceptions and the try-catch blocks. I don't understand the purpose of these checks, because we don't expect any model to fall into them.
Quoted code:

    self.img_pos_embed.data.copy_(torch.from_numpy(img_pos_embed).float().unsqueeze(0))

    # Adopted from https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B/blob/main/mm_projector_builder.py#L6
    def bipartite_soft_matching(

Please leave a comment describing what this function does and where this stage fits in the pipeline.
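For background, bipartite soft matching is the token-merging step popularized by ToMe (Token Merging): tokens are split into two alternating sets, each token in one set is matched to its most similar token in the other set, and the r most similar pairs are merged by averaging, reducing the visual token count before projection. The sketch below is an illustrative torch implementation of that idea under these assumptions, not the upstream mm_projector_builder.py code:

```python
import torch


def bipartite_soft_matching_sketch(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs in x of shape (B, N, C).

    Tokens at even indices (set A) are matched to their most similar
    odd-index token (set B) by cosine similarity; the r best-matched
    A-tokens are averaged into their B-token targets.
    """
    b, n, c = x.shape
    metric = x / x.norm(dim=-1, keepdim=True)   # cosine-normalized features
    a, bb = metric[:, ::2], metric[:, 1::2]     # alternating token sets
    scores = a @ bb.transpose(-1, -2)           # (B, N/2, N/2) similarities

    node_max, node_idx = scores.max(dim=-1)     # best B-match per A-token
    edge_idx = node_max.argsort(dim=-1, descending=True)
    unm_idx = edge_idx[:, r:]                   # A-tokens kept unmerged
    src_idx = edge_idx[:, :r]                   # A-tokens merged away
    dst_idx = node_idx.gather(1, src_idx)       # their B-token targets

    src_x, dst_x = x[:, ::2], x[:, 1::2]
    unm = src_x.gather(1, unm_idx.unsqueeze(-1).expand(-1, -1, c))
    src = src_x.gather(1, src_idx.unsqueeze(-1).expand(-1, -1, c))
    dst_x = dst_x.scatter_reduce(
        1, dst_idx.unsqueeze(-1).expand(-1, -1, c), src, reduce="mean"
    )
    return torch.cat([unm, dst_x], dim=1)       # (B, N - r, C)
```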
Quoted code:

    if isinstance(images, list):
        raise NotImplementedError
    else:

Quoted code:

    # input: B T C H W
    # output: B T*L C

Please leave a comment explaining why this is needed.
Quoted code:

    head = self.num_attention_heads

    dim = c // head
    for r in r_merge_list:

Reply: A comment has been added.
Quoted code:

    """
    size = None
    b, p, c = x.shape
    tmp_p = p

Please use a proper name for the variable tmp_p.

Reply: It has been renamed to current_num_tokens.
Conversion command line for OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B:

    optimum-cli export openvino -m OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B ./VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B

Inference of OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B using the OpenVINO backend: