[OpenVINO] Add image-text-to-embedding task for Qwen2VL #1649

Open
jianxiah-intel wants to merge 2 commits into huggingface:main from jianxiah-intel:image-text-to-embedding
jianxiah-intel commented Mar 25, 2026

What does this PR do?

This PR adds image-text-to-embedding task support for Qwen2VL-based models in the OpenVINO exporter.

  • Added Qwen2VLEmbeddingPatcher to patch the model forward for export, replacing the full causal LM forward with a backbone-only call that outputs last_hidden_state.
  • Added LMEmbeddingConfigHelper and registered the image-text-to-embedding task in the tasks manager for qwen2_vl.
  • Implemented OVModelForImageTextToEmbedding as the inference class, along with OVLMEmbeddingModel (stateless LM submodel without KV-cache) and _OVQwen2VLForEmbedding.
  • Introduced MODEL_TYPE_TO_IMAGE_TEXT_TO_EMBEDDING_CLS_MAPPING as a dedicated dispatch table for the new task, parallel to the existing MODEL_TYPE_TO_CLS_MAPPING.
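
To illustrate the forward-patching idea in the first bullet, here is a minimal toy sketch. It is not the code from this PR: `ToyBackbone`, `ToyCausalLM`, and `BackboneOnlyPatcher` are hypothetical stand-ins, and plain Python lists replace tensors. It only shows the general pattern of temporarily swapping a full causal-LM forward for a backbone-only call that returns last_hidden_state, then restoring the original afterwards.

```python
class ToyBackbone:
    """Hypothetical stand-in for the language model backbone (no lm_head)."""

    def __call__(self, inputs_embeds):
        # Pretend "hidden states" are just the inputs scaled by 2.
        hidden = [[x * 2.0 for x in row] for row in inputs_embeds]
        return {"last_hidden_state": hidden}


class ToyCausalLM:
    """Hypothetical stand-in for a causal LM: backbone followed by an lm_head."""

    def __init__(self):
        self.model = ToyBackbone()

    def forward(self, inputs_embeds):
        hidden = self.model(inputs_embeds)["last_hidden_state"]
        # Toy "lm_head": collapse each hidden row to a single number.
        return {"logits": [sum(row) for row in hidden]}


class BackboneOnlyPatcher:
    """Context manager that swaps forward for a backbone-only call (illustrative only)."""

    def __init__(self, model):
        self._model = model
        self._orig_forward = model.forward

    def __enter__(self):
        def backbone_forward(inputs_embeds):
            # Skip the lm_head entirely and expose last_hidden_state.
            return self._model.model(inputs_embeds)

        self._model.forward = backbone_forward
        return self._model

    def __exit__(self, exc_type, exc, tb):
        # Restore the original forward even if the export step raised.
        self._model.forward = self._orig_forward
        return False


toy = ToyCausalLM()
with BackboneOnlyPatcher(toy) as patched:
    patched_out = patched.forward([[1.0, 2.0]])
restored_out = toy.forward([[1.0, 2.0]])
```

Inside the `with` block the toy model returns only last_hidden_state; outside it, the original forward with logits is back in place.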

Support for additional architectures (e.g., Qwen3VL Embedding) will be added in follow-up PRs.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Add a new task type that runs the full multimodal pipeline (Qwen2VL only, for now) through a stateless language model backbone (no lm_head, no KV-cache) and returns the raw last_hidden_state with shape [B, T, D].
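
Because the task returns the raw last_hidden_state rather than a pooled vector, callers need to reduce [B, T, D] to [B, D] themselves. The PR does not specify a pooling strategy, so the sketch below is an assumption: it uses last-token pooling followed by L2 normalization, assumes right-padded inputs, and uses random NumPy arrays in place of real model output. `last_token_pool` is a hypothetical helper, not part of this PR.

```python
import numpy as np


def last_token_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padded token, then L2-normalize.

    Assumes right padding: real tokens come first, padding (mask == 0) last.
    """
    lengths = attention_mask.sum(axis=1)               # [B] number of real tokens
    last_idx = np.maximum(lengths - 1, 0)              # [B] index of the last real token
    batch_idx = np.arange(last_hidden_state.shape[0])  # [B]
    pooled = last_hidden_state[batch_idx, last_idx]    # [B, D]
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


# Stand-in for model output: B=2 sequences, T=4 tokens, D=8 hidden size.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8)).astype(np.float32)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])  # second sequence has two padding tokens

embeddings = last_token_pool(hidden, mask)  # shape [B, D]
```

Models trained with a different pooling scheme (for example mean pooling) would need a different reduction; check the model card of the embedding model before relying on this one.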