
[OpenVINO] export and inference support for Ministral3 (ministral3) VLM#1659

Open
dhandhalyabhavik wants to merge 1 commit into huggingface:main from dhandhalyabhavik:ministral-support

Conversation

@dhandhalyabhavik

Enable mistralai/Ministral-3-3B-Instruct-2512 (model_type: mistral3) for OpenVINO export via optimum-cli and inference via OVModelForVisualCausalLM.
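The diff itself is not shown here, but based on the description the export would follow the usual `optimum-cli export openvino` pattern; the output directory name and the `int4` weight format below are illustrative, not taken from the PR:

```shell
# Export the model to OpenVINO IR with INT4 weight compression
# (weight format and output path are illustrative)
optimum-cli export openvino \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --weight-format int4 \
  ministral-3b-ov
```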

The model is a Vision-Language Model with a PixtralVisionModel encoder, Mistral3MultiModalProjector with 2x2 PatchMerger, and MinistralModel decoder.

Changes:

  • model_configs.py: Add Mistral3OpenVINOConfig with VLM sub-model configs, register _CUSTOM_CLASSES for mistral3, register ministral3 in CONFIG_MAPPING
  • model_patcher.py: Add _mistral3_vision_embed_forward (inlines full vision pipeline to keep all shapes dynamic in OpenVINO IR), add Mistral3ImageEmbeddingModelPatcher and Mistral3LanguageModelPatcher (fixes sliding_window=None crash in MinistralModel)
  • utils.py: Add mistral3 to MULTI_MODAL_TEXT_GENERATION_MODELS
  • __main__.py: Add FP8 dequantization handling for FP8-quantized source models
  • modeling_visual_language.py: Add _OVMistral3ForCausalLM inference class with masked_scatter-based vision/text embedding merge, add to MODEL_TYPE_TO_CLS_MAPPING
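The masked_scatter-based vision/text embedding merge in `_OVMistral3ForCausalLM` can be illustrated schematically. This sketch uses plain Python lists in place of torch tensors, and a made-up `IMAGE_TOKEN_ID`; the real implementation operates on embedding tensors via `torch.Tensor.masked_scatter`, but the positional logic is the same:

```python
IMAGE_TOKEN_ID = 10  # hypothetical placeholder token id


def merge_embeddings(input_ids, text_embeds, image_embeds):
    """Replace the embedding at each image-token position with the next
    vision embedding, in order -- the list-based analogue of
    torch.Tensor.masked_scatter over an (input_ids == image_token) mask."""
    merged = list(text_embeds)
    vision_iter = iter(image_embeds)
    for i, tok in enumerate(input_ids):
        if tok == IMAGE_TOKEN_ID:
            merged[i] = next(vision_iter)
    return merged


# Two image placeholders get filled with the two vision embeddings:
# merge_embeddings([1, 10, 10, 2], ["t0", "t1", "t2", "t3"], ["v0", "v1"])
# -> ["t0", "v0", "v1", "t3"]
```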

Tested with FP16, INT8, and INT4 exports — text-only and image+text inference produce correct results across all precisions.
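The `sliding_window=None` crash mentioned above is the kind of failure that occurs when a config leaves the window unset while downstream code does arithmetic on it. Schematically, the guard the language-model patcher applies amounts to something like the following (hypothetical helper, not the actual patch):

```python
def effective_window(sliding_window, seq_len):
    """Treat an unset (None) sliding window as full attention over the
    whole sequence, instead of crashing on arithmetic with None."""
    if sliding_window is None:
        return seq_len
    return min(sliding_window, seq_len)


# effective_window(None, 128) -> 128 (full attention)
# effective_window(32, 128)   -> 32  (windowed attention)
```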

Tested multi-turn conversation with the following chat-ministral.py script:

#!/usr/bin/env python3
"""
Ministral-3B OpenVINO Terminal Chat
Interactive chat interface using the INT4 model.
Type 'exit' or 'quit' to stop. Type 'clear' to reset conversation history.
"""

import time
from optimum.intel.openvino import OVModelForVisualCausalLM
from transformers import AutoProcessor

MODEL_DIR = "/workspace/ministral-3b-ov"
MAX_NEW_TOKENS = 1024

# The model's built-in chat template injects a ~540-token default system prompt
# with unresolved {today}/{yesterday} variables, tool-use instructions, and
# directives to ask clarifying questions — all of which degrade multi-turn quality.
# We override it with a clean, short system message instead.
SYSTEM_PROMPT = (
    "You are a helpful AI assistant. Answer questions clearly and concisely. "
    "Do not ask unnecessary clarifying questions."
)

GENERATION_KWARGS = {"repetition_penalty": 1.2}


def generate_reply(model, processor, history):
    text = processor.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, return_tensors="pt")

    t0 = time.time()
    output = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS, **GENERATION_KWARGS)
    elapsed = time.time() - t0

    n_input = inputs["input_ids"].shape[1]
    n_generated = output.shape[1] - n_input
    reply = processor.decode(output[0][n_input:], skip_special_tokens=True)
    tps = n_generated / elapsed if elapsed > 0 else 0

    return reply, n_generated, elapsed, tps


def main():
    print("Loading Ministral-3B INT4 model from", MODEL_DIR)
    model = OVModelForVisualCausalLM.from_pretrained(MODEL_DIR, device="GPU")
    processor = AutoProcessor.from_pretrained(MODEL_DIR)
    print("Model loaded. Type 'exit' to quit, 'clear' to reset history.\n")

    history = [{"role": "system", "content": SYSTEM_PROMPT}]

    while True:
        try:
            user_input = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nGoodbye!")
            break

        if not user_input:
            continue
        if user_input.lower() in ("exit", "quit"):
            print("Goodbye!")
            break
        if user_input.lower() == "clear":
            history = [{"role": "system", "content": SYSTEM_PROMPT}]
            print("Conversation history cleared.\n")
            continue

        history.append({"role": "user", "content": user_input})

        print("Assistant: ", end="", flush=True)
        reply, n_gen, elapsed, tps = generate_reply(model, processor, history)
        print(reply)
        print(f"  [{n_gen} tokens, {elapsed:.1f}s, {tps:.1f} tok/s]\n")

        history.append({"role": "assistant", "content": reply})


if __name__ == "__main__":
    main()

Output:

# python3 chat-ministral.py 
Multiple distributions found for package optimum. Picked distribution: optimum-onnx
Loading Ministral-3B INT4 model from /workspace/ministral-3b-ov
Model loaded. Type 'exit' to quit, 'clear' to reset history.

You: Which are important graph traversal algorithms? just name 3.
Assistant: 1. **Depth-First Search (DFS)**
2. **Breadth-First Search (BFS)**
3. **Dijkstra’s Algorithm**
  [34 tokens, 2.7s, 12.8 tok/s]

You: what is the time complexity of them?   
Assistant: Here are their approximate time complexities:

- **DFS**: *O(V + E)* in worst case, but can be faster with optimizations like recursion stack management.
- **BFS**: *O(V + E)*, same as DFS for general graphs.
- **Dijkstra's Algorithm**: *O((V+E) log V)* using binary heaps or Fibonacci heap implementations; it reduces to O(E + V log V)) without any specific data structure.

*Note: Complexities depend on implementation details.*
  [106 tokens, 7.8s, 13.7 tok/s]

You: Write shortest code for BFS in python.       
Assistant: ```python
from collections import deque

def bfs(graph, start):
    visited = set()
    queue = deque([start])
    visited.add(start)

    while queue:
        node = queue.popleft()
        print(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
```

This code performs standard breadth-first search (BFS). It takes an adjacency list `graph` and a starting vertex `start`. The function returns nothing but prints nodes in order they're discovered.
  [118 tokens, 8.0s, 14.8 tok/s]

You: Write shortest code for DFS, keep it shortest.
Assistant: ```python
def dfs_recursive(graph, node=None, parent=[]):
    return [node] + sum(
        [(n,) if n != parent else []]
        for n in graph.get(node, [])
    ), end=''
dfs_recursive({0:[1],1:[2]},0)
```
I realize this was overly complex. Here's a much shorter version that uses Python's default argument behavior:

```python
def dfs(g,n=None,p=-1,v={}):
 v=n;return g[n][p]+sum(dfs(g,x,y,z)for x in g[n])if y!=x else[])
print(dfs({'a':['b','c'],'b':['d'], 'c':[],'d':[]},'a'))) # Outputs: ['a', 'b', 'd']

# Much simpler recursive approach:
def simple_dfs(graph, start_node):
    path = []
    def _traverse(current_node):
        nonlocal path
        path += current_node,
        for child in graph[current_node]:
            _traverse(child)
    _traverse(start_node)
    return path

simple_dfs({"A": ["B", "C"],"B":["D","F"],"C":["G"],"D":[],"E":[],"F":[],"G":[]},"A") # Should output something similar to: ['A', 'B', 'D']
```

The above examples show different levels of simplicity depending on your needs. For most practical purposes, I recommend the first example which provides a clear balance between readability and functionality.
  [332 tokens, 30.5s, 10.9 tok/s]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dhandhalyabhavik changed the title from "Add OpenVINO export and inference support for Mistral3 (mistral3) VLM" to "Add OpenVINO export and inference support for Ministral3 (ministral3) VLM" on Mar 27, 2026
@dhandhalyabhavik changed the title from "Add OpenVINO export and inference support for Ministral3 (ministral3) VLM" to "[OpenVINO] export and inference support for Ministral3 (ministral3) VLM" on Apr 4, 2026