BiModernVBert is a vision-language model built on the ModernVBert architecture that generates embeddings for both images and text in a shared 768-dimensional vector space. Unlike multi-vector models (ColPali, Jina v4), BiModernVBert uses single-vector embeddings with simple cosine similarity, making it efficient for large-scale document retrieval while maintaining strong performance.
- Single-Vector Embeddings: 768-dimensional vectors for both images and text
- Normalized Embeddings: L2-normalized, ready for efficient cosine similarity
- Simple Architecture: No multi-vector complexity or pooling required
- Efficient Retrieval: Fast similarity search with single vectors
- Zero-Shot Classification: Use text prompts to classify images without training
- Document Understanding: Optimized for visual document analysis
- Built-in Scoring: Uses processor's efficient cosine similarity implementation
Retrieval Pipeline:
dataset.compute_embeddings(model, embeddings_field="embeddings")
└─> embed_images()
└─> processor.process_images(imgs)
└─> model(**inputs)
└─> Returns (batch, 768) normalized embeddings
└─> Stores in FiftyOne for cosine similarity searchClassification Pipeline:
dataset.apply_model(model, label_field="predictions")
└─> _predict_all()
└─> Get image embeddings (batch, 768)
└─> Get text embeddings for classes (num_classes, 768)
└─> processor.score(text_embs, image_embs) # Cosine similarity
└─> Returns classification logits
└─> Output processor → Classification labelsNote: This model requires the colpali-engine package which provides the BiModernVBert implementation.
# Install FiftyOne and BiModernVBert dependencies
pip install fiftyone torch transformers pillow
pip install git+https://github.com/illuin-tech/colpali.git@vbert#egg=colpali-engineimport fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load document dataset from Hugging Face
dataset = load_from_hub(
"Voxel51/document-haystack-10pages",
overwrite=True,
max_samples=250 # Optional: subset for testing
)import fiftyone.zoo as foz
# Register this repository as a remote zoo model source
foz.register_zoo_model_source(
"https://github.com/harpreetsahota204/bimodernvbert",
overwrite=True
)import fiftyone.zoo as foz
import fiftyone.brain as fob
# Load BiModernVBert model
model = foz.load_zoo_model("ModernVBERT/bimodernvbert")
# Compute embeddings for all documents
dataset.compute_embeddings(
model=model,
embeddings_field="bimodernvbert_embeddings"
)
# Check embedding dimensions
print(dataset.first()['bimodernvbert_embeddings'].shape) # (768,)
# Build similarity index
text_img_index = fob.compute_similarity(
dataset,
model="ModernVBERT/bimodernvbert",
embeddings_field="bimodernvbert_embeddings",
brain_key="bimodernvbert_sim"
)
# Query for specific content
results = text_img_index.sort_by_similarity(
"invoice from 2024",
k=10 # Top 10 results
)
# Launch FiftyOne App
session = fo.launch_app(results, auto=False)Create 2D visualizations of your document embeddings:
import fiftyone.brain as fob
# First compute embeddings
dataset.compute_embeddings(
model=model,
embeddings_field="bimodernvbert_embeddings"
)
# Create UMAP visualization
results = fob.compute_visualization(
dataset,
method="umap", # Also supports "tsne", "pca"
brain_key="bimodernvbert_viz",
embeddings="bimodernvbert_embeddings",
num_dims=2
)
# Explore in the App
session = fo.launch_app(dataset)Score how representative each sample is of your dataset:
import fiftyone.brain as fob
# Compute representativeness scores
fob.compute_representativeness(
dataset,
representativeness_field="bimodernvbert_represent",
method="cluster-center",
embeddings="bimodernvbert_embeddings"
)
# Find most representative samples
representative_view = dataset.sort_by("bimodernvbert_represent", reverse=True)Find and remove near-duplicate documents:
import fiftyone.brain as fob
# Detect duplicates using embeddings
results = fob.compute_uniqueness(
dataset,
embeddings="bimodernvbert_embeddings"
)
# Filter to most unique samples
unique_view = dataset.sort_by("uniqueness", reverse=True)BiModernVBert supports zero-shot classification using cosine similarity between image and text embeddings:
import fiftyone.zoo as foz
# Load model with classes for classification
model = foz.load_zoo_model(
"ModernVBERT/bimodernvbert",
classes=["invoice", "receipt", "form", "contract", "other"],
text_prompt="This document is a"
)
# Apply model for zero-shot classification
dataset.apply_model(
model,
label_field="document_type_predictions"
)
# View predictions
print(dataset.first()['document_type_predictions'])
session = fo.launch_app(dataset)import fiftyone.zoo as foz
# Load model once
model = foz.load_zoo_model("ModernVBERT/bimodernvbert")
# Task 1: Classify document types
model.classes = ["invoice", "receipt", "form", "contract"]
model.text_prompt = "This is a"
dataset.apply_model(model, label_field="doc_type")
# Task 2: Classify importance (reuse same model!)
model.classes = ["high_priority", "medium_priority", "low_priority"]
model.text_prompt = "The priority level is"
dataset.apply_model(model, label_field="priority")
# Task 3: Classify language
model.classes = ["english", "spanish", "french", "german", "chinese"]
model.text_prompt = "The document language is"
dataset.apply_model(model, label_field="language")-
Model Hub: ModernVBERT/bimodernvbert
-
ColPali Engine: colpali-engine
-
FiftyOne Docs: docs.voxel51.com
-
Base Architecture: ModernVBert
If you use BiModernVBert in your research, please cite:
@misc{teiletche2025modernvbertsmallervisualdocument,
title={ModernVBERT: Towards Smaller Visual Document Retrievers},
author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
year={2025},
eprint={2510.01149},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2510.01149},
}
- Model: MIT
- Integration Code: Apache 2.0 (see LICENSE)
Found a bug or have a feature request? Please open an issue on GitHub!
- ModernVBERT Team for the excellent BiModernVBert model
- ColPali Engine for the model implementation and processor
- Voxel51 for the FiftyOne framework and brain module architecture
- HuggingFace for model hosting
