Moondream2 is a powerful vision-language model that can be used with FiftyOne for various image understanding tasks. This implementation allows you to easily integrate Moondream2 into your FiftyOne workflows.
NOTE: Due to recent changes in Transformers 4.50.0 (which Hugging Face plans to patch), please ensure you have `transformers<=4.49.0` installed before running the model:

```bash
pip install "transformers<=4.49.0"
# Add other dependencies as needed
```
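If you want to verify the pin programmatically, a quick sanity check along these lines can help (a sketch; the exact version requirement may change once the upstream fix lands):

```python
# A sketch of a version sanity check for the transformers pin described above
from packaging import version

import transformers

assert version.parse(transformers.__version__) <= version.parse("4.49.0"), (
    "Moondream2 currently requires transformers<=4.49.0"
)
```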
The model supports five main operations:

- **Caption Generation** (`caption`)
  - Generates image descriptions with adjustable length (short, normal, long)
  - No prompt required
- **Visual Question Answering** (`query`)
  - Answers specific questions about image content
  - Requires a text prompt/question
- **Object Detection** (`detect`)
  - Locates objects in images with bounding boxes
  - Requires a prompt specifying what to detect
- **Point Identification** (`point`)
  - Identifies specific points of interest in images
  - Requires a prompt specifying what points to identify
- **Classification** (`classify`)
  - Performs zero-shot classification on images
  - Requires a prompt specifying the possible classes to choose from
The model automatically selects the best available device:
- CUDA (GPU) if available
- Apple Metal (MPS) if available
- CPU as fallback
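For reference, the selection order described above corresponds to logic along these lines (a minimal sketch, assuming the implementation uses PyTorch's standard availability checks):

```python
# A minimal sketch of the device-selection order described above,
# assuming standard PyTorch availability checks
import torch

if torch.cuda.is_available():
    device = "cuda"  # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = "mps"   # Apple Metal
else:
    device = "cpu"   # fallback
```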
A few additional notes:

- The model requires local installation of the model files
- Symbolic links are automatically created for custom model code
- Make sure to set the appropriate operation and prompt before running inference
This repository provides an implementation of Moondream2 as a remotely sourced model for the FiftyOne computer vision toolkit.
To use Moondream2 with FiftyOne, follow these steps:
- Register the model source:

```python
import fiftyone as fo
import fiftyone.zoo as foz

foz.register_zoo_model_source("https://github.com/harpreetsahota204/moondream2", overwrite=True)
```
- Download the model:

```python
foz.download_zoo_model(
    "https://github.com/harpreetsahota204/moondream2",
    model_name="vikhyatk/moondream2",
)
```
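To confirm the download succeeded, you can list the downloaded zoo models (a quick sanity check using FiftyOne's standard zoo API):

```python
# Sanity check: the model should now appear among the downloaded zoo models
print(foz.list_downloaded_zoo_models())
```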
- Load the model:

```python
model = foz.load_zoo_model(
    "vikhyatk/moondream2",
    revision="2025-03-27",
    # install_requirements=True,  # if you are using the model for the first time and need to install requirements
    # ensure_requirements=True,  # ensure any requirements are installed before loading the model
)
```

The same model instance can be used for different operations by simply changing its properties:
Moondream2 supports three caption length options: `short`, `normal`, and `long`.
model.operation = "caption"
model.length = "short"
dataset.apply_model(
model,
label_field="short_captions",
)model.length = "long"
dataset.apply_model(
model,
label_field="long_captions",
)Classify images in a zero-shot manner
model.operation="classify"
model.prompt= "surfer, wave, bird" # you can also pass a Python list: ["surfer", "wave", "bird"]
dataset.apply_model(model, label_field="classification")Detect specific objects in images by providing a prompt:
model.operation = "detect"
model.prompt = "surfer, wave, bird" # you can also pass a Python list: ["surfer", "wave", "bird"]
dataset.apply_model(model, label_field="detections")Identify keypoints for specific object types:
model.operation = "point"
model.prompt = "surfer, wave, bird" # you can also pass a Python list: ["surfer", "wave", "bird"]
dataset.apply_model(model, label_field="pointings")Ask questions about the content of images. This can be used in a variety of ways, for example you can ask it to perfom OCR.
model.operation = "query"
model.prompt = "What is in the background of the image"
dataset.apply_model(model, label_field="vqa_response")You can also use fields from your dataset as prompts:
You can also use fields from your dataset as prompts:

```python
# Set a field with questions for each sample
dataset.set_values("questions", ["Where is the general location of this scene?"] * len(dataset))

# Use that field as the prompt source
dataset.apply_model(
    model,
    label_field="query_field_response",
    prompt_field="questions",
)
```

Moondream2 returns different types of output depending on the operation:
- `caption`: Returns a string containing the image description
- `query`: Returns a string containing the answer to the question
- `classify`: Returns a `fiftyone.core.labels.Classifications` object containing a single classification label
- `detect`: Returns a `fiftyone.core.labels.Detections` object containing:
  - Normalized bounding boxes in the range `[0, 1]`
  - Label field containing the detected object class
- `point`: Returns a `fiftyone.core.labels.Keypoints` object containing:
  - Normalized point coordinates `[x, y]` in the range `[0, 1] x [0, 1]`
  - Label field containing the point class
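For example, after running the operations above you can read the stored fields back from a sample, or browse everything in the FiftyOne App (the field names below match the examples in this README):

```python
# Inspect the output fields produced by the examples above
sample = dataset.first()

print(sample["short_captions"])  # str
print(sample["vqa_response"])    # str
print(sample["classification"])  # fiftyone.core.labels.Classifications
print(sample["detections"])      # fiftyone.core.labels.Detections
print(sample["pointings"])       # fiftyone.core.labels.Keypoints

# Browse the results visually
session = fo.launch_app(dataset)
```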
If you use Moondream2 in your work, please cite:

```bibtex
@misc{moondream2024,
  author = {Korrapati, Vikhyat and others},
  title = {Moondream: A Tiny Vision Language Model},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/vikhyat/moondream},
  commit = {main}
}
```