Benchmark suite for Vision Language Models. Compare accuracy, cost, and speed across frontier VLMs on standardized visual tasks.
- VQA / OCR -- visual question answering and optical character recognition
- Anthropic (Claude)
- Google (Gemini)
- OpenAI (GPT)
pip install vlm-exam

Or install from source:
git clone https://github.com/roboflow/vlm-exam.git
cd vlm-exam
pip install -e ".[dev]"

export ANTHROPIC_API_KEY=...
export GOOGLE_API_KEY=...
export OPENAI_API_KEY=...
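
Before launching a run, it can help to confirm that the keys are actually visible to the process. The following is a minimal sketch and not part of vlm-exam: `missing_api_keys` is a hypothetical helper, and the only names taken from this README are the three environment variables above. A run that uses a single provider presumably needs only that provider's key, so treat the full list as an assumption.

```python
import os

# Environment variable names taken from the export lines above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(env=None):
    """Return the required key names that are unset or empty in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys: " + ", ".join(missing))
    else:
        print("All API keys are set.")
```
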
vlm-exam run \
--task vqa \
--models claude-fable-5,gemini-3.5-flash,gpt-5.5 \
--effort high \
--dataset-directory /path/to/vqa/dataset
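
When scripting several runs, the same command can be assembled from Python. This is a sketch under stated assumptions: `build_run_command` is a hypothetical helper, not something vlm-exam provides, and it uses only the flags shown in the command above.

```python
import shlex


def build_run_command(task, models, effort, dataset_directory):
    """Build the argument list for `vlm-exam run` (hypothetical helper).

    Only the flags shown in this README are used; `models` is joined
    with commas, matching the --models value above.
    """
    return [
        "vlm-exam", "run",
        "--task", task,
        "--models", ",".join(models),
        "--effort", effort,
        "--dataset-directory", dataset_directory,
    ]


cmd = build_run_command(
    task="vqa",
    models=["claude-fable-5", "gemini-3.5-flash", "gpt-5.5"],
    effort="high",
    dataset_directory="/path/to/vqa/dataset",
)

# Print the shell-quoted form for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when a dataset path contains spaces.
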
vlm-exam report --results-directory results/

from vlm_exam import load_config, create_provider, create_task, run_benchmark
config = load_config()
task = create_task("vqa")
samples = task.load_samples("/path/to/vqa/dataset")
provider = create_provider("anthropic", model="claude-fable-5")
results = run_benchmark(task=task, provider=provider, samples=samples, effort="high")

Model definitions, pricing, and lab branding live in
src/vlm_exam/configs/models.yaml. Add a new model by editing this file --
no code changes required.
Apache 2.0. See LICENSE.