# Image Caption Evaluator

Compare and benchmark image-to-text models from OpenAI and AWS Bedrock on the XTD10 dataset—measure accuracy, latency, and cost in one place.
## Features

- **Automatic dataset setup:** Downloads and extracts the XTD10 multilingual image corpus.
- **Multi-model captioning:** Generates captions using OpenAI GPT-4o variants and AWS Bedrock Nova Lite/Pro.
- **LLM-based evaluation:** Scores generated captions against ground truth via a judge LLM.
- **Comprehensive metrics:** Aggregates accuracy, latency, and cost; exports results as CSV.
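The judge-LLM step boils down to prompting a model to compare a generated caption against the reference and extracting a numeric score from its reply. A minimal sketch of that idea is below; the prompt wording and the `build_judge_prompt` / `parse_judge_score` helpers are illustrative assumptions, not the repo's actual implementation.

```python
import re


def build_judge_prompt(ground_truth: str, candidate: str) -> str:
    """Assemble an illustrative judge prompt comparing two captions."""
    return (
        "Rate how well the candidate caption matches the reference "
        "on a 0-10 scale. Reply with 'Score: N'.\n"
        f"Reference: {ground_truth}\n"
        f"Candidate: {candidate}"
    )


def parse_judge_score(reply: str) -> float:
    """Extract 'Score: N' from a judge reply and normalize it to 0-1."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return min(float(match.group(1)) / 10.0, 1.0)
```

The prompt string would be sent to whichever judge model you configure; only the parsing logic here is runnable as-is.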
## Prerequisites

- Python 3.8+
- OpenAI API key: set `OPENAI_API_KEY`
- AWS credentials with Bedrock access: set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` (and `AWS_SESSION_TOKEN` if required)
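Before running a long evaluation, it can help to fail fast if credentials are missing. A small sketch of such a pre-flight check (the `missing_credentials` helper is ours, not part of the repo):

```python
import os

REQUIRED = ["OPENAI_API_KEY", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def missing_credentials(env=os.environ):
    """Return the names of required credential variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("All required credentials are set.")
```

Note that `AWS_SESSION_TOKEN` is deliberately not in `REQUIRED`, since it is only needed for temporary credentials.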
## Quick Start

```bash
git clone https://github.com/tavily-ai/image-caption-evaluator.git
cd image-caption-evaluator
pip install -r requirements.txt
python run_evaluation.py
```

The script will:
- Download & extract images (if needed)
- Fetch captions for the chosen language
- Generate and evaluate captions across all models
- Save `results.csv` with per-image metrics
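Once `results.csv` exists, the per-image rows can be rolled up into a per-model summary. A standard-library sketch, assuming the column names shown in the table below (the `summarize` helper is illustrative, not shipped with the repo):

```python
import csv
from collections import defaultdict
from statistics import mean


def summarize(csv_path: str) -> dict:
    """Aggregate per-image rows into per-model mean score/latency and total cost."""
    rows = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            rows[row["model"]].append(row)
    return {
        model: {
            "similarity_score": mean(float(r["similarity_score"]) for r in group),
            "latency": mean(float(r["latency"]) for r in group),
            "cost_usd": sum(float(r["cost_usd"]) for r in group),
        }
        for model, group in rows.items()
    }
```

Cost is summed rather than averaged, since total spend per model is usually the number you care about.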
## Output

A CSV with columns:

| image_filename | model | similarity_score | latency | cost_usd | … |
|---|---|---|---|---|---|
Use your favorite plotting library to visualize trade-offs.
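As one possible starting point, a matplotlib scatter of cost against accuracy makes the trade-off between models visible at a glance. The `plot_tradeoff` helper and the output filename are our own example, not part of the repo:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; no display required
import matplotlib.pyplot as plt


def plot_tradeoff(models, scores, costs, out_path="tradeoff.png"):
    """Scatter mean similarity score against cost, one labeled point per model."""
    fig, ax = plt.subplots()
    ax.scatter(costs, scores)
    for model, x, y in zip(models, costs, scores):
        ax.annotate(model, (x, y))
    ax.set_xlabel("total cost (USD)")
    ax.set_ylabel("mean similarity score")
    ax.set_title("Accuracy vs. cost per model")
    fig.savefig(out_path)
    plt.close(fig)
    return out_path


if __name__ == "__main__":
    plot_tradeoff(["gpt-4o", "nova-lite"], [0.8, 0.6], [0.05, 0.01])
```

Swapping in latency for cost on the x-axis gives the other trade-off dimension from the CSV.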
## Contributing

1. Fork the repo
2. Create a feature branch
3. Submit a PR
Ideas welcome:
- Add new LLM providers
- Support batching or async evaluation
- Extend to other vision-language tasks
## Contact

Questions or custom integrations? Reach out to Tomer Weiss:

- Email:

## Authors

- Tomer Weiss — Data Scientist @ Tavily
- Eyal Ben Barouch — Head of Data @ Tavily
Powered by Tavily — The web API built for AI agents