The current kalliste pipeline uses BLIP2 for image understanding, but newer Vision-Language Models (VLMs) offer significantly better performance and capabilities.
## Current State
- Using BLIP2 for image captioning and understanding
- BLIP2 is from 2023 and has been superseded by more capable models
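Decoupling the pipeline from BLIP2 first would make any later swap cheaper. A minimal sketch of a provider-agnostic captioning interface (all names here are hypothetical, not part of the kalliste codebase; the BLIP2 subclass is a stub rather than a real model wrapper):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CaptionResult:
    """Caption plus provenance, so outputs from different VLMs stay comparable."""
    caption: str
    model_name: str


class ImageCaptioner(ABC):
    """Interface the pipeline codes against; one subclass per VLM backend."""

    @abstractmethod
    def caption(self, image_path: str) -> CaptionResult:
        ...


class Blip2Captioner(ImageCaptioner):
    """Stub standing in for the current BLIP2 backend (a real implementation
    would load the model and run inference here)."""

    def caption(self, image_path: str) -> CaptionResult:
        return CaptionResult(caption=f"stub caption for {image_path}",
                             model_name="blip2")
```

With this seam in place, a GPT-4o or LLaVA backend becomes another `ImageCaptioner` subclass rather than a pipeline rewrite.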
## Proposed Alternatives
Consider evaluating and potentially migrating to:
- GPT-4V / GPT-4o - Excellent multimodal capabilities, API-based
- Claude 3.5 Sonnet - Strong vision capabilities, good for analysis
- LLaVA-1.5/1.6 - Open source, good performance, self-hostable
- InternVL - Strong open source option
- Qwen-VL - Alibaba's VLM, good multilingual support
- CogVLM - Good open source alternative
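The candidate list above splits naturally along one axis that drives the cost and licensing criteria below: API-based versus self-hostable. A small registry capturing that (a sketch; the attributes shown are the ones from this issue, nothing more):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VLMCandidate:
    """One row per candidate from the list above."""
    name: str
    open_source: bool
    api_based: bool


CANDIDATES = [
    VLMCandidate("GPT-4o", open_source=False, api_based=True),
    VLMCandidate("Claude 3.5 Sonnet", open_source=False, api_based=True),
    VLMCandidate("LLaVA-1.6", open_source=True, api_based=False),
    VLMCandidate("InternVL", open_source=True, api_based=False),
    VLMCandidate("Qwen-VL", open_source=True, api_based=False),
    VLMCandidate("CogVLM", open_source=True, api_based=False),
]

# Candidates that can run on our own hardware (relevant to the cost criterion).
self_hostable = [c.name for c in CANDIDATES if c.open_source]
```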
## Evaluation Criteria
- Performance: Accuracy on image understanding tasks
- Speed: Inference time and throughput
- Cost: API costs vs self-hosting requirements
- Integration: Ease of integration with existing pipeline
- Capabilities: Support for complex reasoning, multiple images, etc.
- Licensing: Commercial usage requirements
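To turn these criteria into a ranking, a simple weighted score works well enough for a first pass. A sketch, assuming per-criterion scores normalized to 0-1; the weights shown are placeholders to be agreed on, not a recommendation:

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (each in 0..1) into one number in 0..1."""
    total = sum(weights.values())
    return sum(scores[k] * w for k, w in weights.items()) / total


# Placeholder weights mirroring the criteria above; tune before use.
WEIGHTS = {
    "performance": 0.3,
    "speed": 0.2,
    "cost": 0.2,
    "integration": 0.1,
    "capabilities": 0.1,
    "licensing": 0.1,
}
```

Scoring every candidate with the same rubric keeps the comparison honest even when some are API-based and some are self-hosted.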
## Implementation Plan
- Research and benchmark candidate VLMs
- Build an evaluation dataset from existing kalliste use cases
- Implement a proof of concept with the top 2-3 candidates
- Analyze performance and cost
- Define a migration plan with fallback options
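The benchmarking step above can be sketched as a small harness that runs any captioner over the evaluation set and records latency (a sketch with hypothetical names; in practice `captioner_fn` would wrap a real model call):

```python
import time


def benchmark(captioner_fn, image_paths):
    """Run a captioner over an evaluation set, recording per-image latency.

    captioner_fn: callable taking an image path and returning a caption string.
    Returns one record per image: path, caption, and wall-clock latency.
    """
    results = []
    for path in image_paths:
        start = time.perf_counter()
        caption = captioner_fn(path)
        results.append({
            "image": path,
            "caption": caption,
            "latency_s": time.perf_counter() - start,
        })
    return results
```

Running the same harness against each proof-of-concept backend gives directly comparable speed numbers, and the captured captions can be scored for quality separately.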
## Benefits
- Better image understanding and description quality
- More sophisticated reasoning about image content
- Potential for advanced features (object counting, spatial reasoning, etc.)
- Future-proofing the pipeline with more capable models
This upgrade could significantly improve the quality of kalliste's image processing and analysis capabilities.