Consider swapping out BLIP2 for an actual VLM #5

@spacegoatai

The current kalliste pipeline uses BLIP2 for image understanding, but newer vision-language models (VLMs) generally produce more accurate descriptions and handle instruction following and multi-step reasoning, which BLIP2 does poorly.

Current State

  • Using BLIP2 for image captioning and understanding
  • BLIP2 is from 2023 and has been superseded by more capable models

Proposed Alternatives

Consider evaluating and potentially migrating to:

  • GPT-4V / GPT-4o - Excellent multimodal capabilities, API-based
  • Claude 3.5 Sonnet - Strong vision capabilities, good for analysis
  • LLaVA-1.5/1.6 - Open-source, good performance, self-hostable
  • InternVL - Strong open-source option
  • Qwen-VL - Alibaba's VLM, good multilingual support
  • CogVLM - Good open-source alternative

Evaluation Criteria

  • Performance: Accuracy on image understanding tasks
  • Speed: Inference time and throughput
  • Cost: API fees vs. self-hosting costs (GPU hardware, maintenance)
  • Integration: Ease of integration with existing pipeline
  • Capabilities: Support for complex reasoning, multiple images, etc.
  • Licensing: Commercial usage requirements
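
The performance and speed criteria could be measured with a small harness like the one below. This is only a sketch under stated assumptions: the `Captioner` type, `EvalResult`, and `evaluate` are hypothetical names, not part of kalliste, and the word-overlap score is a crude stand-in for whatever caption metric the evaluation ends up using.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

# Hypothetical type: a captioner maps an image path to a caption string.
# Each candidate (BLIP2, an API-based VLM, a self-hosted VLM) would be
# wrapped to match this signature.
Captioner = Callable[[str], str]


@dataclass
class EvalResult:
    name: str
    mean_latency_s: float
    mean_overlap: float


def token_overlap(candidate: str, reference: str) -> float:
    """Jaccard overlap of lowercase word sets (a crude placeholder metric)."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand and not ref:
        return 1.0
    return len(cand & ref) / len(cand | ref)


def evaluate(
    name: str,
    captioner: Captioner,
    samples: Sequence[tuple[str, str]],
) -> EvalResult:
    """Run a captioner over (image_path, reference_caption) pairs.

    Records wall-clock latency per call and the overlap score per caption.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    latencies = []
    scores = []
    for image_path, reference in samples:
        start = time.perf_counter()
        caption = captioner(image_path)
        latencies.append(time.perf_counter() - start)
        scores.append(token_overlap(caption, reference))
    return EvalResult(name, mean(latencies), mean(scores))
```

Running `evaluate` once per candidate over the same sample set gives directly comparable latency and quality numbers; cost and licensing still need to be assessed separately.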

Implementation Plan

  1. Research and benchmark candidate VLMs
  2. Create evaluation dataset from existing kalliste use cases
  3. Implement proof of concept with top 2-3 candidates
  4. Analyze performance and cost for each candidate
  5. Draft a migration plan with fallback options
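
One way the fallback in step 5 might look is a wrapper that tries backends in order, so BLIP2 can stay in place as the last resort while a new VLM is rolled out. This is a minimal sketch, not kalliste's actual design: `with_fallback`, the `Captioner` type, and the backend names in the usage note are all hypothetical.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)

# Hypothetical type: a captioner maps an image path to a caption string.
Captioner = Callable[[str], str]


def with_fallback(backends: Sequence[tuple[str, Captioner]]) -> Captioner:
    """Build a captioner that tries each (name, backend) pair in order.

    The first backend that returns without raising wins. If every backend
    raises, a RuntimeError is raised, chained to the last failure.
    """
    if not backends:
        raise ValueError("at least one backend is required")

    def caption(image_path: str) -> str:
        last_error: Optional[Exception] = None
        for name, backend in backends:
            try:
                return backend(image_path)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                logger.warning("captioning backend %s failed: %s", name, exc)
                last_error = exc
        raise RuntimeError("all captioning backends failed") from last_error

    return caption
```

For example, `with_fallback([("new_vlm", new_vlm_caption), ("blip2", blip2_caption)])` would prefer the new model and fall back to BLIP2 on errors such as API timeouts, assuming both functions are wrapped to take an image path and return a string.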

Benefits

  • Better image understanding and description quality
  • More sophisticated reasoning about image content
  • Potential for advanced features (object counting, spatial reasoning, etc.)
  • Future-proofing the pipeline with more capable models

This upgrade could significantly improve the quality of kalliste's image processing and analysis capabilities.
