Multimodal RAG System

A production-ready multimodal RAG system for document search with advanced natural language understanding. Built with FastAPI, React, Meilisearch, and Temporal for scalable document processing and semantic search.

Key Features:

  • 🔍 Hybrid search (keyword + semantic) across multilingual documents
  • 🧠 Natural language queries with NER and geographical translation
  • 📄 XML document processing (AFP/IPTC NewsML-G2 format)
  • ⚡ Async workflow orchestration with Temporal
  • 🌐 Multilingual support with language-specific indexes
  • 📊 Distributed tracing and observability (Jaeger + OpenTelemetry)

🏗️ Architecture

The system follows a services-based architecture optimized for GPU workloads and scalability:

┌──────────────────────────────────────────────────────────────┐
│                        Frontend (React)                      │
│                     http://localhost:5173                    │
└───────────────────────────┬──────────────────────────────────┘
                            │ REST API
┌───────────────────────────▼──────────────────────────────────┐
│               Main API (FastAPI + Docker)                    │
│                   http://localhost:5050                      │
├──────────────────────────────────────────────────────────────┤
│  • Document Upload & Processing                              │
│  • Natural Language Search (NER, Query Extraction)           │
│  • Hybrid Search (Keyword + Semantic)                        │
│  • Temporal Workflow Orchestration                           │
└─────────┬────────────────┬────────────────┬──────────────────┘
          │                │                │
          │                │                ▼
          │                │     ┌────────────────────────┐
          │                │     │  GPU Services (Host)   │
          │                │     │  Port 8001             │
          │                │     ├────────────────────────┤
          │                │     │  • Speech-to-Text      │
          │                │     │    (Whisper)           │
          │                │     └────────────────────────┘
          │                │
          ▼                ▼
┌──────────────────┐  ┌──────────────────────┐
│   Meilisearch    │  │  LiteLLM Proxy       │
│   Port 7700      │  │  Port 4000           │
├──────────────────┤  ├──────────────────────┤
│  • Full-text     │  │  • Unified LLM API   │
│  • Vector Search │  │  • OpenAI/Ollama     │
│  • Hybrid Search │  │  • Model Switching   │
└──────────────────┘  └──────────────────────┘

┌────────────────────────────────────────────────────┐
│            Background Services (Docker)            │
├────────────────────────────────────────────────────┤
│  • Temporal Worker (Document Processing)           │
│  • Temporal Server (Workflow Orchestration)        │
│  • PostgreSQL (Temporal Persistence)               │
│  • Redis (LiteLLM Cache)                           │
└────────────────────────────────────────────────────┘

Key Components

Main API (multimodal-rag-api)

  • Language: Python + FastAPI
  • Deployment: Docker container
  • Responsibilities:
    • REST API endpoints for document management and search
    • Natural language query processing with NER (Named Entity Recognition)
    • Multilingual geographical entity translation
    • Integration with Meilisearch for hybrid search
    • Temporal workflow orchestration

GPU Services

  • Deployment: Runs on host machine with GPU access
  • Current Services:
    • Speech-to-Text (STT): Whisper model for audio transcription (port 8001)
  • Future Services:
    • Text Embeddings
    • Image Embeddings
    • Advanced NER models

Frontend

  • Framework: React + Vite
  • Features:
    • Document upload with metadata support
    • Multi-mode search (keyword, semantic, hybrid)
    • Natural language queries
    • Real-time search results with highlighting

Infrastructure Services

  • Meilisearch: High-performance search engine with vector support
  • LiteLLM: Unified proxy for switching between OpenAI and Ollama models
  • Temporal: Workflow orchestration for async document processing
  • PostgreSQL: Persistence for Temporal workflows
  • Redis: Caching layer for LiteLLM
  • Jaeger: Distributed tracing with OpenTelemetry for observability

Design Principles

  1. GPU Optimization: GPU-intensive models run as host services for direct hardware access (Mac M1/M2/M3, NVIDIA CUDA)
  2. Services: Independent scaling and deployment of components
  3. Containerization: Docker for reproducible environments (except GPU services)
  4. Workflow Orchestration: Temporal for reliable async processing with retries
  5. Unified LLM API: LiteLLM proxy for seamless model switching (local ↔ cloud)

🚀 Quick Start

Prerequisites

Requirement | Version | Purpose
--- | --- | ---
Docker & Docker Compose | Latest | Main services (API, Meilisearch, Temporal)
Python | 3.11+ | GPU services (optional, for audio transcription)
Node.js | 18+ | Frontend development
Make | Any | Convenience commands
GPU (optional) | Mac M1/M2/M3 or NVIDIA CUDA | Speech-to-text service

1. Environment Setup

# Clone the repository
git clone <repository-url>
cd multimodal-rag

# Create backend environment file
cp backend/multimodal_rag_api/.env.example backend/multimodal_rag_api/.env

# Create frontend environment file
cp frontend/.env.example frontend/.env

# Edit backend/multimodal_rag_api/.env and add your API keys:
# - OPENAI_API_KEY (for embeddings via OpenAI)
# - VOYAGE_API_KEY (for multimodal embeddings)
# - Or configure Ollama for local models (see .env.example)

See Configuration section for detailed environment variable setup.

2. Install GPU Dependencies (Optional - For Audio Transcription)

If you want to use the Speech-to-Text feature:

# Install STT service dependencies
make gpu-install

3. Start the Services

Option A: Full Stack with GPU Services

# Start GPU services + all Docker services
make dev-full

Option B: Start Services Separately

# Start Docker services only (no audio transcription)
make dev

# (Optional) In another terminal, start GPU services
make gpu-start

This starts the Docker services: Main API, Meilisearch, LiteLLM proxy, Temporal server and worker, PostgreSQL, Redis, and Jaeger.

4. Start Frontend

# Start frontend development server
make front-dev

Access the application at http://localhost:5173

👨‍💻 Developer Onboarding

First Time Setup

  1. Start all services: make dev (skips audio features, faster start)

  2. Verify services are running: use the health-check commands listed under Service Health Checks in the Troubleshooting section

  3. Start frontend: make front-dev

  4. Test document upload:

    • Place a test XML file in ./data/test.xml
    • Upload via UI: file path = /app/data/test.xml
    • Monitor workflow in Temporal UI
    • Search for content once indexed

Understanding the Codebase

Core files to read first (in order):

  1. API Routes: backend/.../api/controller/routes.py

    • All REST endpoints with detailed docstrings
    • Start here to understand API surface
  2. Document Processing: backend/.../api/services/document_processor.py

    • How XML documents are parsed and chunked
    • Embedding generation and indexing flow
  3. Search Pipeline: backend/.../api/services/nl_search/pipeline/

    • ner_step.py: Entity extraction (countries, cities, dates)
    • filter_builder_step.py: Meilisearch filter generation
    • Natural language query → structured filters
  4. Workflows: backend/.../temporal_worker/workflows.py

    • Document ingestion orchestration
    • Retry policies and error handling
    • Batch processing logic
  5. Configuration: docker-compose.yml

    • Service dependencies and architecture
    • Port mappings and environment variables
    • Well-commented for easy understanding

Common Development Tasks

Add a New API Endpoint

  1. Add route function in api/controller/routes.py
  2. Add service logic in api/services/
  3. Test at http://localhost:5050/docs
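
A minimal sketch of steps 1 and 2 (the route path, response model, and service names below are hypothetical examples, not code from the repository):

# Hypothetical example: a new read-only endpoint wired up with a factory and Depends()
from fastapi import APIRouter, Depends
from pydantic import BaseModel

router = APIRouter()

class StatsResponse(BaseModel):
    total_documents: int

class StatsService:
    """Hypothetical service class; real logic would live in api/services/."""
    async def count_documents(self) -> int:
        return 0  # e.g. query Meilisearch index stats here

def get_stats_service() -> StatsService:
    # Factory function, matching the dependency-injection style used in the codebase
    return StatsService()

@router.get("/stats", response_model=StatsResponse)
async def get_stats(service: StatsService = Depends(get_stats_service)) -> StatsResponse:
    """Appears automatically in Swagger UI at http://localhost:5050/docs."""
    return StatsResponse(total_documents=await service.count_documents())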

Modify Document Processing

  1. Edit api/services/document_processor.py
  2. Restart Temporal worker: docker restart temporal-worker
  3. Test by uploading a document

Change Search Behavior

  1. Modify api/services/meilisearch/client.py or NL search pipeline
  2. API auto-reloads (Docker watch mode enabled)
  3. Test via frontend or Swagger UI

Add a New Entity Type for NL Search

  1. Add entity type to nl_search/pipeline/types.py
  2. Update NER in nl_search/pipeline/ner_step.py
  3. Update filter builder in filter_builder_step.py
  4. Add field mapping in nl_search/schema.py

Debug a Workflow

  1. Open Temporal UI: http://localhost:8080
  2. Find workflow by ID (returned from upload API)
  3. View execution history, inputs, outputs
  4. Check Jaeger for distributed trace
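
The same checks can be scripted with the Temporal Python SDK; a sketch assuming the Temporal server is reachable on localhost:7233 and using the workflow ID returned by the upload API:

import asyncio
from temporalio.client import Client

async def inspect(workflow_id: str) -> None:
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle(workflow_id)
    desc = await handle.describe()
    print(desc.status)      # e.g. RUNNING, COMPLETED, FAILED
    print(desc.start_time)
    # To block until the workflow finishes (raises if it failed):
    # result = await handle.result()

asyncio.run(inspect("your-workflow-id"))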

Project Architecture Patterns

  • Factory Pattern: get_text_embedding_client(), get_meilisearch_client()
  • Pipeline Pattern: NL search with composable steps
  • Dependency Injection: FastAPI Depends() for services
  • Async/Await: Throughout for I/O-bound operations
  • Pydantic Models: Type-safe data validation
  • Retry Policies: Temporal for resilient async processing
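
A toy sketch of the pipeline pattern mentioned above (the step and context names are illustrative, not the actual classes in nl_search/pipeline/):

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class QueryContext:
    """Carries the raw query plus everything the steps add to it."""
    query: str
    entities: dict = field(default_factory=dict)
    filters: List[str] = field(default_factory=list)

Step = Callable[[QueryContext], QueryContext]

def ner_step(ctx: QueryContext) -> QueryContext:
    if "france" in ctx.query.lower():
        ctx.entities["country"] = "FR"
    return ctx

def filter_builder_step(ctx: QueryContext) -> QueryContext:
    if "country" in ctx.entities:
        ctx.filters.append('country = "{}"'.format(ctx.entities["country"]))
    return ctx

def run_pipeline(ctx: QueryContext, steps: List[Step]) -> QueryContext:
    for step in steps:
        ctx = step(ctx)  # each step reads and enriches the shared context
    return ctx

result = run_pipeline(QueryContext("documents from France"), [ner_step, filter_builder_step])
print(result.filters)  # ['country = "FR"']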

Testing Your Changes

# 1. Integration test via UI
make dev && make front-dev
# Upload document, search, verify results

# 2. API test via Swagger
# Visit http://localhost:5050/docs
# Try /search, /documents/upload endpoints

# 3. Check workflow execution
# Visit http://localhost:8080
# Verify no failed workflows

# 4. View traces
# Visit http://localhost:16686
# Check end-to-end request flow

Getting Help

🌐 Service URLs

Service | URL | Description
--- | --- | ---
Frontend | http://localhost:5173 | React web application
Backend API | http://localhost:5050 | FastAPI REST endpoints
API Docs | http://localhost:5050/docs | Interactive Swagger UI
Meilisearch | http://localhost:7700 | Search engine API
Meilisearch UI | http://localhost:24900 | Search index dashboard
LiteLLM Proxy | http://localhost:4000 | Unified LLM API
Temporal UI | http://localhost:8080 | Workflow monitoring
Jaeger UI | http://localhost:16686 | Distributed tracing & observability
STT Service | http://localhost:8001 | Speech-to-text API

⚙️ Configuration

Environment Variables Quick Reference

Backend (backend/multimodal_rag_api/.env):

Category | Variable | Default | Description
--- | --- | --- | ---
Core Services | TEMPORAL_HOST | temporal:7233 | Workflow orchestration server
Core Services | MEILISEARCH_URL | http://meilisearch:7700 | Search engine URL
Core Services | MEILISEARCH_API_KEY | masterKey | ⚠️ Change in production!
Core Services | LITELLM_HOST | http://host.docker.internal:4000 | LLM proxy for embeddings
Text Embeddings | TEXT_EMBEDDING_MODEL | ollama/qwen3-embedding:0.6b | Local Ollama or openai/text-embedding-3-small
Text Embeddings | TEXT_EMBEDDING_DIMENSIONS | 512 | Vector dimension (512, 768, 1536)
Text Embeddings | TEXT_EMBEDDING_HOST | http://host.docker.internal:11434 | Ollama server URL
Image Embeddings | IMAGE_EMBEDDING_MODEL | voyage/voyage-multimodal-3 | Multimodal model for images
Image Embeddings | IMAGE_EMBEDDING_API_KEY | (set your Voyage API key) | Required for image embeddings
Chat/NL Search | CHAT_MODEL | ollama/qwen2.5:7b | Chat model for NL query processing
Chat/NL Search | CHAT_HOST | http://host.docker.internal:11434 | Ollama server
Chat/NL Search | CHAT_TEMPERATURE | 0.1 | Lower = more deterministic
GPU Services | STT_SERVICE_URL | http://host.docker.internal:8001 | Speech-to-text service
Observability | LANGFUSE_PUBLIC_KEY | (optional) | LLM observability platform
Observability | LANGFUSE_SECRET_KEY | (optional) | For production monitoring

Frontend (frontend/.env):

Variable | Default | Description
--- | --- | ---
VITE_API_BASE_URL | http://localhost:5050 | Backend API URL
VITE_API_PREFIX | /multimodal-rag | API route prefix
VITE_ENABLE_UPLOAD | true | Show document upload UI
VITE_ENABLE_HEALTH_CHECK | true | Show system health status

Configuration Scenarios

Scenario 1: Local Development (Ollama - No API Keys Needed)

# Use local Ollama for embeddings and chat
TEXT_EMBEDDING_MODEL=ollama/qwen3-embedding:0.6b
TEXT_EMBEDDING_HOST=http://host.docker.internal:11434
CHAT_MODEL=ollama/qwen2.5:7b
CHAT_HOST=http://host.docker.internal:11434

# No API keys required!

Setup Ollama:

# Install Ollama: https://ollama.com
ollama pull qwen3-embedding:0.6b
ollama pull qwen2.5:7b
ollama serve  # Keep running in background

Scenario 2: Cloud APIs (OpenAI)

# Use OpenAI for embeddings
TEXT_EMBEDDING_MODEL=openai/text-embedding-3-small
TEXT_EMBEDDING_DIMENSIONS=1536
OPENAI_API_KEY=sk-your-key-here

# Use OpenAI for chat
CHAT_MODEL=openai/gpt-4o-mini
OPENAI_API_KEY=sk-your-key-here

Scenario 3: Mixed (Local + Cloud)

# Local Ollama for embeddings (free, private)
TEXT_EMBEDDING_MODEL=ollama/qwen3-embedding:0.6b
TEXT_EMBEDDING_HOST=http://host.docker.internal:11434

# Cloud for multimodal images (Voyage)
IMAGE_EMBEDDING_MODEL=voyage/voyage-multimodal-3
VOYAGE_API_KEY=pa-your-voyage-key

# Cloud for chat (better quality)
CHAT_MODEL=openai/gpt-4o-mini
OPENAI_API_KEY=sk-your-key-here

Important Configuration Notes

  1. host.docker.internal: Docker's special DNS name to access services running on the host machine (like Ollama). Use this to connect from Docker containers to host services.

  2. LiteLLM Proxy: All embedding and chat requests go through LiteLLM proxy (litellm_config.yaml) which handles model routing and caching. Modify this file to add new providers.

  3. Production Security:

    • ⚠️ Change MEILISEARCH_API_KEY from masterKey
    • ⚠️ Use secrets manager, not .env files
    • ⚠️ Enable HTTPS with reverse proxy
    • ⚠️ Set resource limits in docker-compose.yml
  4. Vector Dimensions: Must match between:

    • TEXT_EMBEDDING_DIMENSIONS in .env
    • Meilisearch embedder configuration
    • Model's native dimension (truncate if supported)

📖 Usage Guide

Document Upload

  1. Place documents in ./data/ directory (auto-mounted to Docker containers)
  2. Navigate to the Upload tab in the frontend
  3. Enter file paths relative to /app/data/ (e.g., data/document.pdf)
  4. Optionally add metadata in JSON format (e.g., {"category": "research"})
  5. Click "Upload Documents" to start processing
  6. Monitor workflow status in Temporal UI at http://localhost:8080

Supported Formats: PDF, images (JPEG, PNG), audio files (with STT service)

Natural Language Search

The system supports advanced natural language queries with:

  • Named Entity Recognition (NER): Detects countries, cities, dates, organizations
  • Geographical Translation: Translates place names to multiple languages
  • Date Parsing: Understands relative dates ("last month", "yesterday")
  • Filter Generation: Converts entities to Meilisearch filters

Examples:

"documents from France last month"
"reports about Paris from 2024"
"contracts with USA companies"

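As an illustration of the entity-to-filter step, the first example query might be turned into search parameters roughly like the following (the attribute names country_code and published_at, and the timestamp values, are placeholders rather than the real index schema):

# Hypothetical output of the NL pipeline for "documents from France last month"
entities = {"country": "FR", "period": "last_month"}
search_params = {
    "q": "documents",
    # "last month" is resolved by the date-parsing step into a concrete timestamp range
    "filter": 'country_code = "FR" AND published_at >= 1756684800 AND published_at < 1759276800',
}
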
Search Modes

  1. Keyword Search: Traditional full-text search
  2. Semantic Search: AI-powered vector similarity search
  3. Hybrid Search: Combines keyword + semantic (adjustable ratio)
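
Under the hood these modes map onto Meilisearch's hybrid search, where semanticRatio blends the two signals (0 = keyword only, 1 = semantic only). A sketch of a direct Meilisearch call, assuming a documents-en index, an embedder named default, and the development masterKey:

import requests

MEILI_URL = "http://localhost:7700"
HEADERS = {"Authorization": "Bearer masterKey"}  # default dev key; change in production

resp = requests.post(
    f"{MEILI_URL}/indexes/documents-en/search",
    headers=HEADERS,
    json={
        "q": "solar panels",
        "hybrid": {"semanticRatio": 0.8, "embedder": "default"},  # 0.0 keyword-only, 1.0 semantic-only
        "limit": 10,
    },
)
for hit in resp.json()["hits"]:
    print(hit.get("title"))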

API Usage

# Health check
curl http://localhost:5050/multimodal-rag/health

# Natural language search
curl -X POST "http://localhost:5050/multimodal-rag/search" \
  -H "Content-Type: application/json" \
  -d '{
    "q": "solar panels from France",
    "search_type": "hybrid",
    "federated": true,
    "semanticRatio": 0.8
  }'

# Upload documents
curl -X POST "http://localhost:5050/multimodal-rag/documents/upload" \
  -H "Content-Type: application/json" \
  -d '{
    "file_paths": ["data/document.pdf"],
    "metadata": {"category": "research"},
    "batch_size": 100
  }'

# Check workflow status
curl http://localhost:5050/multimodal-rag/jobs/{workflow_id}/status
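
The same calls from Python, assuming the default local URLs (a convenience sketch; the exact response field names may differ from the ones guessed here):

import time
import requests

BASE = "http://localhost:5050/multimodal-rag"

# Upload a document and keep the workflow ID for status polling
upload = requests.post(
    f"{BASE}/documents/upload",
    json={"file_paths": ["data/document.pdf"], "metadata": {"category": "research"}, "batch_size": 100},
).json()
workflow_id = upload.get("workflow_id")  # adjust to the actual response field

# Poll the workflow status until processing finishes
while True:
    status = requests.get(f"{BASE}/jobs/{workflow_id}/status").json()
    print(status)
    if status.get("status") in ("COMPLETED", "FAILED"):  # adjust to the actual status values
        break
    time.sleep(2)

# Search once indexing is done
results = requests.post(
    f"{BASE}/search",
    json={"q": "solar panels from France", "search_type": "hybrid", "semanticRatio": 0.8},
).json()
print(results)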

📁 Project Structure

multimodal-rag/
├── backend/
│   ├── multimodal_rag_api/           # Main API package (Docker)
│   │   ├── src/multimodal_rag_api/
│   │   │   ├── api/                  # REST endpoints & services
│   │   │   │   ├── controller/       # API routes
│   │   │   │   └── services/         # Business logic
│   │   │   │       ├── chunking.py
│   │   │   │       ├── document_processor.py
│   │   │   │       ├── embedding/    # Embedding services
│   │   │   │       ├── meilisearch/  # Search integration
│   │   │   │       ├── nl_search/    # NL query processing
│   │   │   │       │   └── pipeline/ # NER, translation, filters
│   │   │   │       └── stt/          # STT client
│   │   │   ├── temporal_worker/      # Workflow definitions
│   │   │   ├── meilisearch_utils/    # Search utilities
│   │   │   └── models/               # Data models & settings
│   │   ├── pyproject.toml            # Package dependencies
│   │   ├── docker-compose.yml        # Full stack orchestration
│   │   ├── Dockerfile
│   │   └── .env.example
│   │
│   ├── gpu_services/                 # GPU services (Host)
│   │   ├── stt_service/              # Speech-to-Text
│   │   │   ├── src/stt_service/
│   │   │   │   ├── api.py
│   │   │   │   ├── backend/          # Whisper backends
│   │   │   │   └── main.py
│   │   │   └── pyproject.toml
│   │   ├── start-services.sh         # Service manager
│   │   └── logs/
│   │
│   └── .env.gpu-services.example     # GPU config reference
│
├── frontend/                         # React application
│   ├── src/
│   │   ├── components/               # UI components
│   │   ├── services/                 # API client
│   │   └── App.jsx
│   ├── package.json
│   └── .env.example
│
├── data/                             # Document storage (mounted)
├── config/                           # Configuration files
├── Makefile                          # Development commands
└── README.md

🛠️ Development Commands

# Display all commands
make help

# Docker Services
make dev              # Start Docker services only
make dev-full         # Start GPU + Docker services

# GPU Services
make gpu-install      # Install GPU dependencies (first time)
make gpu-start        # Start GPU services
make gpu-stop         # Stop GPU services
make gpu-status       # Check GPU services status
make gpu-logs         # View GPU services logs (real-time)

# Frontend
make front-dev        # Start development server (port 5173)
make front            # Build & preview production (port 4173)

🐛 Troubleshooting

Common Issues

  1. Port conflicts: Ensure ports 5050, 5173, 7700, 8080, 4000, 8001 are available
  2. Docker issues: Run docker-compose down -v for full reset
  3. GPU services: Check logs with make gpu-logs
  4. Ollama connection: Ensure Ollama is running on host (ollama serve)
  5. File permissions: Verify ./data/ directory is accessible

Service Health Checks

# Backend
curl http://localhost:5050/multimodal-rag/health

# Meilisearch
curl http://localhost:7700/health

# LiteLLM
curl http://localhost:4000/health

# STT Service (if running)
curl http://localhost:8001/health

View Logs

# All Docker services
docker-compose -f backend/multimodal_rag_api/docker-compose.yml logs -f

# Specific service
docker-compose -f backend/multimodal_rag_api/docker-compose.yml logs -f multimodal-rag-api

# GPU services
make gpu-logs

# Temporal workflows
# Visit http://localhost:8080

Cleanup

# Stop all services
docker-compose -f backend/multimodal_rag_api/docker-compose.yml down
make gpu-stop

# Full reset (removes volumes)
docker-compose -f backend/multimodal_rag_api/docker-compose.yml down -v

# Remove Meilisearch data
rm -rf backend/multimodal_rag_api/.meili_data

🧪 Testing

Integration Testing

  1. Start services: make dev-full
  2. Start frontend: make front-dev
  3. Upload a test document via UI
  4. Monitor workflow in Temporal UI (http://localhost:8080)
  5. Perform natural language search
  6. Verify results with proper filtering and highlighting

API Testing

Interactive API documentation at http://localhost:5050/docs

Monitoring Tools

  • Temporal UI (http://localhost:8080): workflow execution history and failures
  • Jaeger UI (http://localhost:16686): end-to-end request traces
  • Meilisearch UI (http://localhost:24900): index contents and settings

🚀 Production Deployment

Deployment Scenarios

Scenario 1: Local Development (Mac with GPU)

# GPU services on host, main API in Docker
STT_SERVICE_URL=http://host.docker.internal:8001
NL_SEARCH_OLLAMA_BASE_URL=http://host.docker.internal:11434/v1

Scenario 2: Cloud GPU (Production)

# GPU services on separate instance
STT_SERVICE_URL=https://gpu-server.example.com:8001
NL_SEARCH_PROVIDER=openai
OPENAI_API_KEY=sk-...

Scenario 3: Kubernetes

  • Deploy main API, Meilisearch, Temporal as pods
  • Run GPU services on GPU-enabled nodes
  • Use service discovery for inter-service communication

Production Checklist

  • Replace masterKey with secure Meilisearch API key
  • Configure production database for Temporal (not SQLite)
  • Set up proper secrets management (not .env files)
  • Enable HTTPS with reverse proxy (nginx, Caddy)
  • Configure resource limits in docker-compose.yml
  • Set up monitoring and alerting
  • Configure backup strategy for Meilisearch data
  • Use production-grade LLM API keys with rate limits

Frontend Production Build

# Build and preview
make front

# Build only
cd frontend && npm run build

# Serve with nginx, Caddy, or any static file server

🔗 Additional Resources

🏛️ Key Architectural Decisions

Understanding the "why" behind design choices helps make better decisions when extending the system:

Why Language-Specific Meilisearch Indexes?

Decision: Separate indexes per language (documents-en, documents-es, etc.)

Why:

  • Better tokenization (language-specific word splitting, stemming)
  • Language-specific stopwords ("the", "and" in English vs "le", "et" in French)
  • Easier to tune relevance per language
  • Cleaner federated search results (merge via guid deduplication)

Alternative considered: Single multilingual index with language field

  • ❌ Worse search quality (generic tokenization)
  • ❌ Can't optimize per language
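
A sketch of the federated-merge idea behind this choice: query each language index separately, then deduplicate by guid (index names follow the documents-<lang> convention described above; scoring is deliberately simplified):

import requests

MEILI_URL = "http://localhost:7700"
HEADERS = {"Authorization": "Bearer masterKey"}

def search_all_languages(query: str, langs=("en", "es", "fr")) -> list:
    seen, merged = set(), []
    for lang in langs:
        resp = requests.post(
            f"{MEILI_URL}/indexes/documents-{lang}/search",
            headers=HEADERS,
            json={"q": query, "limit": 10},
        )
        for hit in resp.json().get("hits", []):
            guid = hit.get("guid")
            if guid not in seen:  # the same article may exist in several language indexes
                seen.add(guid)
                merged.append(hit)
    return merged

print(len(search_all_languages("solar panels")))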

Why GPU Services Run on Host (Not Docker)?

Decision: STT service runs directly on host machine, not in Docker container

Why:

  • Direct GPU access (MPS on Mac, CUDA on NVIDIA)
  • Avoid Docker GPU passthrough complexity
  • Better performance (no virtualization overhead)
  • Simpler debugging (native Python environment)

Trade-off: Less portable, requires host setup

  • ✅ Worth it for 2-3x performance gain

Why Temporal for Document Processing?

Decision: Use Temporal workflows instead of simple async tasks

Why:

  • Reliability: Automatic retries with exponential backoff
  • Visibility: Track execution history, debug failures
  • Scalability: Process thousands of documents with continue-as-new
  • State persistence: Workflows survive crashes and restarts
  • Batching: Control concurrency (max_in_flight_documents)

Alternative considered: Celery, plain async tasks

  • ❌ Manual retry logic, no execution history, harder to debug
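
A condensed sketch of what a retrying Temporal workflow looks like with the Python SDK (the activity and workflow names are illustrative, not the ones defined in temporal_worker/workflows.py):

from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def index_document(path: str) -> str:
    # parse, chunk, embed and index the document here
    return f"indexed {path}"

@workflow.defn
class DocumentIngestionWorkflow:
    @workflow.run
    async def run(self, paths: list) -> list:
        results = []
        for path in paths:
            results.append(
                await workflow.execute_activity(
                    index_document,
                    path,
                    start_to_close_timeout=timedelta(minutes=5),
                    retry_policy=RetryPolicy(
                        maximum_attempts=3,
                        backoff_coefficient=2.0,  # exponential backoff between attempts
                    ),
                )
            )
        return results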

Why LiteLLM Proxy?

Decision: Route all LLM/embedding calls through LiteLLM proxy

Why:

  • Provider switching: Change from Ollama → OpenAI without code changes
  • Caching: Redis cache saves API costs (30-70% hit rate)
  • Unified API: One interface for OpenAI, Cohere, Anthropic, Ollama, etc.
  • Observability: Built-in tracing and cost tracking

Alternative considered: Direct API calls per provider

  • ❌ Provider lock-in, no caching, inconsistent APIs
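
Because the proxy speaks the OpenAI wire format, application code stays identical whichever backend is configured; a sketch talking to the local proxy (model names must match entries in litellm_config.yaml, and the API key depends on how the proxy is configured):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-anything")  # key as configured in the proxy

embeddings = client.embeddings.create(
    model="ollama/qwen3-embedding:0.6b",  # or "openai/text-embedding-3-small"
    input=["solar panels in France"],
)
print(len(embeddings.data[0].embedding))

chat = client.chat.completions.create(
    model="ollama/qwen2.5:7b",
    messages=[{"role": "user", "content": "Extract entities from: documents from France last month"}],
)
print(chat.choices[0].message.content)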

Why Chunking Documents?

Decision: Split large documents into smaller chunks (1000-2000 characters)

Why:

  1. Token limits: Embedding models have max input size (512-8192 tokens)
  2. Search precision: Match specific passages, not entire documents
  3. Better highlighting: Show relevant excerpts to users
  4. Reduced noise: Avoid diluting relevance with irrelevant content

How it works: Hierarchical splitting (paragraphs → sentences → characters) with overlap
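
A simplified sketch of that idea (the real implementation lives in api/services/chunking.py and is more careful about sentence boundaries):

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list:
    """Split on paragraphs first; fall back to a sliding character window with overlap."""
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
        else:
            start = 0
            while start < len(paragraph):
                chunks.append(paragraph[start:start + max_chars])
                start += max_chars - overlap  # overlap keeps context across chunk boundaries
    return chunks

print(len(chunk_text("one short paragraph\n\n" + "x" * 4000)))  # small paragraph stays whole; the long one is windowed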

Why Gazetteer-First NER?

Decision: Use dictionary matching before ML models for entity extraction

Why:

  • Speed: 10-100x faster than ML models
  • Accuracy: 99%+ for known entities (countries, cities)
  • No hallucination: Exact matches only
  • Offline: No API calls needed
  • ML fallback: Available for ambiguous cases

Strategy: Gazetteer → Country codes → Dates → ML model (only for uncovered text)
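
A toy version of that strategy (the lookup table here is tiny; the real pipeline uses full country/city gazetteers and only hands leftover text to the ML model):

import re

GAZETTEER = {"france": ("country", "FR"), "paris": ("city", "PARIS"), "usa": ("country", "US")}

def extract_entities(query: str):
    entities, covered = [], set()
    words = re.findall(r"[A-Za-z]+", query)
    for word in words:
        hit = GAZETTEER.get(word.lower())
        if hit:
            entities.append((word, *hit))  # exact dictionary match: fast, no hallucination
            covered.add(word.lower())
    # Only the words the gazetteer did not cover would go to the (slower) ML model
    leftover = [w for w in words if w.lower() not in covered]
    return entities, leftover

print(extract_entities("reports about Paris from France"))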

Why Matryoshka Embeddings?

Decision: Support truncating embeddings to smaller dimensions

Why:

  • Storage: 512D uses 4x less space than 2048D
  • Speed: Faster similarity search (fewer dimensions)
  • Flexibility: Tune storage/accuracy trade-off per use case
  • No retraining: Models like nomic-embed support this natively

Example: 2048D → 512D with minimal quality loss (~2-3% accuracy drop)
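
The mechanics are just slicing and renormalizing; a sketch with numpy (this assumes the embedding model was trained Matryoshka-style, otherwise truncation costs much more quality):

import numpy as np

def truncate_embedding(vec, dims: int = 512):
    """Keep the first `dims` components and renormalize to unit length."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(v)
    return (v / norm).tolist() if norm > 0 else v.tolist()

full = np.random.rand(2048).tolist()        # stand-in for a 2048-D model output
small = truncate_embedding(full, dims=512)  # keep in sync with TEXT_EMBEDDING_DIMENSIONS and the Meilisearch embedder
print(len(small))                           # 512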

🤝 Contributing

Before contributing, please:

  1. Read the Developer Onboarding section
  2. Understand the Key Architectural Decisions
  3. Check code documentation (all critical files have comprehensive docstrings)

Contribution workflow:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with clear commit messages
  4. Add docstrings to new functions (follow existing patterns)
  5. Test with make dev and make front-dev
  6. Verify workflows in Temporal UI (no failed executions)
  7. Check Jaeger traces for performance issues
  8. Push to your branch and open a Pull Request

Code style:

  • Follow existing patterns (factory functions, dependency injection, async/await)
  • Add type hints (Pydantic models preferred)
  • Write docstrings for public functions (see routes.py for examples)
  • Use descriptive variable names
  • Keep functions focused (single responsibility)

📄 License

MIT
