A production-ready multimodal RAG system for document search with advanced natural language understanding. Built with FastAPI, React, Meilisearch, and Temporal for scalable document processing and semantic search.
Key Features:
- Hybrid search (keyword + semantic) across multilingual documents
- Natural language queries with NER and geographical translation
- XML document processing (AFP/IPTC NewsML-G2 format)
- Async workflow orchestration with Temporal
- Multilingual support with language-specific indexes
- Distributed tracing and observability (Jaeger + OpenTelemetry)
The system follows a services-based architecture optimized for GPU workloads and scalability:
┌───────────────────────────────────────────────────────────────┐
│                       Frontend (React)                        │
│                    http://localhost:5173                      │
└───────────────────────────┬───────────────────────────────────┘
                            │ REST API
┌───────────────────────────▼───────────────────────────────────┐
│                  Main API (FastAPI + Docker)                   │
│                    http://localhost:5050                      │
├───────────────────────────────────────────────────────────────┤
│  • Document Upload & Processing                               │
│  • Natural Language Search (NER, Query Extraction)            │
│  • Hybrid Search (Keyword + Semantic)                         │
│  • Temporal Workflow Orchestration                            │
└─────────┬────────────────┬─────────────────┬──────────────────┘
          │                │                 │
          │                │                 ▼
          │                │      ┌──────────────────────────┐
          │                │      │   GPU Services (Host)    │
          │                │      │        Port 8001         │
          │                │      ├──────────────────────────┤
          │                │      │  • Speech-to-Text        │
          │                │      │    (Whisper)             │
          │                │      └──────────────────────────┘
          ▼                ▼
┌────────────────────┐  ┌────────────────────────┐
│    Meilisearch     │  │     LiteLLM Proxy      │
│     Port 7700      │  │       Port 4000        │
├────────────────────┤  ├────────────────────────┤
│  • Full-text       │  │  • Unified LLM API     │
│  • Vector Search   │  │  • OpenAI/Ollama       │
│  • Hybrid Search   │  │  • Model Switching     │
└────────────────────┘  └────────────────────────┘

┌──────────────────────────────────────────────────────┐
│              Background Services (Docker)            │
├──────────────────────────────────────────────────────┤
│  • Temporal Worker (Document Processing)             │
│  • Temporal Server (Workflow Orchestration)          │
│  • PostgreSQL (Temporal Persistence)                 │
│  • Redis (LiteLLM Cache)                             │
└──────────────────────────────────────────────────────┘
- Language: Python + FastAPI
- Deployment: Docker container
- Responsibilities:
- REST API endpoints for document management and search
- Natural language query processing with NER (Named Entity Recognition)
- Multilingual geographical entity translation
- Integration with Meilisearch for hybrid search
- Temporal workflow orchestration
- Deployment: Runs on host machine with GPU access
- Current Services:
- Speech-to-Text (STT): Whisper model for audio transcription (port 8001)
- Future Services:
- Text Embeddings
- Image Embeddings
- Advanced NER models
- Framework: React + Vite
- Features:
- Document upload with metadata support
- Multi-mode search (keyword, semantic, hybrid)
- Natural language queries
- Real-time search results with highlighting
- Meilisearch: High-performance search engine with vector support
- LiteLLM: Unified proxy for switching between OpenAI and Ollama models
- Temporal: Workflow orchestration for async document processing
- PostgreSQL: Persistence for Temporal workflows
- Redis: Caching layer for LiteLLM
- Jaeger: Distributed tracing with OpenTelemetry for observability
- GPU Optimization: GPU-intensive models run as host services for direct hardware access (Mac M1/M2/M3, NVIDIA CUDA)
- Services: Independent scaling and deployment of components
- Containerization: Docker for reproducible environments (except GPU services)
- Workflow Orchestration: Temporal for reliable async processing with retries
- Unified LLM API: LiteLLM proxy for seamless model switching (local ↔ cloud)
| Requirement | Version | Purpose |
|---|---|---|
| Docker & Docker Compose | Latest | Main services (API, Meilisearch, Temporal) |
| Python | 3.11+ | GPU services (optional, for audio transcription) |
| Node.js | 18+ | Frontend development |
| Make | Any | Convenience commands |
| GPU (Optional) | Mac M1/M2/M3 or NVIDIA CUDA | Speech-to-text service |
# Clone the repository
git clone <repository-url>
cd multimodal-rag
# Create backend environment file
cp backend/multimodal_rag_api/.env.example backend/multimodal_rag_api/.env
# Create frontend environment file
cp frontend/.env.example frontend/.env
# Edit backend/.env and add your API keys:
# - OPENAI_API_KEY (for embeddings via OpenAI)
# - VOYAGE_API_KEY (for multimodal embeddings)
# - Or configure Ollama for local models (see .env.example)

See the Configuration section for detailed environment variable setup.
If you want to use the Speech-to-Text feature:
# Install STT service dependencies
make gpu-install

# Start GPU services + all Docker services
make dev-full

# Start Docker services only (no audio transcription)
make dev

# (Optional) In another terminal, start GPU services
make gpu-start

This starts:
- Backend API: http://localhost:5050 (API docs at /docs)
- Meilisearch: http://localhost:7700
- Meilisearch UI: http://localhost:24900
- LiteLLM Proxy: http://localhost:4000
- Temporal UI: http://localhost:8080
- Jaeger UI: http://localhost:16686 (distributed tracing)
- GPU Services (if started): STT on port 8001
# Start frontend development server
make front-dev

Access the application at http://localhost:5173
- Start all services: make dev (skips audio features, faster start)
- Verify services are running:
  - Backend API: http://localhost:5050/docs (Swagger UI)
  - Meilisearch UI: http://localhost:24900 (inspect indexes)
  - Temporal UI: http://localhost:8080 (workflow monitoring)
  - Jaeger UI: http://localhost:16686 (distributed tracing)
- Start frontend: make front-dev
- Test document upload:
  - Place a test XML file in ./data/test.xml
  - Upload via the UI: file path = /app/data/test.xml
  - Monitor the workflow in the Temporal UI
  - Search for the content once indexed
Core files to read first (in order):
- API Routes: backend/.../api/controller/routes.py
  - All REST endpoints with detailed docstrings
  - Start here to understand the API surface
- Document Processing: backend/.../api/services/document_processor.py
  - How XML documents are parsed and chunked
  - Embedding generation and indexing flow
- Search Pipeline: backend/.../api/services/nl_search/pipeline/
  - ner_step.py: entity extraction (countries, cities, dates)
  - filter_builder_step.py: Meilisearch filter generation
  - Natural language query → structured filters
- Workflows: backend/.../temporal_worker/workflows.py
  - Document ingestion orchestration
  - Retry policies and error handling
  - Batch processing logic
- Configuration: docker-compose.yml
  - Service dependencies and architecture
  - Port mappings and environment variables
  - Well commented for easy understanding
- Add the route function in api/controller/routes.py
- Add the service logic in api/services/
- Test at http://localhost:5050/docs

- Edit api/services/document_processor.py
- Restart the Temporal worker: docker restart temporal-worker
- Test by uploading a document

- Modify api/services/meilisearch/client.py or the NL search pipeline
- The API auto-reloads (Docker watch mode enabled)
- Test via the frontend or Swagger UI
- Add the entity type to nl_search/pipeline/types.py
- Update NER in nl_search/pipeline/ner_step.py
- Update the filter builder in filter_builder_step.py
- Add the field mapping in nl_search/schema.py (a hedged sketch of these steps follows below)
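The exact types live in the pipeline modules listed above. As a rough, hypothetical illustration only, this is how a new entity type might thread through those steps; the Entity/EntityType shapes and field names below are invented for this sketch, not the project's real definitions:

```python
# Hypothetical sketch of adding an entity type to the NL search pipeline.
# See nl_search/pipeline/types.py and filter_builder_step.py for the real code.
from dataclasses import dataclass
from enum import Enum


class EntityType(str, Enum):
    COUNTRY = "country"
    CITY = "city"
    DATE = "date"
    ORGANIZATION = "organization"  # hypothetical new type declared in types.py


@dataclass
class Entity:
    type: EntityType
    value: str


def build_filter(entity: Entity) -> str:
    # The filter builder maps each entity type to a Meilisearch filter field;
    # "organization" and the field names below are illustrative.
    field_by_type = {
        EntityType.COUNTRY: "country_code",
        EntityType.CITY: "city",
        EntityType.ORGANIZATION: "org_name",
    }
    return f'{field_by_type[entity.type]} = "{entity.value}"'


print(build_filter(Entity(EntityType.COUNTRY, "FR")))  # country_code = "FR"
```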
- Open Temporal UI: http://localhost:8080
- Find workflow by ID (returned from upload API)
- View execution history, inputs, outputs
- Check Jaeger for distributed trace
- Factory Pattern: get_text_embedding_client(), get_meilisearch_client() (see the sketch after this list)
- Pipeline Pattern: NL search with composable steps
- Dependency Injection: FastAPI Depends() for services
- Async/Await: throughout, for I/O-bound operations
- Pydantic Models: type-safe data validation
- Retry Policies: Temporal for resilient async processing
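A minimal sketch of how the factory and dependency-injection patterns fit together. get_meilisearch_client() is named above, but its real signature, the client wrapper, and the construction details are assumptions here:

```python
# Sketch only: factory + FastAPI Depends(), not the project's actual code.
from functools import lru_cache

from fastapi import Depends, FastAPI


class MeilisearchClient:
    """Placeholder for the real Meilisearch client wrapper."""

    def __init__(self, url: str, api_key: str):
        self.url, self.api_key = url, api_key

    async def search(self, index: str, query: str) -> dict:
        return {"hits": []}  # the real wrapper would call Meilisearch here


@lru_cache
def get_meilisearch_client() -> MeilisearchClient:
    # Factory: one place decides how the client is built (URLs, keys, tracing).
    return MeilisearchClient("http://meilisearch:7700", "masterKey")


app = FastAPI()


@app.get("/search")
async def search(q: str, client: MeilisearchClient = Depends(get_meilisearch_client)):
    # FastAPI injects the client, so route handlers stay free of construction details.
    return await client.search(index="documents-en", query=q)
```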
# 1. Integration test via UI
make dev && make front-dev
# Upload document, search, verify results
# 2. API test via Swagger
# Visit http://localhost:5050/docs
# Try /search, /documents/upload endpoints
# 3. Check workflow execution
# Visit http://localhost:8080
# Verify no failed workflows
# 4. View traces
# Visit http://localhost:16686
# Check end-to-end request flow

- Code documentation: all critical files have comprehensive docstrings
- API docs: http://localhost:5050/docs (interactive, try endpoints)
- Workflow debugging: http://localhost:8080 (execution history)
- Service logs: docker-compose logs -f [service-name]
- External docs: links in Additional Resources
| Service | URL | Description |
|---|---|---|
| Frontend | http://localhost:5173 | React web application |
| Backend API | http://localhost:5050 | FastAPI REST endpoints |
| API Docs | http://localhost:5050/docs | Interactive Swagger UI |
| Meilisearch | http://localhost:7700 | Search engine API |
| Meilisearch UI | http://localhost:24900 | Search index dashboard |
| LiteLLM Proxy | http://localhost:4000 | Unified LLM API |
| Temporal UI | http://localhost:8080 | Workflow monitoring |
| Jaeger UI | http://localhost:16686 | Distributed tracing & observability |
| STT Service | http://localhost:8001 | Speech-to-text API |
Backend (backend/multimodal_rag_api/.env):
| Category | Variable | Default | Description |
|---|---|---|---|
| Core Services | TEMPORAL_HOST | temporal:7233 | Workflow orchestration server |
| | MEILISEARCH_URL | http://meilisearch:7700 | Search engine URL |
| | MEILISEARCH_API_KEY | masterKey | Meilisearch API key |
| | LITELLM_HOST | http://host.docker.internal:4000 | LLM proxy for embeddings |
| Text Embeddings | TEXT_EMBEDDING_MODEL | ollama/qwen3-embedding:0.6b | Local Ollama or openai/text-embedding-3-small |
| | TEXT_EMBEDDING_DIMENSIONS | 512 | Vector dimension (512, 768, 1536) |
| | TEXT_EMBEDDING_HOST | http://host.docker.internal:11434 | Ollama server URL |
| Image Embeddings | IMAGE_EMBEDDING_MODEL | voyage/voyage-multimodal-3 | Multimodal model for images |
| | IMAGE_EMBEDDING_API_KEY | Set your Voyage API key | Required for image embeddings |
| Chat/NL Search | CHAT_MODEL | ollama/qwen2.5:7b | Chat model for NL query processing |
| | CHAT_HOST | http://host.docker.internal:11434 | Ollama server |
| | CHAT_TEMPERATURE | 0.1 | Lower = more deterministic |
| GPU Services | STT_SERVICE_URL | http://host.docker.internal:8001 | Speech-to-text service |
| Observability | LANGFUSE_PUBLIC_KEY | (optional) | LLM observability platform |
| | LANGFUSE_SECRET_KEY | (optional) | For production monitoring |
Frontend (frontend/.env):
| Variable | Default | Description |
|---|---|---|
| VITE_API_BASE_URL | http://localhost:5050 | Backend API URL |
| VITE_API_PREFIX | /multimodal-rag | API route prefix |
| VITE_ENABLE_UPLOAD | true | Show document upload UI |
| VITE_ENABLE_HEALTH_CHECK | true | Show system health status |
# Use local Ollama for embeddings and chat
TEXT_EMBEDDING_MODEL=ollama/qwen3-embedding:0.6b
TEXT_EMBEDDING_HOST=http://host.docker.internal:11434
CHAT_MODEL=ollama/qwen2.5:7b
CHAT_HOST=http://host.docker.internal:11434
# No API keys required!

Setup Ollama:
# Install Ollama: https://ollama.com
ollama pull qwen3-embedding:0.6b
ollama pull qwen2.5:7b
ollama serve  # Keep running in the background

# Use OpenAI for embeddings
TEXT_EMBEDDING_MODEL=openai/text-embedding-3-small
TEXT_EMBEDDING_DIMENSIONS=1536
OPENAI_API_KEY=sk-your-key-here
# Use OpenAI for chat
CHAT_MODEL=openai/gpt-4o-mini
OPENAI_API_KEY=sk-your-key-here

# Local Ollama for embeddings (free, private)
TEXT_EMBEDDING_MODEL=ollama/qwen3-embedding:0.6b
TEXT_EMBEDDING_HOST=http://host.docker.internal:11434
# Cloud for multimodal images (Voyage)
IMAGE_EMBEDDING_MODEL=voyage/voyage-multimodal-3
VOYAGE_API_KEY=pa-your-voyage-key
# Cloud for chat (better quality)
CHAT_MODEL=openai/gpt-4o-mini
OPENAI_API_KEY=sk-your-key-here

- host.docker.internal: Docker's special DNS name for reaching services that run on the host machine (such as Ollama). Use it to connect from Docker containers to host services.
- LiteLLM Proxy: all embedding and chat requests go through the LiteLLM proxy (litellm_config.yaml), which handles model routing and caching. Modify this file to add new providers.
- Production Security:
  - Change MEILISEARCH_API_KEY from masterKey
  - Use a secrets manager, not .env files
  - Enable HTTPS with a reverse proxy
  - Set resource limits in docker-compose.yml
- Vector Dimensions: must match between:
  - TEXT_EMBEDDING_DIMENSIONS in .env
  - the Meilisearch embedder configuration
  - the model's native dimension (truncate if supported); see the sketch below
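To keep those places in sync, the dimension registered on the Meilisearch embedder must equal TEXT_EMBEDDING_DIMENSIONS. A hedged sketch using the Meilisearch settings API; the index name (documents-en), embedder name (default), and the userProvided source are assumptions, and the project may configure its embedders differently:

```python
# Sketch: push the same dimension used for embedding generation to Meilisearch.
import os

import requests

dimensions = int(os.environ.get("TEXT_EMBEDDING_DIMENSIONS", "512"))

requests.patch(
    "http://localhost:7700/indexes/documents-en/settings",
    headers={"Authorization": "Bearer masterKey"},
    json={"embedders": {"default": {"source": "userProvided", "dimensions": dimensions}}},
    timeout=10,
)
```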
- Place documents in the ./data/ directory (auto-mounted into the Docker containers)
- Navigate to the Upload tab in the frontend
- Enter file paths relative to /app/data/ (e.g., data/document.pdf)
- Optionally add metadata in JSON format (e.g., {"category": "research"})
- Click "Upload Documents" to start processing
- Monitor workflow status in the Temporal UI at http://localhost:8080
Supported Formats: PDF, images (JPEG, PNG), audio files (with STT service)
The system supports advanced natural language queries with:
- Named Entity Recognition (NER): Detects countries, cities, dates, organizations
- Geographical Translation: Translates place names to multiple languages
- Date Parsing: Understands relative dates ("last month", "yesterday")
- Filter Generation: Converts entities to Meilisearch filters
Examples:
"documents from France last month"
"reports about Paris from 2024"
"contracts with USA companies"
- Keyword Search: Traditional full-text search
- Semantic Search: AI-powered vector similarity search
- Hybrid Search: Combines keyword + semantic (adjustable ratio)
# Health check
curl http://localhost:5050/multimodal-rag/health
# Natural language search
curl -X POST "http://localhost:5050/multimodal-rag/search" \
-H "Content-Type: application/json" \
-d '{
"q": "solar panels from France",
"search_type": "hybrid",
"federated": true,
"semanticRatio": 0.8
}'
# Upload documents
curl -X POST "http://localhost:5050/multimodal-rag/documents/upload" \
-H "Content-Type: application/json" \
-d '{
"file_paths": ["data/document.pdf"],
"metadata": {"category": "research"},
"batch_size": 100
}'
# Check workflow status
curl http://localhost:5050/multimodal-rag/jobs/{workflow_id}/status

multimodal-rag/
├── backend/
│   ├── multimodal_rag_api/              # Main API package (Docker)
│   │   ├── src/multimodal_rag_api/
│   │   │   ├── api/                     # REST endpoints & services
│   │   │   │   ├── controller/          # API routes
│   │   │   │   └── services/            # Business logic
│   │   │   │       ├── chunking.py
│   │   │   │       ├── document_processor.py
│   │   │   │       ├── embedding/       # Embedding services
│   │   │   │       ├── meilisearch/     # Search integration
│   │   │   │       ├── nl_search/       # NL query processing
│   │   │   │       │   └── pipeline/    # NER, translation, filters
│   │   │   │       └── stt/             # STT client
│   │   │   ├── temporal_worker/         # Workflow definitions
│   │   │   ├── meilisearch_utils/       # Search utilities
│   │   │   └── models/                  # Data models & settings
│   │   ├── pyproject.toml               # Package dependencies
│   │   ├── docker-compose.yml           # Full stack orchestration
│   │   ├── Dockerfile
│   │   └── .env.example
│   │
│   ├── gpu_services/                    # GPU services (Host)
│   │   ├── stt_service/                 # Speech-to-Text
│   │   │   ├── src/stt_service/
│   │   │   │   ├── api.py
│   │   │   │   ├── backend/             # Whisper backends
│   │   │   │   └── main.py
│   │   │   └── pyproject.toml
│   │   ├── start-services.sh            # Service manager
│   │   └── logs/
│   │
│   └── .env.gpu-services.example        # GPU config reference
│
├── frontend/                            # React application
│   ├── src/
│   │   ├── components/                  # UI components
│   │   ├── services/                    # API client
│   │   └── App.jsx
│   ├── package.json
│   └── .env.example
│
├── data/                                # Document storage (mounted)
├── config/                              # Configuration files
├── Makefile                             # Development commands
└── README.md
# Display all commands
make help
# Docker Services
make dev # Start Docker services only
make dev-full # Start GPU + Docker services
# GPU Services
make gpu-install # Install GPU dependencies (first time)
make gpu-start # Start GPU services
make gpu-stop # Stop GPU services
make gpu-status # Check GPU services status
make gpu-logs # View GPU services logs (real-time)
# Frontend
make front-dev # Start development server (port 5173)
make front        # Build & preview production (port 4173)

- Port conflicts: ensure ports 5050, 5173, 7700, 8080, 4000, 8001 are available
- Docker issues: run docker-compose down -v for a full reset
- GPU services: check the logs with make gpu-logs
- Ollama connection: ensure Ollama is running on the host (ollama serve)
- File permissions: verify the ./data/ directory is accessible
# Backend
curl http://localhost:5050/multimodal-rag/health
# Meilisearch
curl http://localhost:7700/health
# LiteLLM
curl http://localhost:4000/health
# STT Service (if running)
curl http://localhost:8001/health

# All Docker services
docker-compose -f backend/multimodal_rag_api/docker-compose.yml logs -f
# Specific service
docker-compose -f backend/multimodal_rag_api/docker-compose.yml logs -f multimodal-rag-api
# GPU services
make gpu-logs
# Temporal workflows
# Visit http://localhost:8080

# Stop all services
docker-compose -f backend/multimodal_rag_api/docker-compose.yml down
make gpu-stop
# Full reset (removes volumes)
docker-compose -f backend/multimodal_rag_api/docker-compose.yml down -v
# Remove Meilisearch data
rm -rf backend/multimodal_rag_api/.meili_data

- Start services: make dev-full
- Start frontend: make front-dev
- Upload a test document via the UI
- Monitor workflow in Temporal UI (http://localhost:8080)
- Perform natural language search
- Verify results with proper filtering and highlighting
Interactive API documentation at http://localhost:5050/docs
- Temporal UI: http://localhost:8080 (workflow execution, history, debugging)
- Meilisearch UI: http://localhost:24900 (indexes, document counts, search testing)
- Jaeger UI: http://localhost:16686 (distributed tracing, performance analysis)
- Langfuse (optional): https://cloud.langfuse.com (LLM observability)
# GPU services on host, main API in Docker
STT_SERVICE_URL=http://host.docker.internal:8001
NL_SEARCH_OLLAMA_BASE_URL=http://host.docker.internal:11434/v1

# GPU services on a separate instance
STT_SERVICE_URL=https://gpu-server.example.com:8001
NL_SEARCH_PROVIDER=openai
OPENAI_API_KEY=sk-...

- Deploy the main API, Meilisearch, and Temporal as pods
- Run GPU services on GPU-enabled nodes
- Use service discovery for inter-service communication
- Replace masterKey with a secure Meilisearch API key
- Configure a production database for Temporal (not SQLite)
- Set up proper secrets management (not .env files)
- Enable HTTPS with a reverse proxy (nginx, Caddy)
- Configure resource limits in docker-compose.yml
- Set up monitoring and alerting
- Configure backup strategy for Meilisearch data
- Use production-grade LLM API keys with rate limits
# Build and preview
make front
# Build only
cd frontend && npm run build
# Serve with nginx, Caddy, or any static file server

- Meilisearch: https://www.meilisearch.com/docs/
- Temporal: https://docs.temporal.io/
- LiteLLM: https://docs.litellm.ai/
- FastAPI: https://fastapi.tiangolo.com/
- React: https://react.dev/
- Ollama: https://ollama.com/
Understanding the "why" behind design choices helps make better decisions when extending the system:
Decision: Separate indexes per language (documents-en, documents-es, etc.)
Why:
- Better tokenization (language-specific word splitting, stemming)
- Language-specific stopwords ("the", "and" in English vs "le", "et" in French)
- Easier to tune relevance per language
- Cleaner federated search results (merged via guid deduplication); a federated-search sketch follows below

Alternative considered: a single multilingual index with a language field
- Worse search quality (generic tokenization)
- Can't be optimized per language
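A rough sketch of federated search across per-language indexes with client-side deduplication on guid, using Meilisearch's multi-search endpoint. The index names and the guid field come from this README; the exact request shape and dedup step used by the project may differ:

```python
# Sketch: query per-language indexes in one federated request, dedup by guid.
import requests

resp = requests.post(
    "http://localhost:7700/multi-search",
    headers={"Authorization": "Bearer masterKey"},
    json={
        "federation": {},
        "queries": [
            {"indexUid": "documents-en", "q": "solar panels"},
            {"indexUid": "documents-es", "q": "solar panels"},
        ],
    },
    timeout=10,
)

seen, unique_hits = set(), []
for hit in resp.json().get("hits", []):
    if hit.get("guid") not in seen:  # keep one result per document across languages
        seen.add(hit.get("guid"))
        unique_hits.append(hit)
```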
Decision: STT service runs directly on host machine, not in Docker container
Why:
- Direct GPU access (MPS on Mac, CUDA on NVIDIA)
- Avoid Docker GPU passthrough complexity
- Better performance (no virtualization overhead)
- Simpler debugging (native Python environment)
Trade-off: Less portable, requires host setup
- Worth it for the 2-3x performance gain
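For reference, the containerized API reaches the host-side STT service over host.docker.internal. The sketch below is hypothetical: the /transcribe endpoint and the multipart field name are assumptions, not the actual STT service API (see backend/gpu_services/stt_service/src/stt_service/api.py):

```python
# Hypothetical sketch of calling the host STT service from inside a container.
import requests

STT_SERVICE_URL = "http://host.docker.internal:8001"  # from .env when calling from Docker

with open("data/interview.wav", "rb") as audio:
    resp = requests.post(f"{STT_SERVICE_URL}/transcribe", files={"file": audio}, timeout=120)
print(resp.json())
```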
Decision: Use Temporal workflows instead of simple async tasks
Why:
- Reliability: Automatic retries with exponential backoff
- Visibility: Track execution history, debug failures
- Scalability: Process thousands of documents with continue-as-new
- State persistence: Workflows survive crashes and restarts
- Batching: control concurrency (max_in_flight_documents); a workflow sketch follows below

Alternative considered: Celery or plain async tasks
- Manual retry logic, no execution history, harder to debug
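A minimal, illustrative Temporal workflow showing the retry-policy and bounded-concurrency ideas above. The activity name, batching scheme, and timeouts are assumptions; the real workflow lives in temporal_worker/workflows.py:

```python
# Sketch only: per-activity retries with exponential backoff, bounded concurrency.
import asyncio
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy


@workflow.defn
class IngestDocumentsWorkflow:
    @workflow.run
    async def run(self, file_paths: list[str]) -> int:
        retry = RetryPolicy(maximum_attempts=5, backoff_coefficient=2.0)
        max_in_flight = 10  # mirrors the max_in_flight_documents idea above
        done = 0
        for i in range(0, len(file_paths), max_in_flight):
            batch = file_paths[i : i + max_in_flight]
            await asyncio.gather(
                *[
                    workflow.execute_activity(
                        "process_document",  # activity registered on the worker
                        path,
                        start_to_close_timeout=timedelta(minutes=10),
                        retry_policy=retry,
                    )
                    for path in batch
                ]
            )
            done += len(batch)
        return done
```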
Decision: Route all LLM/embedding calls through LiteLLM proxy
Why:
- Provider switching: change from Ollama to OpenAI without code changes
- Caching: Redis cache saves API costs (30-70% hit rate)
- Unified API: One interface for OpenAI, Cohere, Anthropic, Ollama, etc.
- Observability: Built-in tracing and cost tracking
Alternative considered: Direct API calls per provider
- Provider lock-in, no caching, inconsistent APIs
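Because the proxy speaks the OpenAI protocol, application code can use a standard OpenAI client pointed at port 4000 and stay unchanged when the underlying provider switches in litellm_config.yaml. A small sketch; whether the api_key is checked depends on how the proxy is configured:

```python
# Sketch: talk to the LiteLLM proxy with the standard OpenAI client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="anything")

resp = client.embeddings.create(
    model="ollama/qwen3-embedding:0.6b",  # the README's default embedding model
    input="solar panels in France",
)
print(len(resp.data[0].embedding))
```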
Decision: Split large documents into smaller chunks (1000-2000 characters)
Why:
- Token limits: Embedding models have max input size (512-8192 tokens)
- Search precision: Match specific passages, not entire documents
- Better highlighting: Show relevant excerpts to users
- Reduced noise: Avoid diluting relevance with irrelevant content
How it works: hierarchical splitting (paragraphs → sentences → characters) with overlap; a toy sketch follows below
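A toy sketch of the idea; the real logic lives in api/services/chunking.py and differs in detail:

```python
# Illustrative hierarchical chunking with overlap (not the project's implementation).
def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    # Split on paragraph boundaries first, then fall back to a sliding
    # character window for paragraphs that are still too long.
    chunks: list[str] = []
    for paragraph in text.split("\n\n"):
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
            continue
        start = 0
        while start < len(paragraph):
            chunks.append(paragraph[start : start + max_chars])
            start += max_chars - overlap  # overlap keeps context across chunk borders
    return [c for c in chunks if c.strip()]
```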
Decision: Use dictionary matching before ML models for entity extraction
Why:
- Speed: 10-100x faster than ML models
- Accuracy: 99%+ for known entities (countries, cities)
- No hallucination: Exact matches only
- Offline: No API calls needed
- ML fallback: Available for ambiguous cases
Strategy: gazetteer → country codes → dates → ML model (only for uncovered text); a sketch follows below
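A simplified sketch of the dictionary-first strategy; the gazetteer contents and the ml_ner fallback are placeholders:

```python
# Illustrative gazetteer-first entity extraction with an ML fallback stub.
GAZETTEER = {"france": ("country", "FR"), "paris": ("city", "Paris"), "usa": ("country", "US")}


def extract_entities(query: str) -> list[tuple[str, str]]:
    entities, leftover = [], []
    for token in query.lower().split():
        if token in GAZETTEER:
            entities.append(GAZETTEER[token])  # fast, exact, no hallucination
        else:
            leftover.append(token)
    # entities += ml_ner(" ".join(leftover))   # ML model only for uncovered text
    return entities


print(extract_entities("documents from France last month"))  # [('country', 'FR')]
```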
Decision: Support truncating embeddings to smaller dimensions
Why:
- Storage: 512D uses 4x less space than 2048D
- Speed: Faster similarity search (fewer dimensions)
- Flexibility: Tune storage/accuracy trade-off per use case
- No retraining: Models like nomic-embed support this natively
Example: 2048D → 512D with minimal quality loss (~2-3% accuracy drop); a truncation sketch follows below
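A small sketch of the truncate-and-renormalize step; illustrative only, and only valid for models trained to support it (such as the nomic-embed family mentioned above):

```python
# Keep the first N dimensions of an embedding and re-normalize to unit length.
import math


def truncate_embedding(vector: list[float], dims: int = 512) -> list[float]:
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]  # unit-length vector in the smaller space
```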
Before contributing, please:
- Read the Developer Onboarding section
- Understand the Key Architectural Decisions
- Check code documentation (all critical files have comprehensive docstrings)
Contribution workflow:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes with clear commit messages
- Add docstrings to new functions (follow existing patterns)
- Test with make dev and make front-dev
- Verify workflows in the Temporal UI (no failed executions)
- Check Jaeger traces for performance issues
- Push to your branch and open a Pull Request
Code style:
- Follow existing patterns (factory functions, dependency injection, async/await)
- Add type hints (Pydantic models preferred)
- Write docstrings for public functions (see routes.py for examples)
- Use descriptive variable names
- Keep functions focused (single responsibility)
MIT