A low-code, comprehensive framework for evaluating and comparing Retrieval-Augmented Generation (RAG) systems across different models, embedding techniques, and datasets. Configure your evaluation through YAML files without writing a single line of code.
- Low-Code Solution: Configure your entire RAG evaluation pipeline through simple YAML files
- Dual Deployment Options: Run evaluations on cloud-based models or locally with Ollama
- Comprehensive Metrics: Evaluate both retrieval quality and generation accuracy
- Seamless LangSmith Integration: Sync results directly with LangChain's LangSmith for advanced analytics
- Multilingual Support: Test RAG systems on English, Arabic, and other languages
- Visualized Reports: Interactive HTML dashboards for easy result interpretation
Developed by ScaleXI Innovation, specialists in Generative AI and Large Language Model solutions.
```
scalexi_rag_bench/
├── config/                      # Configuration files for RAG evaluations
├── data/                        # Test datasets for evaluation
│   ├── english/                 # English language datasets
│   └── arabic/                  # Arabic language datasets
├── examples/                    # Example scripts for different evaluation scenarios
│   └── results/                 # Results from example evaluations
├── rag_tools/                   # Core evaluation scripts and pipelines
│   └── results/                 # Results from RAG evaluations
├── results/                     # Evaluation results by model and language
│   ├── langsmith_evaluation/    # Results from LangSmith evaluations
│   └── minimal_evaluation/      # Results from minimal evaluations
├── scalexi_rag_bench/           # Core framework implementation
│   ├── config/                  # Configuration handling
│   ├── evaluators/              # Evaluation metrics and implementations
│   ├── models/                  # Model adapters for LLMs and embeddings
│   │   ├── embedding_adapters/  # Embedding model implementations
│   │   └── llm_adapters/        # LLM model implementations
│   ├── retrievers/              # Retrieval mechanisms
│   └── utils/                   # Utility functions
├── utils/                       # General utility scripts
├── vectorstore_tools/           # Tools for creating and managing vector stores
└── vectorstores/                # Storage for generated vector databases
    ├── english_txt/             # Vector stores for English content
    └── arabic_qa/               # Vector stores for Arabic content
```
The toolkit provides three main ways to run RAG evaluations:
For evaluating RAG systems using cloud-based models like OpenAI's GPT models:
```bash
./rag_tools/run_cloud_rag_evaluation.sh
```

For evaluating RAG systems using local models via Ollama:

```bash
./rag_tools/run_local_rag_evaluation.sh
```

For more control over the evaluation process:

```bash
python rag_tools/rag_evaluation_pipeline.py --config <config_file>
```

The toolkit supports two primary modes of evaluation:
Cloud-based evaluation:
- Uses cloud-based LLMs (OpenAI, Anthropic, etc.)
- Higher accuracy, but requires API keys and has associated costs
- Better suited for production-grade evaluations
- Provides advanced metrics, including cost tracking

Local evaluation (Ollama):
- Uses local models through Ollama
- Free to use and completely private
- Lower resource requirements, but may have reduced performance
- Ideal for development and testing
Configuration files in YAML format control the entire evaluation pipeline:
```yaml
dataset:
  format: json                               # Dataset format
  name: my_dataset                           # Dataset name
  path: data/my_dataset.json                 # Path to dataset file
description: My RAG Evaluation               # Description of the evaluation
evaluation_metrics:
  generation_metrics:                        # Which generation metrics to use
    - correctness
    - relevance
    - groundedness
  retrieval_metrics:                         # Which retrieval metrics to use
    - precision_at_k
    - recall_at_k
  system_metrics:                            # System performance metrics
    - latency
llm:
  model_name: gpt-4o-mini                    # LLM model to use
  provider: openai                           # Model provider (openai, anthropic, ollama)
  temperature: 0.0                           # Temperature for generation
retrieval:
  chunk_size: 1000                           # Size of text chunks
  chunk_overlap: 200                         # Overlap between chunks
  embedding_model: text-embedding-3-large    # Embedding model
  k: 4                                       # Number of documents to retrieve
vectorstore:
  output_dir: ./vectorstores/my_dataset      # Where to save the vector store
  source_path: data/source.txt               # Source documents to index
```
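The `retrieval_metrics` entries above (`precision_at_k`, `recall_at_k`) can be read as the usual top-k retrieval measures. As a rough, self-contained illustration of what they compute (a minimal sketch, not the framework's own evaluator code):

```python
from typing import List

def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)

def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in relevant if doc_id in top_k)
    return hits / len(relevant)

# Example: 4 retrieved chunks, 3 known-relevant documents
retrieved = ["document1.pdf", "document7.txt", "document2.txt", "document9.txt"]
relevant = ["document1.pdf", "document2.txt", "document3.txt"]
print(precision_at_k(retrieved, relevant, k=4))  # 2 of 4 retrieved are relevant -> 0.5
print(recall_at_k(retrieved, relevant, k=4))     # 2 of 3 relevant were retrieved -> ~0.67
```

Metrics like these need ground-truth relevance judgments, which is why the dataset format described later includes an optional `relevant_docs` field.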
The ScaleXI RAG Benchmark Framework is designed to provide a low-code approach to RAG evaluation with comprehensive insights:

- **Zero-Code Configuration**
  - Simply define your evaluation parameters in a YAML configuration file
  - No Python coding required - the framework handles all the implementation details
  - Modify parameters and re-run evaluations to quickly iterate and optimize
- **Preparing Your Knowledge Base**
  - Your source documents (specified in `source_path` in your config) are loaded
  - These can be text files, PDFs, or other supported formats in directories like `data/english/` or `data/arabic/`
  - Documents are split into smaller chunks (controlled by `chunk_size` and `chunk_overlap` in your config)
  - For example, with `chunk_size: 1000` and `chunk_overlap: 200`, a 3000-word document becomes ~4 overlapping chunks
- **Creating the Vector Database**
  - Each document chunk is converted into a numerical vector using the embedding model (like `text-embedding-3-large`)
  - These vectors capture the semantic meaning of your text
  - Vectors are stored in a searchable database (FAISS) at the location specified in `vectorstore.output_dir`
  - You can see examples of these in the `vectorstores/` directory
- **Retrieval Process**
  - When a query/question is processed, it is also converted to a vector using the same embedding model
  - The system finds the most similar document chunks by comparing vector similarity
  - The number of chunks retrieved is controlled by the `k` parameter in your config (e.g., `k: 4` retrieves the 4 most relevant chunks)
- **Generation with Context**
  - The retrieved document chunks are provided as context to the LLM specified in your config (e.g., `gpt-4o-mini`)
  - The LLM generates an answer based on this context and the question (a minimal retrieve-and-generate sketch follows this list)
  - The model, temperature, and other generation parameters are controlled through your config file
- **Evaluation and Reporting**
  - The system compares generated answers against reference answers from your dataset
  - Multiple metrics are calculated based on your configuration (correctness, relevance, etc.)
  - Results are saved to the directory structure in `results/`, organized by model and language
  - Interactive HTML reports make it easy to analyze performance
  - Optional seamless integration with LangSmith for deeper analytics
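To make the flow above concrete, here is a minimal LangChain-style sketch of the same chunk, embed, retrieve, and generate loop using the values from the example config (chunk size 1000, overlap 200, k=4, `text-embedding-3-large`, `gpt-4o-mini`). It is an illustration, not the framework's internal implementation; it assumes `langchain-openai`, `langchain-community`, and `faiss-cpu` are installed, `OPENAI_API_KEY` is set, and `data/source.txt` is a placeholder path.

```python
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load and chunk the knowledge base (chunk_size / chunk_overlap from the config)
docs = TextLoader("data/source.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# 2. Embed the chunks and build a FAISS vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Retrieve the k most similar chunks for a question
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
question = "What is the main purpose of X?"
context_docs = retriever.invoke(question)
context = "\n\n".join(d.page_content for d in context_docs)

# 4. Generate an answer grounded in the retrieved context
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
answer = llm.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```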
To adapt this to your own data:
- Place your source documents in a directory (similar to `data/english/` or `data/arabic/`)
- Create a test dataset with questions and expected answers (see format examples in existing datasets)
- Update your config file to point to your new data sources and desired output locations
- Run the evaluation using the appropriate script (`run_cloud_rag_evaluation.sh` or `run_local_rag_evaluation.sh`)
All file paths in your config are relative to the project root, making it easy to organize your custom evaluations.
Follow these steps to set up and run your own RAG evaluation with custom data:
- **Prepare Your Knowledge Base Documents**
  - Create a new folder for your documents in the `data/` directory:

    ```bash
    mkdir -p data/your_dataset_name
    ```

  - Add your source documents to this folder (supported formats: TXT, PDF, DOCX, CSV, JSON)
  - Example structure:

    ```
    data/
    ├── english/             # Existing datasets
    ├── arabic/              # Existing datasets
    └── your_dataset_name/   # Your new dataset
        ├── document1.pdf
        ├── document2.txt
        └── ...
    ```

- **Create Your Test Dataset**
  - Create a JSON file with questions and expected answers in the same directory
  - Save it as `data/your_dataset_name/questions.json`
  - Format (a small validation sketch follows these steps):

    ```json
    [
      {
        "question": "What is the main purpose of X?",
        "answer": "The main purpose of X is...",
        "relevant_docs": ["document1.pdf", "document2.txt"]
      },
      {
        "question": "When was Y established?",
        "answer": "Y was established in...",
        "relevant_docs": ["document3.txt"]
      }
    ]
    ```

  - The `relevant_docs` field is optional but helps evaluate retrieval performance
- **Create Configuration File**
  - Copy an existing config file as a starting point:

    ```bash
    cp config/llm_cloud_rag.yaml config/your_evaluation.yaml
    ```

  - Update the following key paths:

    ```yaml
    dataset:
      format: json
      name: your_dataset_name
      path: data/your_dataset_name/questions.json
    description: Your Custom RAG Evaluation

    # ... other settings ...

    retrieval:
      chunk_size: 1000                      # Adjust based on your document type
      chunk_overlap: 200
      embedding_model: text-embedding-3-large
      k: 4                                  # Number of chunks to retrieve

    vectorstore:
      output_dir: ./vectorstores/your_dataset_name
      source_path: data/your_dataset_name   # Points to your documents folder
    ```

- **Create Output Directories**
  - Ensure your results directory exists:

    ```bash
    mkdir -p results/your_evaluation_name
    ```

- **Run the Evaluation**
  - For cloud-based models (OpenAI, etc.):

    ```bash
    ./rag_tools/run_cloud_rag_evaluation.sh -f --config config/your_evaluation.yaml
    ```

  - For local models using Ollama:

    ```bash
    ./rag_tools/run_local_rag_evaluation.sh -f --config config/your_evaluation.yaml
    ```

  - The `-f` flag forces rebuilding of the vector store (needed the first time or when documents change)
- **Access Your Results**
  - Results will be stored in:

    ```
    results/your_evaluation_name/
    ├── metrics.json      # Overall metrics
    ├── results.json      # Detailed results for each question
    ├── evaluation.html   # Interactive HTML report
    └── raw/              # Raw outputs and intermediate data
    ```

  - Visualize results by opening `results/your_evaluation_name/evaluation.html` in a browser
- **Iterative Improvement**
  - Analyze results to identify patterns of success or failure
  - Adjust parameters in your config (chunk size, overlap, k, etc.)
  - Rerun the evaluation to compare performance
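As referenced in the test-dataset step above, here is a small standalone sketch that checks the structure of `questions.json` before running an evaluation. It is based on the format shown earlier and is not part of the framework itself.

```python
import json
from pathlib import Path

dataset_path = Path("data/your_dataset_name/questions.json")
items = json.loads(dataset_path.read_text(encoding="utf-8"))

for i, item in enumerate(items):
    # Every entry needs a question and a reference answer
    missing = [key for key in ("question", "answer") if not item.get(key)]
    if missing:
        raise ValueError(f"Entry {i} is missing required field(s): {missing}")
    # relevant_docs is optional, but if present it should be a list of filenames
    if "relevant_docs" in item and not isinstance(item["relevant_docs"], list):
        raise ValueError(f"Entry {i}: 'relevant_docs' should be a list of document names")

print(f"{len(items)} question/answer pairs look structurally valid.")
```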
The framework features seamless integration with LangChain's LangSmith for detailed evaluation tracking and analytics:
- **Zero-Configuration Integration**
  - Once LangSmith credentials are set up, all evaluations are automatically tracked
  - No additional coding or setup needed - the framework handles all integration points
  - Simply run your evaluations as normal and results flow to LangSmith in real time
- **Set Up LangSmith**
  - Sign up for LangSmith at smith.langchain.com
  - Set environment variables in your `.env` file or shell:

    ```bash
    export LANGCHAIN_API_KEY=your_api_key
    export LANGCHAIN_TRACING_V2=true
    export LANGSMITH_PROJECT=rag-evaluation
    ```

- **Comprehensive Analytics**
  - Each evaluation run creates a new experiment in LangSmith
  - Track detailed metrics on retrieval quality, generation accuracy, and cost
  - Visualize performance patterns across different models and configurations
  - Identify bottlenecks and optimization opportunities
- **Compare Evaluations**
  - Easily compare different RAG configurations side by side
  - Analyze how changes to embeddings, chunk sizes, or retrieval settings impact performance
  - Make data-driven decisions about your RAG system architecture
The integration provides enterprise-grade analytics capabilities without requiring any additional development work, allowing you to focus on optimizing your RAG systems rather than building evaluation infrastructure.
The toolkit supports a wide range of components that can be configured in your evaluation:
Configure these in the `llm` section of your config:

```yaml
llm:
  model_name: "gpt-4o-mini"   # The model name
  provider: "openai"          # The provider name
  temperature: 0.0            # Temperature for generation
```

Supported providers:
- **OpenAI** (`provider: "openai"`)
  - Models: `gpt-4o`, `gpt-4o-mini`, `gpt-4`, `gpt-3.5-turbo`, etc.
  - Requires `OPENAI_API_KEY` in environment or config
- **Cohere** (`provider: "cohere"`)
  - Models: `command`, `command-light`, etc.
  - Requires `COHERE_API_KEY` in environment or config
- **Ollama** (`provider: "ollama"`)
  - Models: `llama3`, `gemma`, `mistral`, etc.
  - Default base URL: `http://localhost:11434`
  - Perfect for local evaluation without API costs
- **HuggingFace** (`provider: "huggingface"`)
  - Models: specify any HuggingFace model compatible with AutoModelForCausalLM
  - Examples: `meta-llama/Llama-3-8b-chat-hf`, `google/gemma-7b`, etc.
  - Requires appropriate compute resources
Configure these in the `retrieval` section of your config:

```yaml
retrieval:
  embedding_model: "text-embedding-3-large"   # The embedding model to use
```

Supported embedding models:

- OpenAI (specify `text-embedding-3-large`, `text-embedding-3-small`, or `text-embedding-ada-002`)
- Cohere (specify any model with `cohere` in the name)
- BGE models (specify any model with `bge` in the name)
- HuggingFace (any other model name is treated as a HuggingFace model)
- Multilingual E5 (specify models with `multilingual-e5` in the name for multilingual support)
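In other words, the embedding backend is selected from the model name alone. The sketch below shows roughly how such name-based dispatch can work; it is illustrative only (the Cohere branch is omitted, and `load_embeddings` is a hypothetical helper, not the framework's actual factory), and it assumes the `langchain-openai` and `langchain-huggingface` packages plus `sentence-transformers` are installed.

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings

def load_embeddings(model_name: str):
    """Pick an embedding backend from the model name, mirroring the rules above."""
    if model_name.startswith("text-embedding-"):
        # OpenAI embedding models, e.g. text-embedding-3-large
        return OpenAIEmbeddings(model=model_name)
    # bge, multilingual-e5, and any other name fall back to a HuggingFace model
    return HuggingFaceEmbeddings(model_name=model_name)

embeddings = load_embeddings("BAAI/bge-small-en-v1.5")
print(len(embeddings.embed_query("hello world")))  # dimensionality of the embedding vector
```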
Configure the retriever type in the `retrieval` section:

```yaml
retrieval:
  retriever_type: "vector"    # The type of retriever to use
  search_type: "similarity"   # Search type for vector store
  k: 4                        # Number of documents to retrieve
```

Supported retriever types:

- Vector (`retriever_type: "vector"`): standard vector similarity search using FAISS
- Chroma (`retriever_type: "chroma"`): ChromaDB-based retriever for persistent vector storage
- BM25 (`retriever_type: "bm25"`): keyword-based retrieval using the BM25 algorithm
- Hybrid (`retriever_type: "hybrid"`): combines vector and BM25 retrieval (0.5 weight each); see the sketch below
Vector stores are automatically created based on your configuration:
```yaml
vectorstore:
  output_dir: "./vectorstores/your_dataset"   # Where to save the vector store
  source_path: "data/your_dataset"            # Documents to index
```

The current implementation supports:

- FAISS: default for vector retrievers, efficient for similarity search
- Chroma: used when `retriever_type: "chroma"`, supports persistence and metadata filtering
For vector retrievers, you can specify different search algorithms:
```yaml
retrieval:
  search_type: "similarity"   # The search type to use
```

Supported search types:

- Similarity (`search_type: "similarity"`): standard cosine similarity search
- MMR (`search_type: "mmr"`): Maximum Marginal Relevance for diversity in results
The toolkit includes several additional scripts:
- `vectorstore_tools/create_vectorstore.sh`: create vector stores separately
- `update_repo.sh`: update the repository to the latest version
- `utils/`: various utilities for working with the framework
For advanced usage scenarios, refer to the example scripts in the `examples/` directory:
- Simple RAG Evaluation: Basic evaluation workflow
- LangSmith Integration: Detailed evaluation with LangSmith
- Custom Evaluators: Creating your own evaluation metrics
The toolkit requires Python 3.8+ and the packages listed in requirements.txt:
- langchain
- langchain-openai
- langchain-community
- langchain-chroma
- langchain-core
- langgraph
- langsmith
- openai
- chromadb
- sentence-transformers
- and more
Install dependencies with:
```bash
pip install -r requirements.txt
```

For local evaluations, Ollama must be installed and running.
For local evaluations, the toolkit uses Ollama to run models on your machine. This section provides guidance on setting up Gemma 3 and other models locally.
- Download and install Ollama for your platform from ollama.ai
- Verify the installation by running in your terminal:

  ```bash
  ollama --version
  ```
Gemma 3 models are Google's latest open language models that offer excellent performance for local use:
- Pull the Gemma 3 model you want to use:

  ```bash
  # For the 8B model
  ollama pull gemma3:8b

  # For the instruction-tuned 8B model (recommended for RAG)
  ollama pull gemma3:8b-instruct

  # For the smaller 2B model
  ollama pull gemma3:2b

  # For the instruction-tuned 2B model
  ollama pull gemma3:2b-instruct
  ```

- Verify the model is installed:

  ```bash
  ollama list
  ```

- Update your configuration file to use Gemma 3:

  ```yaml
  llm:
    model_name: "gemma3:8b-instruct"   # Choose the appropriate model
    provider: "ollama"
    temperature: 0.1
  ```
Other high-performing models for local RAG evaluation:
- **Llama 3**: Meta's latest open models

  ```bash
  ollama pull llama3:8b
  ollama pull llama3:8b-instruct
  ```

- **Mistral**: excellent performance-to-size ratio

  ```bash
  ollama pull mistral:7b
  ollama pull mistral:7b-instruct
  ```

- **Neural Chat**: optimized for conversational use

  ```bash
  ollama pull neural-chat:7b
  ```
You can customize model parameters by specifying them in your config:
```yaml
llm:
  model_name: "gemma3:8b-instruct"
  provider: "ollama"
  temperature: 0.1
  base_url: "http://localhost:11434"   # Default Ollama API endpoint
```

- Memory Issues: For large models like Gemma 3 8B, ensure your system has at least 16GB RAM
- Slow First Run: The first run will be slower as the model loads into memory
- CUDA Support: For GPU acceleration, ensure you have CUDA installed if using NVIDIA GPUs
- Server Connection: Verify Ollama is running with `ollama serve` if you encounter connection issues
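To confirm that the local endpoint in the config above is reachable before launching a full evaluation, a minimal sketch using LangChain's Ollama chat wrapper can help. This assumes the `langchain-ollama` package is installed (it is not in the listed requirements) and simply reuses the model tag and base URL from the config above.

```python
from langchain_ollama import ChatOllama

# Matches the llm section of the config above; base_url is the default Ollama endpoint
llm = ChatOllama(
    model="gemma3:8b-instruct",
    base_url="http://localhost:11434",
    temperature=0.1,
)

# A one-off call to confirm the local server is up and the model has been pulled
reply = llm.invoke("Reply with the single word: ready")
print(reply.content)
```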
Create a `.env` file in the project root with your API keys:

```bash
OPENAI_API_KEY=your_openai_api_key
LANGCHAIN_API_KEY=your_langchain_api_key
LANGSMITH_API_KEY=your_langsmith_api_key
```
Contributions are welcome! Please feel free to submit a Pull Request.
This framework is developed and maintained by ScaleXI Innovation, specialists in Generative AI and Large Language Model solutions.
ScaleXI Innovation specializes in:
- Generative AI for Business Automation
- AI-Driven Digital Transformation
- Generative AI Consultation
- Enterprise-Grade LLM Solutions
For more information about our services, visit our website at https://scalexi.ai or contact us at [email protected].
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material
Under the following terms:
- Attribution — You must give appropriate credit to ScaleXI Innovation, provide a link to the license, and indicate if changes were made.
- NonCommercial — You may not use the material for commercial purposes without explicit permission from ScaleXI Innovation.
