A Retrieval-Augmented Generation (RAG) system for question answering and document search over Suprema Biostar2 documentation and related biometric security resources.
- Document Parsing & Cleaning:
- Automated PDF parsing and cleaning using LlamaParse.
- Cleaned data stored in `cleaned_data/` for efficient downstream processing.
- Vector Database:
- Embedding generation with HuggingFace models (configurable).
- Vector storage and retrieval using ChromaDB.
- RAG Pipeline:
- Modular pipeline for document retrieval, question answering, and answer grading.
- Supports both local and cloud LLMs (e.g., Gemini, Llama, OpenAI, HuggingFace).
- API & Frontend:
- FastAPI backend with a `/query` endpoint for chat and search.
- Streamlit-based frontend for an interactive chatbot experience.
- Evaluation:
- Integrated evaluation scripts and metrics for LLM output quality.
- Logging & Observability:
- Langfuse integration for tracing and monitoring.
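Conceptually, the retrieve-answer-grade flow reduces to: embed the question, rank stored chunks by similarity, and hand the top matches to an LLM. The sketch below illustrates that shape only; a toy bag-of-words embedding stands in for the HuggingFace model, an in-memory list stands in for ChromaDB, and all function names are illustrative rather than the project's actual API.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real pipeline uses a
    # HuggingFace embedding model with vectors stored in ChromaDB.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def answer(query: str, docs: list[str]) -> str:
    # Stand-in for the LLM call: the real pipeline sends the retrieved
    # context plus the question to Gemini/Llama/OpenAI and grades the answer.
    context = retrieve(query, docs)
    return f"Answered '{query}' using {len(context)} retrieved passage(s)."

docs = [
    "BioStar 2 supports fingerprint and face authentication.",
    "The server requires TLS for device communication.",
    "Access groups combine users, doors, and schedules.",
]
print(answer("Which authentication modes does BioStar 2 support?", docs))
```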
```
biometric-rag-agent/
├── cleaned_data/      # Cleaned text files from documentation
├── chroma_vector_db/  # Chroma vector database files
├── data/              # Raw data (PDFs, etc.)
├── diagrams/          # System diagrams and images
├── evaluation/        # Evaluation scripts and metrics
├── frontend/          # Streamlit app and UI components
├── notebooks/         # Jupyter notebooks for prototyping
├── src/               # Main source code (API, agent, data, vector_db, utils)
├── requirements.txt   # Python dependencies
├── Makefile           # Common commands
├── README.md          # Project documentation
└── ...
```
- Install dependencies:
uv pip install -r requirements.txt  # or, for full project management: uv pip install -r pyproject.toml
- Set environment variables:
- Copy `.env.example` to `.env` and fill in the required API keys (Google, Langfuse, etc.).
- Prepare data:
- Place raw PDFs in the `data/` directory.
- Run the data cleaner:
python -m src.data_cleaner
- Build vector indexes:
python -m src.index_builder
- Start the API server:
python -m src.main
- Run the frontend:
streamlit run frontend/app.py
You can use the provided Makefile to run the full application pipeline. The recommended steps are:
- Clean the data:
make clean-data
- Build vector indexes:
make index
- Run the backend API server:
make backend
- Run the frontend:
make frontend
You can also chain these commands as needed for your workflow.
- Access the chatbot UI via the Streamlit app.
- Use the `/query` endpoint for programmatic access.
- Evaluate model performance using the scripts in the `evaluation/` directory.
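For programmatic access, a minimal client might look like the sketch below. The `/query` path comes from this README, but the default port, the payload shape (`{"query": ...}`), and the JSON response format are assumptions to verify against the actual FastAPI schema in `src/`.

```python
import json
from urllib import request

def ask(question: str, base_url: str = "http://localhost:8000") -> dict:
    # Payload shape and default port are assumptions; check the
    # FastAPI route definition in src/ for the real request schema.
    payload = json.dumps({"query": question}).encode()
    req = request.Request(
        f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the backend running):
#   ask("How do I enroll a fingerprint in BioStar 2?")
```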
- `GOOGLE_API_KEY` - Google Gemini API key
- `DB_URL` - Database connection string
- `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST_URL` - Langfuse observability
- See `.env.example` for all options
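A filled-in `.env` might look like the fragment below. All values are placeholders: the `DB_URL` connection-string format is an assumption, and the Langfuse host shown is the hosted-cloud endpoint (adjust for a self-hosted instance).

```
GOOGLE_API_KEY=your-gemini-api-key
DB_URL=postgresql://user:password@localhost:5432/rag_db
LANGFUSE_PUBLIC_KEY=pk-lf-your-public-key
LANGFUSE_SECRET_KEY=sk-lf-your-secret-key
LANGFUSE_HOST_URL=https://cloud.langfuse.com
```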
Pull requests and issues are welcome! Please ensure code is well-documented and tested.
This project includes code and documentation under various open-source licenses. See the `cleaned_data/` directory for license details from upstream documentation sources.
For more information, see the system diagrams in `diagrams/` and the notebooks in `notebooks/`.