Mini RAG Web QA Service

A simple Retrieval-Augmented Generation (RAG) service that crawls a website, indexes its content, and answers questions using only what it has collected, always with source citations.

Overview

The project shows a minimal end-to-end RAG workflow:

1. Crawl a website within its domain.
2. Extract and clean text from HTML pages.
3. Chunk and embed the text.
4. Store embeddings in a FAISS index.
5. Retrieve and answer questions using Flan-T5, citing sources.

Setup

1. Install dependencies

```bash
pip install fastapi uvicorn beautifulsoup4 tldextract sentence-transformers transformers faiss-cpu requests numpy
```

2. Run the API

```bash
uvicorn rag_service:app --reload --port 8000
```

API Endpoints

`/crawl`

Crawls a given website.

Example request

```json
{"start_url": "https://example.com", "max_pages": 30, "crawl_delay_ms": 500}
```

Response

```json
{"page_count": 27, "urls": ["https://example.com", "https://example.com/about"]}
```
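
A page-limited, same-domain crawl like the one `/crawl` describes can be sketched as a breadth-first traversal. This is only an illustration, not the service's code: `crawl`, `same_domain`, and the injected `fetch_links` callable are hypothetical names, and comparing hostnames with `urllib.parse` is a simplification (the dependency list suggests the service may rely on `tldextract` for domain matching).

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain(url, start_url):
    """True when both URLs share a hostname (a simplified domain check)."""
    return urlparse(url).hostname == urlparse(start_url).hostname


def crawl(start_url, fetch_links, max_pages=30, crawl_delay_ms=500):
    """Breadth-first crawl that never leaves the start URL's host.

    fetch_links is a hypothetical callable mapping a URL to the list of
    href strings found on that page; a real crawler would download and
    parse the HTML here.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments to avoid revisits.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and same_domain(link, start_url):
                seen.add(link)
                queue.append(link)
        time.sleep(crawl_delay_ms / 1000)  # polite pause between requests
    return {"page_count": len(visited), "urls": visited}


# Usage with an in-memory stand-in for a website (no network access).
site = {
    "https://example.com": ["/about", "https://other.org/x", "/about#team"],
    "https://example.com/about": ["/contact"],
    "https://example.com/contact": [],
}
result = crawl("https://example.com", lambda u: site.get(u, []), crawl_delay_ms=0)
```

Injecting `fetch_links` keeps the traversal logic testable without touching the network.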

`/index`

Embeds and stores crawled text.

Request

```json
{"chunk_size": 800, "chunk_overlap": 100}
```

Response

```json
{"vector_count": 142, "errors": []}
```
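
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It assumes both values are measured in characters, which the README does not state, and `chunk_text` is a hypothetical helper rather than the service's own function.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=100):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at one boundary still appears whole in a neighbouring chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults above, a 2,000-character page yields three chunks starting at offsets 0, 700, and 1400.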

`/ask`

Answers questions based only on crawled pages.

Request

```json
{"question": "Who founded the company?", "top_k": 3}
```

Response

```json
{
  "answer": "The company was founded by Jane Doe in 2012.",
  "sources": [{"url": "https://example.com/about", "snippet": "Founded by Jane Doe..."}]
}
```

If the answer isn’t found:

```json
{"answer": "Not enough information found in crawled content."}
```
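
One way to produce the two response shapes above is to gate generation on retrieval scores. The sketch below is an assumption about how such a refusal could work, not the service's actual logic: `answer_or_refuse`, the `min_score` threshold of 0.35, the 200-character snippet length, and the `generate` callable are all hypothetical.

```python
REFUSAL = "Not enough information found in crawled content."


def answer_or_refuse(question, hits, generate, min_score=0.35):
    """Answer from retrieved chunks, or refuse when none is similar enough.

    hits: list of dicts with 'score', 'url', and 'text' keys.
    generate: hypothetical callable (question, context) -> answer string,
              standing in for the language model call.
    """
    strong = [h for h in hits if h["score"] >= min_score]
    if not strong:
        # Weak context: return the refusal without calling the model.
        return {"answer": REFUSAL}

    context = "\n\n".join(h["text"] for h in strong)
    answer = generate(question, context)
    sources = [{"url": h["url"], "snippet": h["text"][:200]} for h in strong]
    return {"answer": answer, "sources": sources}
```

Refusing before generation means the model is never asked to answer from context that retrieval already judged too weak.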

Architecture

| Step | Component | Purpose |
| --- | --- | --- |
| Crawl | `requests`, `BeautifulSoup` | Collect HTML pages |
| Clean | Regex | Remove scripts, tags |
| Embed | `all-MiniLM-L6-v2` | Create text embeddings |
| Index | FAISS | Similarity search |
| Generate | `flan-t5-base` | Produce grounded answers |
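
As an illustration of the regex-based Clean step, the sketch below strips scripts, styles, comments, and remaining tags using only the standard library. The patterns and the `clean_html` name are assumptions, not the service's exact rules, and regex cleaning is a rough heuristic that can mishandle malformed HTML.

```python
import html
import re

# Remove <script>/<style> elements together with their contents.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def clean_html(raw):
    """Return the visible text of an HTML string as one normalized line."""
    text = _SCRIPT_STYLE.sub(" ", raw)
    text = _COMMENT.sub(" ", text)
    text = _TAG.sub(" ", text)           # drop every remaining tag
    text = html.unescape(text)           # &amp; -> &, &lt; -> <, ...
    return _WHITESPACE.sub(" ", text).strip()
```

Scripts and styles are removed before the generic tag pattern runs, so their contents never leak into the indexed text.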

Notes and Design Choices

- Crawls only within the same domain.
- Simple HTML-only crawler (no JavaScript rendering).
- Refuses to answer when context is weak.
- Keeps crawl polite with a delay.
- Prioritizes correctness and traceability over speed.

Logging

Each query logs timings and retrieval stats to `store/metrics.log`. Logged metrics include retrieval time, generation time, and total latency.
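
A per-query metrics record could be captured along the following lines. The JSON-lines format, the field names, and the `log_query_metrics`, `retrieve`, and `generate` names are assumptions made for illustration; the README only states which timings are logged and where.

```python
import json
import time
from pathlib import Path


def log_query_metrics(log_path, question, retrieve, generate):
    """Time retrieval and generation, then append one JSON line to log_path.

    retrieve: hypothetical callable question -> list of retrieved chunks.
    generate: hypothetical callable (question, chunks) -> answer string.
    """
    t0 = time.perf_counter()
    hits = retrieve(question)
    t1 = time.perf_counter()
    answer = generate(question, hits)
    t2 = time.perf_counter()

    record = {
        "question": question,
        "retrieved": len(hits),
        "retrieval_ms": round((t1 - t0) * 1000, 2),
        "generation_ms": round((t2 - t1) * 1000, 2),
        "total_ms": round((t2 - t0) * 1000, 2),
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create store/ if missing
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return answer, record
```

A call such as `log_query_metrics("store/metrics.log", question, retrieve, generate)` would append one line per query, which keeps the log easy to parse later.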

Example Behavior

Answerable:

POST /ask { "question": "Who founded ExampleCorp?" } → "Jane Doe in 2012"

Unanswerable:

POST /ask { "question": "Where is Google located?" } → "Not enough information found in crawled content."

Future Improvements

- Add quantitative retrieval evaluation.
- Store metadata in SQLite.
- Improve text chunking based on HTML structure.
- Add an optional command-line mode.

Author: Aditya Ayushman

License: MIT

Version: 1.0
