A simple Retrieval-Augmented Generation (RAG) service that crawls a website, indexes its content, and answers questions only from what it has collected, always with source citations.

The project shows a minimal end-to-end RAG workflow:
- Crawl a website within its domain.
- Extract and clean text from HTML pages.
- Chunk and embed the text.
- Store embeddings in a FAISS index.
- Retrieve and answer questions using Flan-T5, citing sources.

Install the dependencies:

```bash
pip install fastapi uvicorn beautifulsoup4 tldextract sentence-transformers transformers faiss-cpu requests numpy
```
Run the service:

```bash
uvicorn rag_service:app --reload --port 8000
```

Crawls a given website.
Example request

```json
{"start_url": "https://example.com", "max_pages": 30, "crawl_delay_ms": 500}
```

Response

```json
{"page_count": 27, "urls": ["https://example.com", "https://example.com/about"]}
```

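
The same-domain restriction on the crawl can be sketched roughly as follows. This is a minimal illustration, not the service's actual crawler: it uses only the standard library (`html.parser`, `urllib.parse`) instead of BeautifulSoup and tldextract, it compares hostnames exactly rather than registered domains, and the names `LinkParser` and `same_domain_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return absolute, de-duplicated links that share page_url's hostname."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc.lower()
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc.lower() == host and absolute not in links:
            links.append(absolute)
    return links
```

A crawl loop would presumably feed these links into a queue, stop once `max_pages` pages have been fetched, and sleep for `crawl_delay_ms` between requests.
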
Embeds and stores crawled text.
Request

```json
{"chunk_size": 800, "chunk_overlap": 100}
```

Response

```json
{"vector_count": 142, "errors": []}
```

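
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a small sliding-window sketch. It assumes both values are measured in characters, which the request format does not confirm, and `chunk_text` is a hypothetical helper rather than the service's own function.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=100):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so each new
    window starts (chunk_size - chunk_overlap) characters after the last.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text,
        # so no chunk consists only of already-covered overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the request values above, a 2000-character page would yield three chunks starting at offsets 0, 700, and 1400.
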
Answers questions based only on crawled pages.
Request

```json
{"question": "Who founded the company?", "top_k": 3}
```

Response

```json
{
  "answer": "The company was founded by Jane Doe in 2012.",
  "sources": [{"url": "https://example.com/about", "snippet": "Founded by Jane Doe..."}]
}
```

If the answer isn’t found:

```json
{"answer": "Not enough information found in crawled content."}
```

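
One way to get this answer-or-refuse behaviour is to drop retrieved chunks whose similarity falls below a cutoff and refuse when none remain. The sketch below is an assumption about how that could work, not the service's code: it uses pure-Python cosine similarity in place of FAISS, takes the generator as a plain callable in place of Flan-T5, and the `min_score` value and all function names are hypothetical.

```python
import math

NO_ANSWER = "Not enough information found in crawled content."


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, top_k=3, min_score=0.3):
    """Return up to top_k chunks whose similarity is at least min_score.

    Each chunk is a dict with "url", "text", and "vector" keys.
    """
    scored = [(cosine(query_vec, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score >= min_score]


def answer(query_vec, chunks, generate, top_k=3, min_score=0.3):
    """Generate an answer from retrieved context, or refuse if there is none."""
    hits = retrieve(query_vec, chunks, top_k=top_k, min_score=min_score)
    if not hits:
        return {"answer": NO_ANSWER}
    context = "\n".join(h["text"] for h in hits)
    return {
        "answer": generate(context),
        "sources": [{"url": h["url"], "snippet": h["text"][:80]} for h in hits],
    }
```

Refusing before generation means the model is never called without supporting context, so every returned answer has at least one source.
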
| Step | Component | Purpose |
|---|---|---|
| Crawl | requests, BeautifulSoup | Collect HTML pages |
| Clean | Regex | Remove scripts, tags |
| Embed | all-MiniLM-L6-v2 | Create text embeddings |
| Index | FAISS | Similarity search |
| Generate | flan-t5-base | Produce grounded answers |

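
The regex-based Clean step in the table might look something like the sketch below. The exact patterns the service uses are not shown in this README, so these expressions and the `clean_html` name are assumptions; regexes are also only a rough approximation of HTML parsing and can misfire on malformed markup.

```python
import html
import re

# Remove <script>...</script> and <style>...</style> blocks with their contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Remove HTML comments.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Remove any remaining tags, keeping the text between them.
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def clean_html(raw):
    """Strip scripts, styles, comments, and tags; return plain text."""
    text = _SCRIPT_STYLE.sub(" ", raw)
    text = _COMMENT.sub(" ", text)
    text = _TAG.sub(" ", text)
    text = html.unescape(text)  # decode entities such as &amp;
    return _WHITESPACE.sub(" ", text).strip()
```
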
- Crawls only within the same domain.
- Simple HTML-only crawler (no JavaScript rendering).
- Refuses to answer when context is weak.
- Keeps crawl polite with a delay.
- Prioritizes correctness and traceability over speed.
Each query logs timings and retrieval stats to `store/metrics.log`.
Example metrics: retrieval time, generation time, total latency.
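
As a rough sketch of how such per-query metrics could be captured, the helper below times the retrieval and generation steps and appends one JSON line per query. The log format and field names are assumptions (the README does not specify them), and `log_query_metrics`, `retrieve`, and `generate` are hypothetical stand-ins for the service's own functions.

```python
import json
import time
from pathlib import Path


def log_query_metrics(log_path, question, retrieve, generate):
    """Run retrieve then generate, timing each, and append a JSON record."""
    t0 = time.perf_counter()
    hits = retrieve(question)
    t1 = time.perf_counter()
    answer = generate(question, hits)
    t2 = time.perf_counter()

    record = {
        "question": question,
        "retrieved": len(hits),
        "retrieval_ms": round((t1 - t0) * 1000, 2),
        "generation_ms": round((t2 - t1) * 1000, 2),
        "total_ms": round((t2 - t0) * 1000, 2),
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create store/ if missing
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return answer, record
```

Writing one JSON object per line keeps the log easy to append to and to parse later for latency summaries.
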
Answerable:

`POST /ask { "question": "Who founded ExampleCorp?" }` → "Jane Doe in 2012"

Unanswerable:

`POST /ask { "question": "Where is Google located?" }` → "Not enough information found in crawled content."

- Add quantitative retrieval evaluation.
- Store metadata in SQLite.
- Improve text chunking based on HTML structure.
- Add an optional command-line mode.

Author: Aditya Ayushman

License: MIT

Version: 1.0