
# Mini RAG Web QA Service

A simple Retrieval-Augmented Generation (RAG) service that crawls a website, indexes its content, and answers questions only from what it has collected — always with source citations.


## Overview

The project shows a minimal end-to-end RAG workflow:

  1. Crawl a website within its domain.
  2. Extract and clean text from HTML pages.
  3. Chunk and embed the text.
  4. Store embeddings in a FAISS index.
  5. Retrieve and answer questions using Flan-T5, citing sources.
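Step 2 can be sketched with a couple of regular expressions, matching the strip-scripts-then-tags approach named in the Architecture table below (the repo's exact patterns may differ):

```python
import re

def clean_html(html: str) -> str:
    """Strip script/style blocks, then all remaining tags, then collapse whitespace."""
    # Remove <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    # Drop every other tag but keep the text between them.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```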

## Setup

### 1. Install dependencies

```bash
pip install fastapi uvicorn beautifulsoup4 tldextract sentence-transformers transformers faiss-cpu requests numpy
```

### 2. Run the API

```bash
uvicorn rag_service:app --reload --port 8000
```


## API Endpoints

### /crawl

Crawls a given website.

Example request

```json
{"start_url": "https://example.com", "max_pages": 30, "crawl_delay_ms": 500}
```

Response

```json
{"page_count": 27, "urls": ["https://example.com", "https://example.com/about"]}
```
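Behind this endpoint the crawl is essentially a breadth-first walk with a visited set and a page cap. A minimal sketch, with fetching and link extraction injected as a callable so the loop itself stays testable (`fetch` and the exact queue discipline are assumptions, not the repo's code):

```python
from collections import deque
from typing import Callable

def crawl(start_url: str, max_pages: int,
          fetch: Callable[[str], list[str]]) -> list[str]:
    """Breadth-first crawl: fetch(url) returns the in-domain links found on that page."""
    seen = {start_url}
    queue = deque([start_url])
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch(url):
            if link not in seen:       # dedupe before enqueueing
                seen.add(link)
                queue.append(link)
    return visited
```

In the real service, `fetch` would call `requests.get`, parse links with BeautifulSoup, keep only same-domain URLs, and sleep for `crawl_delay_ms` between requests.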


### /index

Embeds and stores crawled text.

Request

```json
{"chunk_size": 800, "chunk_overlap": 100}
```

Response

```json
{"vector_count": 142, "errors": []}
```
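The `chunk_size` and `chunk_overlap` parameters map naturally onto a sliding-window splitter. A character-based sketch (the repo may instead split on tokens or sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters overlapping by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```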


### /ask

Answers questions based only on crawled pages.

Request

```json
{"question": "Who founded the company?", "top_k": 3}
```

Response

```json
{
  "answer": "The company was founded by Jane Doe in 2012.",
  "sources": [
    {"url": "https://example.com/about", "snippet": "Founded by Jane Doe..."}
  ]
}
```

If the answer isn’t found:

```json
{"answer": "Not enough information found in crawled content."}
```
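The refusal path is driven by retrieval confidence: when even the best-matching chunk scores below a threshold, the service returns the fallback answer instead of calling the generator. A sketch using cosine similarity over NumPy vectors (the 0.3 threshold is an illustrative assumption, not the repo's tuned value):

```python
import numpy as np

FALLBACK = "Not enough information found in crawled content."

def answer_or_refuse(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                     chunks: list[str], threshold: float = 0.3) -> str:
    """Return the best-matching chunk's text, or the fallback when similarity is weak."""
    # Cosine similarity between the query and every chunk embedding.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return FALLBACK
    return chunks[best]  # in the real service this becomes context for Flan-T5
```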


## Architecture

| Step | Component | Purpose |
|------|-----------|---------|
| Crawl | requests, BeautifulSoup | Collect HTML pages |
| Clean | Regex | Remove scripts and tags |
| Embed | all-MiniLM-L6-v2 | Create text embeddings |
| Index | FAISS | Similarity search |
| Generate | flan-t5-base | Produce grounded answers |
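The Embed → Index → Retrieve path reduces to normalized embeddings plus inner-product search. FAISS's `IndexFlatIP` performs exactly this at scale; a NumPy stand-in shows the same math without the model or index dependencies (the toy vectors below are hand-made, not MiniLM output):

```python
import numpy as np

def build_index(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product equals cosine similarity (as IndexFlatIP expects)."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(index: np.ndarray, query: np.ndarray, top_k: int = 3) -> list[int]:
    """Return the indices of the top_k most similar rows, best first."""
    q = query / np.linalg.norm(query)
    scores = index @ q
    return list(np.argsort(-scores)[:top_k])
```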

## Notes and Design Choices

  • Crawls only within the same domain.
  • Simple HTML-only crawler (no JavaScript rendering).
  • Refuses to answer when context is weak.
  • Keeps crawl polite with a delay.
  • Prioritizes correctness and traceability over speed.
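The same-domain rule can be enforced before a link is enqueued. The repo pulls in tldextract for registered-domain matching; a simplified stdlib sketch that compares hostnames instead (so subdomains count as out of scope, unlike tldextract's behavior):

```python
from urllib.parse import urljoin, urlparse

def in_scope(link: str, base_url: str) -> bool:
    """True when link (possibly relative) resolves to the same host as base_url."""
    resolved = urljoin(base_url, link)  # handles relative links like "/about"
    return urlparse(resolved).netloc == urlparse(base_url).netloc
```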

## Logging

Each query logs timings and retrieval stats to `store/metrics.log`. Example metrics: retrieval time, generation time, total latency.
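Per-query metrics can be appended as one line per request. A sketch of the append pattern, assuming a simple key=value line format (the actual layout of `store/metrics.log` may differ):

```python
import time
from pathlib import Path

def log_metrics(path: Path, retrieval_ms: float, generation_ms: float) -> str:
    """Append one metrics line and return it; total latency is derived from the parts."""
    line = (f"ts={time.time():.0f} retrieval_ms={retrieval_ms:.1f} "
            f"generation_ms={generation_ms:.1f} "
            f"total_ms={retrieval_ms + generation_ms:.1f}")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(line + "\n")
    return line
```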


## Example Behavior

Answerable:

`POST /ask {"question": "Who founded ExampleCorp?"}` → `"Jane Doe in 2012"`

Unanswerable:

`POST /ask {"question": "Where is Google located?"}` → `"Not enough information found in crawled content."`


## Future Improvements

  • Add quantitative retrieval evaluation.
  • Store metadata in SQLite.
  • Improve text chunking based on HTML structure.
  • Optional command-line mode.

**Author:** Aditya Ayushman · **License:** MIT · **Version:** 1.0
