A web scraping pipeline that extracts content from websites, processes it for LLM consumption, generates embeddings using intfloat/multilingual-e5-large-instruct, and stores vectors in pgvector.
- Configurable web scraping with depth and page limits
- Sitemap.xml support for efficient URL discovery
- HTML to Markdown transformation
- Multiple chunking strategies:
  - Paragraph-aware chunking (default): Preserves paragraph boundaries
  - First section chunking: Creates chunks from just the first section of content
  - Hierarchical chunking: Merges H1/H2/H3 sections for larger, more meaningful chunks
  - Sentence-aware chunking: Splits content at sentence boundaries
- Token-aware chunking with intelligent overlap handling (250 tokens with 10 token overlap by default)
- Vector embeddings generation using intfloat/multilingual-e5-large-instruct
- PostgreSQL/pgvector storage with langchain_postgres and psycopg3
- Batch processing of embeddings for memory efficiency with large datasets
- Automatic database connection handling with reconnection for long-running jobs

Requirements:

- Python 3.8+
- PostgreSQL with pgvector extension
- intfloat/multilingual-e5-large-instruct model

Installation:

- Install PostgreSQL and enable the pgvector extension:

  ```bash
  # On Ubuntu/Debian
  sudo apt install postgresql postgresql-contrib
  git clone https://github.com/pgvector/pgvector.git
  cd pgvector
  make
  sudo make install

  # On macOS with Homebrew
  brew install postgresql
  brew install pgvector

  # Don't forget to enable the extension in your database:
  # CREATE EXTENSION vector;
  ```
- Create a virtual environment:

  ```bash
  python -m venv venv
  ```
- Activate the virtual environment:

  ```bash
  # On Linux/Mac
  source venv/bin/activate

  # On Windows
  venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  This will install all required packages, including `markdownify` for HTML to Markdown conversion and `transformers` for token-aware chunking.
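
As a quick sanity check of the HTML to Markdown step, `markdownify` can be used on its own. The snippet below is only an illustration (the sample HTML is arbitrary), not the pipeline's internal conversion code:

```python
from markdownify import markdownify as md

# Arbitrary sample HTML to demonstrate the conversion.
html = "<h1>Docs</h1><p>Hello <strong>world</strong>.</p><ul><li>First</li><li>Second</li></ul>"

# Convert HTML to Markdown; "atx" produces '#'-style headings.
print(md(html, heading_style="atx"))
```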
Set up your database connection and other settings using environment variables:
```bash
export DATABASE_URL="postgresql+psycopg://user:password@localhost:5432/rag_db"
export MAX_DEPTH=2
export PAGE_LIMIT=10
export CHUNK_SIZE=500
export CHUNK_OVERLAP=50
export BATCH_SIZE=100
```

Or create a `.env` file with these variables.
Run the scraper from the command line:

```bash
# Default paragraph-aware chunking
python scraper.py --url "https://example.com" --depth 2 --page-limit 10
# First section chunking (reduced number of chunks)
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --chunking-strategy first_section
# Hierarchical chunking (merged sections)
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --chunking-strategy hierarchical
# Sentence-aware chunking
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --chunking-strategy sentence
# Using sitemap.xml for URL discovery (automatically tries common sitemap locations)
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --use-sitemap
# Custom collection name in vector database
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --collection-name my_custom_collection
# Custom batch size for processing embeddings (default: 100)
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --batch-size 50
```

The pipeline can also be used programmatically:

```python
from src.pipeline import RAGPipeline
# Default paragraph-aware chunking
pipeline = RAGPipeline()
pipeline.run("https://example.com", max_depth=2, page_limit=10)
# First section chunking (reduced number of chunks)
pipeline_first = RAGPipeline(chunking_strategy="first_section")
pipeline_first.run("https://example.com", max_depth=2, page_limit=10)
# Hierarchical chunking (merged sections)
pipeline_hierarchical = RAGPipeline(chunking_strategy="hierarchical")
pipeline_hierarchical.run("https://example.com", max_depth=2, page_limit=10)
# Using sitemap.xml for URL discovery
pipeline_sitemap = RAGPipeline()
pipeline_sitemap.run("https://example.com", max_depth=2, page_limit=10, use_sitemap=True)
# Custom collection name
pipeline_custom = RAGPipeline(collection_name="my_custom_collection")
pipeline_custom.run("https://example.com", max_depth=2, page_limit=10)
# Custom batch size for processing embeddings
pipeline_batched = RAGPipeline()
pipeline_batched.run("https://example.com", max_depth=2, page_limit=10, batch_size=50)
```

The scraper can optionally use sitemap.xml files for more efficient and comprehensive URL discovery. When the `--use-sitemap` flag is provided:
- The scraper automatically tries common sitemap locations: `/sitemap.xml`, `/sitemap_index.xml`, `/sitemap.txt`
- If a sitemap is found, it parses all URLs and scrapes them directly without crawling
- If no sitemap is found, it falls back to traditional crawling
This approach is beneficial because:
- Sitemaps often contain a complete list of pages the website owner wants indexed
- It's more efficient than crawling, especially for large sites
- It ensures you don't miss pages that aren't linked from other pages
- It respects the website owner's intended site structure
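
For illustration, a sitemap-first discovery step could look roughly like the sketch below. This is not the scraper's actual code; the helper name and fallback handling are placeholders:

```python
import requests
from xml.etree import ElementTree

COMMON_SITEMAPS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.txt"]

def discover_urls(base_url):
    """Try common sitemap locations; return an empty list if none is found."""
    for path in COMMON_SITEMAPS:
        resp = requests.get(base_url.rstrip("/") + path, timeout=10)
        if resp.status_code != 200:
            continue
        if path.endswith(".txt"):
            # Plain-text sitemaps list one URL per line.
            return [line.strip() for line in resp.text.splitlines() if line.strip()]
        # XML sitemaps: collect every <loc> entry (for a sitemap index,
        # these are nested sitemap URLs rather than page URLs).
        root = ElementTree.fromstring(resp.content)
        return [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]
    return []

urls = discover_urls("https://example.com")
if not urls:
    print("No sitemap found; falling back to traditional crawling")
```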
The project includes the following test scripts:

- `tests/test_pipeline.py`: Simple test with a sample website
- `tests/test_gnucoop.py`: Test with the GNUcoop website that was returning 403
- `tests/test_chunking.py`: Test for the chunking functionality
- `tests/test_paragraph_chunking.py`: Test for paragraph-aware chunking
- `tests/test_comprehensive_chunking.py`: Comprehensive test for various chunking scenarios
- `tests/test_token_chunking.py`: Test demonstrating token-based vs character-based chunking
- `tests/test_new_settings.py`: Test demonstrating the new default settings (250 tokens, 10 overlap) with all chunking strategies
- `tests/test_batch_processing.py`: Test for batch processing functionality
To run the tests:
```bash
# Run a specific test
python -m tests.test_pipeline
# Run all tests
python -m unittest discover -s tests
# Run the new settings test
python -m tests.test_new_settings
# Run the batch processing test
python -m tests.test_batch_processing
```

The project is organized as follows:

```
├── README.md
├── requirements.txt
├── scraper.py # Main entry point
├── example_usage.py # Example usage
├── .env.example # Example environment variables
├── src/
│   ├── config/
│   │   └── settings.py          # Configuration management
│   ├── pipeline.py              # Main pipeline orchestrator
│   └── utils/
│       ├── scraper.py           # Web scraping functionality
│       ├── chunker.py           # Text chunking
│       ├── embeddings.py        # Embedding generation
│       └── vector_store.py      # Vector storage
└── tests/                       # Test scripts
    ├── test_pipeline.py
    ├── test_gnucoop.py
    ├── test_chunking.py
    ├── test_paragraph_chunking.py
    ├── test_comprehensive_chunking.py
    ├── test_token_chunking.py
    └── test_new_settings.py
```
| Environment Variable | Default Value | Description |
|---|---|---|
| DATABASE_URL | postgresql://user:password@localhost:5432/rag_db | PostgreSQL connection string |
| MAX_DEPTH | 2 | Maximum crawling depth |
| PAGE_LIMIT | 10 | Maximum number of pages to scrape |
| CHUNK_SIZE | 250 | Size of text chunks in tokens |
| CHUNK_OVERLAP | 10 | Overlap between chunks in tokens |
| EMBEDDING_MODEL | intfloat/multilingual-e5-large-instruct | Sentence transformer model for embeddings |
| BATCH_SIZE | 100 | Number of chunks to process in each batch |
Note: The collection name in the vector database defaults to "web_scraping_collection" but can be customized using the --collection-name command-line option or the collection_name parameter when initializing RAGPipeline programmatically.
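
To show how these settings fit together, here is a rough sketch that reads EMBEDDING_MODEL and BATCH_SIZE from the environment and encodes chunks in batches with sentence-transformers. It is illustrative only, not the contents of src/config/settings.py or src/utils/embeddings.py:

```python
import os
from sentence_transformers import SentenceTransformer

# Read the relevant settings, falling back to the documented defaults.
model_name = os.getenv("EMBEDDING_MODEL", "intfloat/multilingual-e5-large-instruct")
batch_size = int(os.getenv("BATCH_SIZE", "100"))

model = SentenceTransformer(model_name)

chunks = ["First chunk of scraped text...", "Second chunk..."]  # placeholder chunks

# encode() processes the list in batches of `batch_size`, keeping memory usage bounded.
embeddings = model.encode(chunks, batch_size=batch_size, normalize_embeddings=True)
print(embeddings.shape)  # (len(chunks), 1024) for multilingual-e5-large-instruct
```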
The pipeline supports four chunking strategies:

- Paragraph-aware chunking (default): Preserves natural paragraph boundaries when possible, only splitting paragraphs that exceed the chunk size limit (a simplified sketch follows this list).
- First section chunking: Creates chunks using only the first section of each document (content before the first major heading). This significantly reduces the number of chunks, which can be useful when you have many documents and want to reduce processing time.
- Hierarchical chunking: Merges H1, H2, and H3 sections to create larger, more meaningful chunks. This strategy is useful when you want to preserve the document structure while reducing the number of chunks.
- Sentence-aware chunking: Splits content at sentence boundaries when paragraph boundaries are not suitable.
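
For reference, a simplified version of the default strategy might look like the sketch below. The real chunker also applies the configured token overlap and splits paragraphs that exceed the chunk size; this version simply packs whole paragraphs into token-limited chunks:

```python
from transformers import AutoTokenizer

# Count tokens with the same tokenizer as the embedding model.
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large-instruct")

def paragraph_chunks(markdown_text, chunk_size=250):
    """Greedily pack whole paragraphs into chunks of at most `chunk_size` tokens."""
    paragraphs = [p.strip() for p in markdown_text.split("\n\n") if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n_tokens = len(tokenizer.encode(para, add_special_tokens=False))
        if current and current_tokens + n_tokens > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```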
Additional notes:

- The first time you run the pipeline, it will download the embedding model, which may take some time.
- Make sure your PostgreSQL database has the pgvector extension enabled.
- The pipeline will create a collection in your database. By default, it uses "web_scraping_collection", but this can be customized with the `--collection-name` option.
- Some models have token limits (e.g., 512 tokens). If your chunks exceed this limit, you may see warnings. Consider adjusting the CHUNK_SIZE setting to stay within your model's limits.
- For large datasets, the pipeline processes embeddings in batches (default size: 100) to manage memory usage efficiently.
- The pipeline includes automatic database connection handling with reconnection for long-running jobs that may exceed database connection timeouts.
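
Finally, a rough sketch of how chunks can be written to and queried from pgvector via langchain_postgres, assuming the `langchain_huggingface` embedding wrapper is installed (the texts, metadata, and query below are placeholders, not the pipeline's internal code):

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_postgres import PGVector

embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large-instruct")

store = PGVector(
    embeddings=embeddings,
    collection_name="web_scraping_collection",  # or your --collection-name value
    connection="postgresql+psycopg://user:password@localhost:5432/rag_db",
    use_jsonb=True,
)

# Store a batch of chunks with their source URLs as metadata.
store.add_texts(
    ["First markdown chunk...", "Second markdown chunk..."],
    metadatas=[{"source": "https://example.com"}, {"source": "https://example.com"}],
)

# Retrieve the most similar chunks for a query.
results = store.similarity_search("example query about the scraped site", k=3)
for doc in results:
    print(doc.metadata["source"], doc.page_content[:80])
```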