A web scraping pipeline that extracts content from websites, processes it for LLM consumption, generates embeddings using intfloat/multilingual-e5-large-instruct, and stores vectors in pgvector.
- Configurable web scraping with depth and page limits
- Sitemap.xml support for efficient URL discovery
- HTML to Markdown transformation
- Multiple chunking strategies:
  - Paragraph-aware chunking (default): Preserves paragraph boundaries
  - First section chunking: Creates chunks from just the first section of content
  - Hierarchical chunking: Merges H1/H2/H3 sections for larger, more meaningful chunks
  - Sentence-aware chunking: Splits content at sentence boundaries
- Token-aware chunking with intelligent overlap handling (250 tokens with 10 token overlap by default)
- Vector embeddings generation using intfloat/multilingual-e5-large-instruct
- PostgreSQL/pgvector storage with langchain_postgres and psycopg3
- Batch processing of embeddings for memory efficiency with large datasets
- Automatic database connection handling with reconnection for long-running jobs

Requirements:

- Python 3.8+
- PostgreSQL with pgvector extension
- intfloat/multilingual-e5-large-instruct model

Installation:

- Install PostgreSQL and enable the pgvector extension:

  ```bash
  # On Ubuntu/Debian
  sudo apt install postgresql postgresql-contrib
  git clone https://github.com/pgvector/pgvector.git
  cd pgvector
  make
  sudo make install

  # On macOS with Homebrew
  brew install postgresql
  brew install pgvector

  # Don't forget to enable the extension in your database:
  # CREATE EXTENSION vector;
  ```
- Create a virtual environment:

  ```bash
  python -m venv venv
  ```
- Activate the virtual environment:

  ```bash
  # On Linux/Mac
  source venv/bin/activate

  # On Windows
  venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  This will install all required packages, including `markdownify` for HTML to Markdown conversion and `transformers` for token-aware chunking.
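
As a quick sanity check of the HTML to Markdown step, `markdownify` can be used on its own. The snippet below is only an illustration (the sample HTML is arbitrary), not the pipeline's internal conversion code:

```python
from markdownify import markdownify as md

# Arbitrary sample HTML to demonstrate the conversion.
html = "<h1>Docs</h1><p>Hello <strong>world</strong>.</p><ul><li>First</li><li>Second</li></ul>"

# Convert HTML to Markdown; "atx" produces '#'-style headings.
print(md(html, heading_style="atx"))
```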
Set up your database connection and other settings using environment variables:
```bash
export DATABASE_URL="postgresql+psycopg://user:password@localhost:5432/rag_db"
export MAX_DEPTH=2
export PAGE_LIMIT=10
export CHUNK_SIZE=500
export CHUNK_OVERLAP=50
export BATCH_SIZE=100
```

Or create a `.env` file with these variables.
Run the scraper from the command line:

```bash
# Default paragraph-aware chunking
python scraper.py --url "https://example.com" --depth 2 --page-limit 10
# First section chunking (reduced number of chunks)
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --chunking-strategy first_section
# Hierarchical chunking (merged sections)
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --chunking-strategy hierarchical
# Sentence-aware chunking
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --chunking-strategy sentence
# Using sitemap.xml for URL discovery (automatically tries common sitemap locations)
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --use-sitemap
# Custom collection name in vector database
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --collection-name my_custom_collection
# Custom batch size for processing embeddings (default: 100)
python scraper.py --url "https://example.com" --depth 2 --page-limit 10 --batch-size 50
```

The pipeline can also be used programmatically:

```python
from src.pipeline import RAGPipeline
# Default paragraph-aware chunking
pipeline = RAGPipeline()
pipeline.run("https://example.com", max_depth=2, page_limit=10)
# First section chunking (reduced number of chunks)
pipeline_first = RAGPipeline(chunking_strategy="first_section")
pipeline_first.run("https://example.com", max_depth=2, page_limit=10)
# Hierarchical chunking (merged sections)
pipeline_hierarchical = RAGPipeline(chunking_strategy="hierarchical")
pipeline_hierarchical.run("https://example.com", max_depth=2, page_limit=10)
# Using sitemap.xml for URL discovery
pipeline_sitemap = RAGPipeline()
pipeline_sitemap.run("https://example.com", max_depth=2, page_limit=10, use_sitemap=True)
# Custom collection name
pipeline_custom = RAGPipeline(collection_name="my_custom_collection")
pipeline_custom.run("https://example.com", max_depth=2, page_limit=10)
# Custom batch size for processing embeddings
pipeline_batched = RAGPipeline()
pipeline_batched.run("https://example.com", max_depth=2, page_limit=10, batch_size=50)
```

The scraper can optionally use sitemap.xml files for more efficient and comprehensive URL discovery. When the `--use-sitemap` flag is provided:
- The scraper automatically tries common sitemap locations: `/sitemap.xml`, `/sitemap_index.xml`, `/sitemap.txt`
- If a sitemap is found, it parses all URLs and scrapes them directly without crawling
- If no sitemap is found, it falls back to traditional crawling
This approach is beneficial because:
- Sitemaps often contain a complete list of pages the website owner wants indexed
- It's more efficient than crawling, especially for large sites
- It ensures you don't miss pages that aren't linked from other pages
- It respects the website owner's intended site structure
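
For illustration, a sitemap-first discovery step could look roughly like the sketch below. This is not the scraper's actual code; the helper name and fallback handling are placeholders:

```python
import requests
from xml.etree import ElementTree

COMMON_SITEMAPS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.txt"]

def discover_urls(base_url):
    """Try common sitemap locations; return an empty list if none is found."""
    for path in COMMON_SITEMAPS:
        resp = requests.get(base_url.rstrip("/") + path, timeout=10)
        if resp.status_code != 200:
            continue
        if path.endswith(".txt"):
            # Plain-text sitemaps list one URL per line.
            return [line.strip() for line in resp.text.splitlines() if line.strip()]
        # XML sitemaps: collect every <loc> entry (for a sitemap index,
        # these are nested sitemap URLs rather than page URLs).
        root = ElementTree.fromstring(resp.content)
        return [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]
    return []

urls = discover_urls("https://example.com")
if not urls:
    print("No sitemap found; falling back to traditional crawling")
```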
The project includes the following test scripts:

- `tests/test_pipeline.py`: Simple test with a sample website
- `tests/test_gnucoop.py`: Test with the GNUcoop website that was returning 403
- `tests/test_chunking.py`: Test for the chunking functionality
- `tests/test_paragraph_chunking.py`: Test for paragraph-aware chunking
- `tests/test_comprehensive_chunking.py`: Comprehensive test for various chunking scenarios
- `tests/test_token_chunking.py`: Test demonstrating token-based vs character-based chunking
- `tests/test_new_settings.py`: Test demonstrating the new default settings (250 tokens, 10 overlap) with all chunking strategies
- `tests/test_batch_processing.py`: Test for batch processing functionality
To run the tests:
```bash
# Run a specific test
python -m tests.test_pipeline
# Run all tests
python -m unittest discover -s tests
# Run the new settings test
python -m tests.test_new_settings
# Run the batch processing test
python -m tests.test_batch_processing
```

The project is organized as follows:

```
├── README.md
├── requirements.txt
├── scraper.py # Main entry point
├── example_usage.py # Example usage
├── .env.example # Example environment variables
├── src/
│   ├── config/
│   │   └── settings.py          # Configuration management
│   ├── pipeline.py              # Main pipeline orchestrator
│   └── utils/
│       ├── scraper.py           # Web scraping functionality
│       ├── chunker.py           # Text chunking
│       ├── embeddings.py        # Embedding generation
│       └── vector_store.py      # Vector storage
└── tests/                       # Test scripts
    ├── test_pipeline.py
    ├── test_gnucoop.py
    ├── test_chunking.py
    ├── test_paragraph_chunking.py
    ├── test_comprehensive_chunking.py
    ├── test_token_chunking.py
    └── test_new_settings.py
```
| Environment Variable | Default Value | Description |
|---|---|---|
| DATABASE_URL | postgresql://user:password@localhost:5432/rag_db | PostgreSQL connection string |
| MAX_DEPTH | 2 | Maximum crawling depth |
| PAGE_LIMIT | 10 | Maximum number of pages to scrape |
| CHUNK_SIZE | 250 | Size of text chunks in tokens |
| CHUNK_OVERLAP | 10 | Overlap between chunks in tokens |
| EMBEDDING_MODEL | intfloat/multilingual-e5-large-instruct | Sentence transformer model for embeddings |
| BATCH_SIZE | 100 | Number of chunks to process in each batch |
Note: The collection name in the vector database defaults to "web_scraping_collection" but can be customized using the --collection-name command-line option or the collection_name parameter when initializing RAGPipeline programmatically.
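
To show how these settings fit together, here is a rough sketch that reads EMBEDDING_MODEL and BATCH_SIZE from the environment and encodes chunks in batches with sentence-transformers. It is illustrative only, not the contents of src/config/settings.py or src/utils/embeddings.py:

```python
import os
from sentence_transformers import SentenceTransformer

# Read the relevant settings, falling back to the documented defaults.
model_name = os.getenv("EMBEDDING_MODEL", "intfloat/multilingual-e5-large-instruct")
batch_size = int(os.getenv("BATCH_SIZE", "100"))

model = SentenceTransformer(model_name)

chunks = ["First chunk of scraped text...", "Second chunk..."]  # placeholder chunks

# encode() processes the list in batches of `batch_size`, keeping memory usage bounded.
embeddings = model.encode(chunks, batch_size=batch_size, normalize_embeddings=True)
print(embeddings.shape)  # (len(chunks), 1024) for multilingual-e5-large-instruct
```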
The pipeline supports four chunking strategies:

- Paragraph-aware chunking (default): Preserves natural paragraph boundaries when possible, only splitting paragraphs that exceed the chunk size limit (a simplified sketch follows this list).
- First section chunking: Creates chunks using only the first section of each document (content before the first major heading). This significantly reduces the number of chunks, which can be useful when you have many documents and want to reduce processing time.
- Hierarchical chunking: Merges H1, H2, and H3 sections to create larger, more meaningful chunks. This strategy is useful when you want to preserve the document structure while reducing the number of chunks.
- Sentence-aware chunking: Splits content at sentence boundaries when paragraph boundaries are not suitable.
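
For reference, a simplified version of the default strategy might look like the sketch below. The real chunker also applies the configured token overlap and splits paragraphs that exceed the chunk size; this version simply packs whole paragraphs into token-limited chunks:

```python
from transformers import AutoTokenizer

# Count tokens with the same tokenizer as the embedding model.
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large-instruct")

def paragraph_chunks(markdown_text, chunk_size=250):
    """Greedily pack whole paragraphs into chunks of at most `chunk_size` tokens."""
    paragraphs = [p.strip() for p in markdown_text.split("\n\n") if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n_tokens = len(tokenizer.encode(para, add_special_tokens=False))
        if current and current_tokens + n_tokens > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```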
Additional notes:

- The first time you run the pipeline, it will download the embedding model, which may take some time.
- Make sure your PostgreSQL database has the pgvector extension enabled.
- The pipeline will create a collection in your database. By default, it uses "web_scraping_collection", but this can be customized with the `--collection-name` option.
- Some models have token limits (e.g., 512 tokens). If your chunks exceed this limit, you may see warnings. Consider adjusting the CHUNK_SIZE setting to stay within your model's limits.
- For large datasets, the pipeline processes embeddings in batches (default size: 100) to manage memory usage efficiently.
- The pipeline includes automatic database connection handling with reconnection for long-running jobs that may exceed database connection timeouts.
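
Finally, a rough sketch of how chunks can be written to and queried from pgvector via langchain_postgres, assuming the `langchain_huggingface` embedding wrapper is installed (the texts, metadata, and query below are placeholders, not the pipeline's internal code):

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_postgres import PGVector

embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large-instruct")

store = PGVector(
    embeddings=embeddings,
    collection_name="web_scraping_collection",  # or your --collection-name value
    connection="postgresql+psycopg://user:password@localhost:5432/rag_db",
    use_jsonb=True,
)

# Store a batch of chunks with their source URLs as metadata.
store.add_texts(
    ["First markdown chunk...", "Second markdown chunk..."],
    metadatas=[{"source": "https://example.com"}, {"source": "https://example.com"}],
)

# Retrieve the most similar chunks for a query.
results = store.similarity_search("example query about the scraped site", k=3)
for doc in results:
    print(doc.metadata["source"], doc.page_content[:80])
```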