Head-to-head benchmark suite comparing web crawlers on speed, extraction quality, retrieval quality, LLM answer quality, and cost at scale. Every benchmark is reproducible from a single command.
| Dimension | Winner | Key metric | Runner-up |
|---|---|---|---|
| Speed | scrapy+md | 9.1 pages/sec | colly+md (5.8 p/s) |
| Extraction quality | markcrawl | 100% content signal, 4 words preamble | scrapy+md (97%, 23 words) |
| Retrieval quality | crawlee | 85% Hit@5, 0.787 MRR | playwright (85%, 0.787) |
| LLM answer quality | markcrawl | 3.91/5 overall score | scrapy+md (3.86/5) |
| Cost at scale | markcrawl | $4,505/yr (100K pages, 1K q/day) | scrapy+md ($5,464/yr) |
| Pipeline timing | colly+md | 259.6s end-to-end | scrapy+md (268.0s) |
Bottom line: no single tool wins everything. scrapy+md and colly+md are the fastest, but markcrawl produces the cleanest output, the best LLM answers, and the lowest cost at scale (21-66% savings on RAG infrastructure). Retrieval quality barely differs between tools: switching retrieval mode (e.g., to reranked) gains 15-20 points, while switching crawlers gains only ~5.
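Hit@5 and MRR, the retrieval metrics above, can be computed directly from each query's ranked results. A minimal sketch (the harness's own scoring code may differ):

```python
def hit_at_k(ranked_ids, relevant_id, k=5):
    """1 if the relevant document appears in the top-k results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def mrr(all_ranked, all_relevant):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit
    per query, counting 0 when the relevant document is never retrieved."""
    total = 0.0
    for ranked_ids, relevant_id in zip(all_ranked, all_relevant):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == relevant_id:
                total += 1.0 / rank
                break
    return total / len(all_ranked)
```

An 85% Hit@5 with 0.787 MRR means the right page is almost always in the top five, and usually at or near rank one.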
| Tool | Type | JS rendering | Notes |
|---|---|---|---|
| markcrawl | Python | Optional | Markdown-first, lowest preamble |
| scrapy+md | Python | No | Fastest raw HTTP crawler |
| crawl4ai | Python | Built-in | AI-native, browser-based |
| crawl4ai-raw | Python | Built-in | crawl4ai with raw HTML output |
| colly+md | Go | No | Fast compiled crawler |
| crawlee | Python | Built-in | Apify's browser crawler |
| playwright | Python | Built-in | Microsoft's browser automation |
All tools output Markdown via the same html-to-markdown pipeline (except crawl4ai-raw). See METHODOLOGY.md for tool configurations and fairness decisions.
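The point of the shared pipeline is that Markdown differences reflect crawling, not conversion. This toy stdlib converter illustrates the idea; it is a stand-in, not the actual html-to-markdown pipeline documented in METHODOLOGY.md:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter: one shared converter applied to
    every tool's raw HTML, so output differences come from the crawl."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("#" * int(tag[1]) + " ")  # heading prefix
        elif tag == "li":
            self.out.append("- ")
        elif tag == "code":
            self.out.append("`")

    def handle_endtag(self, tag):
        if tag == "code":
            self.out.append("`")
        elif tag in ("p", "h1", "h2", "h3", "li"):
            self.out.append("\n")  # block elements end a line

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text nodes
            self.out.append(data)

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()
```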
| Site | Pages | Type |
|---|---|---|
| quotes.toscrape.com | 15 | Simple paginated HTML |
| books.toscrape.com | 60 | E-commerce catalog |
| fastapi.tiangolo.com | 153 | API docs (code blocks, tutorials) |
| docs.python.org | 500 | Standard library reference |
| react.dev | 500 | SPA, JS-rendered |
| en.wikipedia.org | 50 | Tables, infoboxes, citations |
| docs.stripe.com | 500 | Tabbed content, code samples |
| github.blog | 200 | Blog articles, images |
| Report | Question it answers |
|---|---|
| Speed Comparison | Which crawler is fastest? |
| Quality Comparison | Which produces the cleanest Markdown? |
| Retrieval Comparison | Does cleaner Markdown improve retrieval? |
| Answer Quality | Does better retrieval improve LLM answers? |
| Cost at Scale | What does each crawler cost at 100K+ pages? |
| Pipeline Timing | How long does the full RAG pipeline take? |
| MarkCrawl Self-Benchmark | MarkCrawl standalone performance |
| Methodology | How were these benchmarks run? |
```bash
# Install dependencies
pip install -e ".[dev]"

# Preflight check (verifies all tools are installed)
python preflight.py

# Run all benchmarks (~3-5 hours)
python benchmark_all_tools.py

# Run individual benchmarks
python benchmark_quality.py
python benchmark_retrieval.py
python benchmark_answer_quality.py
python benchmark_pipeline.py
python benchmark_markcrawl.py
```

To run inside Docker:

```bash
docker build -t llm-crawler-benchmarks .
docker run --rm \
  -e OPENAI_API_KEY \
  -v $(pwd)/reports:/app/reports \
  -v $(pwd)/runs:/app/runs \
  llm-crawler-benchmarks
```

Other projects benchmark parts of the web scraping pipeline:
- Firecrawl scrape-evals — 1,000-URL extraction quality benchmark (precision/recall). Single-page quality only; no speed, retrieval, or LLM answer evaluation.
- WCXB — 2,008-page content extraction leaderboard with word-level F1. Covers traditional tools (trafilatura, readability) but not LLM-era crawlers.
- Spider.cloud benchmark — 3-tool comparison (Firecrawl, Crawl4AI, Spider) on throughput, cost, and RAG retrieval accuracy.
This project differs by evaluating the full RAG pipeline — from crawl through chunk, embed, retrieve, and LLM answer — across 7 tools, 8 sites, and 5 dimensions including downstream answer quality and cost at scale.
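The crawl → chunk → embed → retrieve → answer stages can be sketched as follows. Function names and the bag-of-words "embedding" are illustrative placeholders, not the harness's real code, which uses model embeddings:

```python
import math
from collections import Counter

def chunk(markdown: str, size: int = 200):
    """Split crawled Markdown into fixed-size word windows."""
    words = markdown.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str):
    """Stand-in embedding: bag-of-words counts (real runs use a model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=5):
    """Rank chunks by similarity to the query; the top-k would be
    placed in the LLM prompt for the answer-quality stage."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Because only the crawl stage changes between tools, cleaner Markdown propagates through every downstream stage, which is why extraction quality shows up again in answer quality and cost.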
The self_improvement/ directory contains a 9-spec review framework for auditing benchmark quality. See self_improvement/MASTER.md.
MIT