
# llm-crawler-benchmarks

Which web crawler is best for LLM/RAG pipelines? We tested 7 tools across 8 sites to find out.


Head-to-head benchmark suite comparing web crawlers on speed, extraction quality, retrieval quality, LLM answer quality, and cost at scale. Every benchmark is reproducible from a single command.

## Key Findings

| Dimension | Winner | Key metric | Runner-up |
| --- | --- | --- | --- |
| Speed | `scrapy+md` | 9.1 pages/sec | `colly+md` (5.8 p/s) |
| Extraction quality | `markcrawl` | 100% content signal, 4-word preamble | `scrapy+md` (97%, 23 words) |
| Retrieval quality | `crawlee` | 85% Hit@5, 0.787 MRR | `playwright` (85%, 0.787) |
| LLM answer quality | `markcrawl` | 3.91/5 overall score | `scrapy+md` (3.86/5) |
| Cost at scale | `markcrawl` | $4,505/yr (100K pages, 1K q/day) | `scrapy+md` ($5,464/yr) |
| Pipeline timing | `colly+md` | 259.6 s end-to-end | `scrapy+md` (268.0 s) |

**Bottom line:** No single tool wins everything. `scrapy+md` and `colly+md` are the fastest, but `markcrawl` produces the cleanest output, the best LLM answers, and the lowest cost at scale (21-66% savings on RAG infrastructure). Retrieval quality barely differs between tools: switching retrieval mode (e.g., to reranked) gains 15-20 points, while switching crawlers gains only ~5.
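For readers unfamiliar with the retrieval metrics cited above, here is a minimal sketch of how Hit@k and MRR (mean reciprocal rank) are typically computed. The function names are illustrative and not part of this repository's API:

```python
# Illustrative sketch of Hit@k and MRR, the two retrieval metrics used in
# the Key Findings table. Not the repository's actual evaluation code.

def hit_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the relevant document's first occurrence, or 0.0 if absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def evaluate(queries):
    """queries: list of (ranked_ids, relevant_id) pairs. Averages both metrics."""
    hits = [hit_at_k(ranked, rel) for ranked, rel in queries]
    rrs = [reciprocal_rank(ranked, rel) for ranked, rel in queries]
    n = len(queries)
    return {"hit@5": sum(hits) / n, "mrr": sum(rrs) / n}
```

Because both metrics only look at the rank of the correct chunk, small differences in Markdown cleanliness rarely move them much, which is consistent with the narrow spread between crawlers above.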

## Tools Compared

| Tool | Type | JS rendering | Notes |
| --- | --- | --- | --- |
| `markcrawl` | Python | Optional | Markdown-first, lowest preamble |
| `scrapy+md` | Python | No | Fastest raw-HTTP crawler |
| `crawl4ai` | Python | Built-in | AI-native, browser-based |
| `crawl4ai-raw` | Python | Built-in | `crawl4ai` with raw HTML output |
| `colly+md` | Go | No | Fast compiled crawler |
| `crawlee` | Python | Built-in | Apify's browser crawler |
| `playwright` | Python | Built-in | Microsoft's browser automation |

All tools output Markdown via the same html-to-markdown pipeline (except `crawl4ai-raw`). See [METHODOLOGY.md](METHODOLOGY.md) for tool configurations and fairness decisions.
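To make the fairness point concrete, a shared html-to-markdown step might look roughly like the stdlib-only sketch below. This is an assumption about the pipeline's shape, not the repository's actual converter, which handles far more HTML than this:

```python
# Minimal illustrative HTML -> Markdown normalizer, built only on the
# standard library. A stand-in for the shared conversion step, NOT the
# repository's real pipeline.
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Covers headings, paragraphs, inline code, and list items."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "  # <h2> -> "## "
        elif tag == "code":
            self.out.append("`")
        elif tag == "li":
            self.prefix = "- "

    def handle_endtag(self, tag):
        if tag == "code":
            self.out.append("`")
        elif tag in ("h1", "h2", "h3", "p", "li"):
            self.out.append("\n")
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self.prefix + text)
            self.prefix = ""

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out)
```

Running every crawler's raw HTML through one converter like this isolates crawl behavior (what gets fetched) from conversion behavior (how it is rendered), which is the fairness decision the paragraph above refers to.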

## Sites Tested

| Site | Pages | Type |
| --- | --- | --- |
| quotes.toscrape.com | 15 | Simple paginated HTML |
| books.toscrape.com | 60 | E-commerce catalog |
| fastapi.tiangolo.com | 153 | API docs (code blocks, tutorials) |
| docs.python.org | 500 | Standard library reference |
| react.dev | 500 | SPA, JS-rendered |
| en.wikipedia.org | 50 | Tables, infoboxes, citations |
| docs.stripe.com | 500 | Tabbed content, code samples |
| github.blog | 200 | Blog articles, images |

## Reports

| Report | Question it answers |
| --- | --- |
| Speed Comparison | Which crawler is fastest? |
| Quality Comparison | Which produces the cleanest Markdown? |
| Retrieval Comparison | Does cleaner Markdown improve retrieval? |
| Answer Quality | Does better retrieval improve LLM answers? |
| Cost at Scale | What does each crawler cost at 100K+ pages? |
| Pipeline Timing | How long does the full RAG pipeline take? |
| MarkCrawl Self-Benchmark | MarkCrawl standalone performance |
| Methodology | How were these benchmarks run? |

## Quick Start

```bash
# Install dependencies
pip install -e ".[dev]"

# Preflight check (verifies all tools are installed)
python preflight.py

# Run all benchmarks (~3-5 hours)
python benchmark_all_tools.py

# Run individual benchmarks
python benchmark_quality.py
python benchmark_retrieval.py
python benchmark_answer_quality.py
python benchmark_pipeline.py
python benchmark_markcrawl.py
```

### Docker

```bash
docker build -t llm-crawler-benchmarks .
docker run --rm \
  -e OPENAI_API_KEY \
  -v "$(pwd)/reports:/app/reports" \
  -v "$(pwd)/runs:/app/runs" \
  llm-crawler-benchmarks
```

## Related Work

Other projects benchmark parts of the web-scraping pipeline:

- **Firecrawl scrape-evals** — 1,000-URL extraction-quality benchmark (precision/recall). Single-page quality only; no speed, retrieval, or LLM answer evaluation.
- **WCXB** — 2,008-page content-extraction leaderboard with word-level F1. Covers traditional tools (trafilatura, readability) but not LLM-era crawlers.
- **Spider.cloud benchmark** — 3-tool comparison (Firecrawl, Crawl4AI, Spider) on throughput, cost, and RAG retrieval accuracy.

This project differs by evaluating the full RAG pipeline — from crawl through chunk, embed, retrieve, and LLM answer — across 7 tools, 8 sites, and 5 dimensions including downstream answer quality and cost at scale.
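The full pipeline being measured can be sketched end to end. Every function below is a deliberately simplified placeholder (token-overlap scoring instead of embeddings, a stub instead of an LLM call), not this repository's API, but it shows which stage each benchmark dimension attaches to:

```python
# Placeholder sketch of the crawl -> chunk -> embed -> retrieve -> answer
# pipeline this suite benchmarks. All names and logic are illustrative.

def chunk(markdown: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split crawled Markdown into overlapping character windows."""
    step = size - overlap
    return [markdown[i:i + size] for i in range(0, max(len(markdown), 1), step)]

def embed(chunks):
    """Toy 'embedding': bag-of-words token counts per chunk."""
    return [{w: c.split().count(w) for w in set(c.split())} for c in chunks]

def retrieve(query, chunks, vectors, k=5):
    """Rank chunks by token overlap with the query (stand-in for cosine sim)."""
    q = set(query.split())
    order = sorted(range(len(chunks)), key=lambda i: -len(q & set(vectors[i])))
    return [chunks[i] for i in order[:k]]

def answer(query, context):
    """Stub for the LLM call whose output gets judged for answer quality."""
    return f"Answer to {query!r} using {len(context)} retrieved chunks"
```

Speed and extraction quality are properties of the crawl stage, retrieval quality of the chunk/embed/retrieve stages, and answer quality and cost of the final stage, which is why the benchmarks here are run as one connected pipeline rather than in isolation.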

## Self-Improvement Framework

The `self_improvement/` directory contains a 9-spec review framework for auditing benchmark quality. See `self_improvement/MASTER.md`.

## License

MIT
