
# llm-crawler-benchmarks

Which web crawler is best for LLM/RAG pipelines? We tested 7 tools across 8 sites to find out.


Head-to-head benchmark suite comparing web crawlers on speed, extraction quality, retrieval quality, LLM answer quality, and cost at scale. Every benchmark is reproducible from a single command.

## Key Findings

| Dimension | Winner | Key metric | Runner-up |
| --- | --- | --- | --- |
| Speed | `scrapy+md` | 9.1 pages/sec | `colly+md` (5.8 p/s) |
| Extraction quality | `markcrawl` | 100% content signal, 4-word preamble | `scrapy+md` (97%, 23 words) |
| Retrieval quality | `crawlee` | 85% Hit@5, 0.787 MRR | `playwright` (85%, 0.787) |
| LLM answer quality | `markcrawl` | 3.91/5 overall score | `scrapy+md` (3.86/5) |
| Cost at scale | `markcrawl` | $4,505/yr (100K pages, 1K q/day) | `scrapy+md` ($5,464/yr) |
| Pipeline timing | `colly+md` | 259.6 s end-to-end | `scrapy+md` (268.0 s) |

**Bottom line:** No single tool wins everything. `scrapy+md` and `colly+md` are the fastest, but `markcrawl` produces the cleanest output, the best LLM answers, and the lowest cost at scale (21-66% savings on RAG infrastructure). Retrieval quality barely differs between tools: switching retrieval mode (e.g., to reranked) gains 15-20 points, while switching crawlers gains only ~5.
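For readers unfamiliar with the retrieval metrics cited above, here is a minimal sketch of how Hit@k and MRR (mean reciprocal rank) are typically computed. The function names are illustrative and not part of this repository's API:

```python
# Illustrative sketch of Hit@k and MRR, the two retrieval metrics used in
# the Key Findings table. Not the repository's actual evaluation code.

def hit_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the relevant document's first occurrence, or 0.0 if absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def evaluate(queries):
    """queries: list of (ranked_ids, relevant_id) pairs. Averages both metrics."""
    hits = [hit_at_k(ranked, rel) for ranked, rel in queries]
    rrs = [reciprocal_rank(ranked, rel) for ranked, rel in queries]
    n = len(queries)
    return {"hit@5": sum(hits) / n, "mrr": sum(rrs) / n}
```

Because both metrics only look at the rank of the correct chunk, small differences in Markdown cleanliness rarely move them much, which is consistent with the narrow spread between crawlers above.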

## Tools Compared

| Tool | Type | JS rendering | Notes |
| --- | --- | --- | --- |
| `markcrawl` | Python | Optional | Markdown-first, lowest preamble |
| `scrapy+md` | Python | No | Fastest raw-HTTP crawler |
| `crawl4ai` | Python | Built-in | AI-native, browser-based |
| `crawl4ai-raw` | Python | Built-in | `crawl4ai` with raw HTML output |
| `colly+md` | Go | No | Fast compiled crawler |
| `crawlee` | Python | Built-in | Apify's browser crawler |
| `playwright` | Python | Built-in | Microsoft's browser automation |

All tools output Markdown via the same html-to-markdown pipeline (except `crawl4ai-raw`). See [METHODOLOGY.md](METHODOLOGY.md) for tool configurations and fairness decisions.
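To make the fairness point concrete, a shared html-to-markdown step might look roughly like the stdlib-only sketch below. This is an assumption about the pipeline's shape, not the repository's actual converter, which handles far more HTML than this:

```python
# Minimal illustrative HTML -> Markdown normalizer, built only on the
# standard library. A stand-in for the shared conversion step, NOT the
# repository's real pipeline.
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Covers headings, paragraphs, inline code, and list items."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "  # <h2> -> "## "
        elif tag == "code":
            self.out.append("`")
        elif tag == "li":
            self.prefix = "- "

    def handle_endtag(self, tag):
        if tag == "code":
            self.out.append("`")
        elif tag in ("h1", "h2", "h3", "p", "li"):
            self.out.append("\n")
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self.prefix + text)
            self.prefix = ""

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out)
```

Running every crawler's raw HTML through one converter like this isolates crawl behavior (what gets fetched) from conversion behavior (how it is rendered), which is the fairness decision the paragraph above refers to.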

## Sites Tested

| Site | Pages | Type |
| --- | --- | --- |
| quotes.toscrape.com | 15 | Simple paginated HTML |
| books.toscrape.com | 60 | E-commerce catalog |
| fastapi.tiangolo.com | 153 | API docs (code blocks, tutorials) |
| docs.python.org | 500 | Standard library reference |
| react.dev | 500 | SPA, JS-rendered |
| en.wikipedia.org | 50 | Tables, infoboxes, citations |
| docs.stripe.com | 500 | Tabbed content, code samples |
| github.blog | 200 | Blog articles, images |

## Reports

| Report | Question it answers |
| --- | --- |
| Speed Comparison | Which crawler is fastest? |
| Quality Comparison | Which produces the cleanest Markdown? |
| Retrieval Comparison | Does cleaner Markdown improve retrieval? |
| Answer Quality | Does better retrieval improve LLM answers? |
| Cost at Scale | What does each crawler cost at 100K+ pages? |
| Pipeline Timing | How long does the full RAG pipeline take? |
| MarkCrawl Self-Benchmark | MarkCrawl standalone performance |
| Methodology | How were these benchmarks run? |

## Quick Start

```bash
# Install dependencies
pip install -e ".[dev]"

# Preflight check (verifies all tools are installed)
python preflight.py

# Run all benchmarks (~3-5 hours)
python benchmark_all_tools.py

# Run individual benchmarks
python benchmark_quality.py
python benchmark_retrieval.py
python benchmark_answer_quality.py
python benchmark_pipeline.py
python benchmark_markcrawl.py
```

### Docker

```bash
docker build -t llm-crawler-benchmarks .
docker run --rm \
  -e OPENAI_API_KEY \
  -v "$(pwd)/reports:/app/reports" \
  -v "$(pwd)/runs:/app/runs" \
  llm-crawler-benchmarks
```

## Related Work

Other projects benchmark parts of the web-scraping pipeline:

- **Firecrawl scrape-evals** — 1,000-URL extraction-quality benchmark (precision/recall). Single-page quality only; no speed, retrieval, or LLM answer evaluation.
- **WCXB** — 2,008-page content-extraction leaderboard with word-level F1. Covers traditional tools (trafilatura, readability) but not LLM-era crawlers.
- **Spider.cloud benchmark** — 3-tool comparison (Firecrawl, Crawl4AI, Spider) on throughput, cost, and RAG retrieval accuracy.

This project differs by evaluating the full RAG pipeline — from crawl through chunk, embed, retrieve, and LLM answer — across 7 tools, 8 sites, and 5 dimensions including downstream answer quality and cost at scale.
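The full pipeline being measured can be sketched end to end. Every function below is a deliberately simplified placeholder (token-overlap scoring instead of embeddings, a stub instead of an LLM call), not this repository's API, but it shows which stage each benchmark dimension attaches to:

```python
# Placeholder sketch of the crawl -> chunk -> embed -> retrieve -> answer
# pipeline this suite benchmarks. All names and logic are illustrative.

def chunk(markdown: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split crawled Markdown into overlapping character windows."""
    step = size - overlap
    return [markdown[i:i + size] for i in range(0, max(len(markdown), 1), step)]

def embed(chunks):
    """Toy 'embedding': bag-of-words token counts per chunk."""
    return [{w: c.split().count(w) for w in set(c.split())} for c in chunks]

def retrieve(query, chunks, vectors, k=5):
    """Rank chunks by token overlap with the query (stand-in for cosine sim)."""
    q = set(query.split())
    order = sorted(range(len(chunks)), key=lambda i: -len(q & set(vectors[i])))
    return [chunks[i] for i in order[:k]]

def answer(query, context):
    """Stub for the LLM call whose output gets judged for answer quality."""
    return f"Answer to {query!r} using {len(context)} retrieved chunks"
```

Speed and extraction quality are properties of the crawl stage, retrieval quality of the chunk/embed/retrieve stages, and answer quality and cost of the final stage, which is why the benchmarks here are run as one connected pipeline rather than in isolation.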

## Self-Improvement Framework

The `self_improvement/` directory contains a 9-spec review framework for auditing benchmark quality. See `self_improvement/MASTER.md`.

## License

MIT
