Releases · AIMLPM/markcrawl

What's new

3x faster crawling: async I/O overlaps network fetches, and a ProcessPoolExecutor sidesteps the GIL for truly parallel HTML extraction.

Async httpx engine replaces sequential requests — concurrent fetches with asyncio.gather
ProcessPoolExecutor offloads CPU-bound BeautifulSoup + markdownify to separate processes
Streaming pipeline via asyncio.as_completed — pages save as they arrive, no batch-wait
Benchmark: 15.7 pages/sec at concurrency=5 (up from 3.4 pages/sec in v0.1.1)
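
The three pieces listed above (concurrent fetches, CPU-bound extraction in a process pool, and results consumed as they complete) can be sketched with the standard library alone. This is a minimal illustration, not markcrawl's actual implementation: `fetch` simulates the network call that would be an httpx request, `extract` is a regex stand-in for BeautifulSoup + markdownify, and every function name here is hypothetical.

```python
import asyncio
import re
from concurrent.futures import ProcessPoolExecutor

TAG_RE = re.compile(r"<[^>]+>")


def extract(html: str) -> str:
    """CPU-bound step: strip tags and collapse whitespace.

    Stand-in for the BeautifulSoup + markdownify work; it must be a
    top-level function so a process pool can pickle it.
    """
    return " ".join(TAG_RE.sub(" ", html).split())


async def fetch(url: str) -> str:
    """Simulated fetch; a real crawler would await an HTTP request here."""
    await asyncio.sleep(0.01)
    return f"<html><body><h1>{url}</h1><p>hello</p></body></html>"


async def process(url, sem, executor):
    # The semaphore caps how many fetches are in flight at once.
    async with sem:
        html = await fetch(url)
    # Hand the CPU-bound extraction to the executor so the event loop
    # stays free to service other fetches.
    loop = asyncio.get_running_loop()
    text = await loop.run_in_executor(executor, extract, html)
    return url, text


async def crawl_all(urls, executor=None, concurrency=5):
    """Return (url, text) pairs in completion order.

    executor=None falls back to the loop's default thread pool; pass a
    ProcessPoolExecutor to run extraction outside the GIL.
    """
    sem = asyncio.Semaphore(concurrency)
    tasks = [asyncio.create_task(process(u, sem, executor)) for u in urls]
    results = []
    # as_completed yields each page as soon as it is ready, so nothing
    # waits for the slowest page in the batch.
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results


def run_demo():
    urls = [f"https://example.com/page{i}" for i in range(10)]
    with ProcessPoolExecutor() as pool:
        for url, text in asyncio.run(crawl_all(urls, executor=pool)):
            print(url, "->", text)
```

Call `run_demo()` from under an `if __name__ == "__main__":` guard, which process pools need on platforms that start workers by spawning a fresh interpreter.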

pip install --upgrade markcrawl

The async engine activates automatically when httpx is installed:

pip install "markcrawl[http2]"

Or call it directly from Python:

from markcrawl import crawl
result = crawl("https://example.com", out_dir="output", concurrency=5)
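
To check a pages/sec figure on your own site, one rough approach is to time the call and count what lands in the output directory. The sketch below assumes one saved file per page with a `.md` extension, which the release notes do not confirm; adjust `pattern` to match your output. `count_pages`, `pages_per_second`, and `timed_crawl` are hypothetical helpers, not part of markcrawl, and only the `crawl(url, out_dir=..., concurrency=...)` call is taken from the example above.

```python
import time
from pathlib import Path


def count_pages(out_dir: str, pattern: str = "*.md") -> int:
    """Count saved pages, assuming one file per page matching `pattern`."""
    return sum(1 for p in Path(out_dir).rglob(pattern) if p.is_file())


def pages_per_second(pages: int, elapsed: float) -> float:
    """Throughput in pages/sec; returns 0.0 for a zero-length interval."""
    return pages / elapsed if elapsed > 0 else 0.0


def timed_crawl(url: str, out_dir: str = "output", concurrency: int = 5) -> float:
    # Imported lazily so the helpers above work without markcrawl installed.
    from markcrawl import crawl

    start = time.perf_counter()
    crawl(url, out_dir=out_dir, concurrency=concurrency)
    elapsed = time.perf_counter() - start
    return pages_per_second(count_pages(out_dir), elapsed)
```

Start from an empty output directory, since files left over from an earlier run would inflate the count.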

What's changed

Benchmarks split into separate repo: AIMLPM/llm-crawler-benchmarks

README benchmark links now point to the new repo

CLAUDE.md trimmed to crawler-only rules

Makefile, CI, .gitignore scoped to crawler code only

Dockerfile cleaned up (non-root user, no benchmark references)

CONTRIBUTING.md links to benchmark repo

No code changes to the crawler itself.