Releases: AIMLPM/markcrawl

v0.2.0 — Async engine with process-pool extraction

12 Apr 18:20

What's new

3x faster crawling: async I/O overlaps network waits, and ProcessPoolExecutor sidesteps the GIL for truly parallel HTML extraction.

Performance

  • Async httpx engine replaces sequential requests — concurrent fetches with asyncio.gather
  • ProcessPoolExecutor offloads CPU-bound BeautifulSoup + markdownify to separate processes
  • Streaming pipeline via asyncio.as_completed — pages save as they arrive, no batch-wait
  • Benchmark: 15.7 pages/sec at concurrency=5 (up from 3.4 pages/sec in v0.1.1)
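
The three pieces above fit together roughly as in the sketch below. It is an illustration under stated assumptions, not markcrawl's actual code: `fetch`, `extract_text`, and `crawl_pages` are hypothetical names, the network call is simulated with `asyncio.sleep` instead of httpx, and a regex tag-stripper stands in for BeautifulSoup + markdownify. Only the standard library is used.

```python
import asyncio
import re
from concurrent.futures import ProcessPoolExecutor


def extract_text(html: str) -> str:
    """CPU-bound stand-in for BeautifulSoup + markdownify: strip tags."""
    return re.sub(r"<[^>]+>", "", html).strip()


async def fetch(url: str) -> tuple[str, str]:
    """Stand-in for an httpx request (hypothetical): simulate network latency."""
    await asyncio.sleep(0.01)
    return url, f"<html><body><h1>{url}</h1></body></html>"


async def crawl_pages(urls, executor, concurrency=5):
    """Fetch concurrently, extract in the executor, collect pages as they finish."""
    loop = asyncio.get_running_loop()
    limit = asyncio.Semaphore(concurrency)  # cap the number of in-flight fetches

    async def process(url):
        async with limit:
            url, html = await fetch(url)
        # Hand the CPU-bound step to the executor so the event loop stays free.
        text = await loop.run_in_executor(executor, extract_text, html)
        return url, text

    tasks = [asyncio.create_task(process(u)) for u in urls]
    results = {}
    # as_completed yields in finish order, so no page waits for the whole batch.
    for finished in asyncio.as_completed(tasks):
        url, text = await finished
        results[url] = text  # a real crawler would write the page to disk here
    return results


if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(10)]
    with ProcessPoolExecutor(max_workers=2) as pool:
        pages = asyncio.run(crawl_pages(urls, pool))
    print(len(pages), "pages extracted")
```

The extraction function is defined at module level because a process pool must pickle the callable to send it to a worker; a lambda or nested function would fail at that step.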

How to upgrade

pip install --upgrade markcrawl

The async engine activates automatically when httpx is installed:

pip install "markcrawl[http2]"

Or use it directly:

from markcrawl import crawl
result = crawl("https://example.com", out_dir="output", concurrency=5)
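
Because the async engine depends on httpx being present, it can help to check the environment before benchmarking. The snippet below is a minimal sketch using only the standard library; `has_module` is a hypothetical helper, and the "sequential fallback" label is an assumption about what happens without httpx, inferred from these notes rather than confirmed by them.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Assumption: without httpx, markcrawl uses its older sequential path.
    engine = "async" if has_module("httpx") else "sequential fallback"
    print(f"httpx installed: {has_module('httpx')} (expected engine: {engine})")
```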

Full changelog

v0.1.1...v0.2.0

v0.1.1

09 Apr 07:57

What's changed

  • Benchmarks split into separate repo: AIMLPM/llm-crawler-benchmarks
  • README benchmark links now point to the new repo
  • CLAUDE.md trimmed to crawler-only rules
  • Makefile, CI, .gitignore scoped to crawler code only
  • Dockerfile cleaned up (non-root user, no benchmark references)
  • CONTRIBUTING.md links to benchmark repo

No code changes to the crawler itself.