Releases: AIMLPM/markcrawl
v0.2.0 — Async engine with process-pool extraction
What's new
3x faster crawling: async I/O overlaps network waits, and a ProcessPoolExecutor sidesteps the GIL for truly parallel HTML extraction.
Performance
- Async httpx engine replaces sequential requests: concurrent fetches with asyncio.gather
- ProcessPoolExecutor offloads CPU-bound BeautifulSoup + markdownify to separate processes
- Streaming pipeline via asyncio.as_completed: pages save as they arrive, no batch-wait
- Benchmark: 15.7 pages/sec at concurrency=5 (up from 3.4 p/s in v0.1.1)
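The pattern described in the list above can be sketched roughly as follows. This is not markcrawl's actual code: `fake_fetch`, `extract_text`, `process_page`, and `crawl_pages` are hypothetical names, the fetch is simulated with a short sleep instead of a real httpx request, and a small stdlib `HTMLParser` subclass stands in for BeautifulSoup + markdownify.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes (stand-in for BeautifulSoup + markdownify)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """CPU-bound step. Top-level function so a process pool can pickle it."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


async def fake_fetch(url):
    """Hypothetical fetch: a real crawler would await an HTTP client here."""
    await asyncio.sleep(0.01)
    return f"<html><body><h1>{url}</h1><p>hello</p></body></html>"


async def process_page(url, semaphore, executor):
    async with semaphore:  # cap the number of in-flight fetches
        html = await fake_fetch(url)
    loop = asyncio.get_running_loop()
    # executor=None falls back to the event loop's default thread pool.
    text = await loop.run_in_executor(executor, extract_text, html)
    return url, text


async def crawl_pages(urls, concurrency=5, use_processes=True):
    executor = ProcessPoolExecutor() if use_processes else None
    semaphore = asyncio.Semaphore(concurrency)
    results = {}
    try:
        tasks = [
            asyncio.create_task(process_page(u, semaphore, executor))
            for u in urls
        ]
        # Handle each page as soon as it finishes, not after the whole batch.
        for finished in asyncio.as_completed(tasks):
            url, text = await finished
            results[url] = text  # a real crawler would save the page here
    finally:
        if executor is not None:
            executor.shutdown()
    return results
```

When using the process pool, call `asyncio.run(crawl_pages(urls))` from under an `if __name__ == "__main__":` guard so worker processes can import the module safely. The semaphore bounds fetch concurrency only; extraction parallelism is bounded by the pool's worker count.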
How to upgrade
pip install --upgrade markcrawl
The async engine activates automatically when httpx is installed:
pip install markcrawl[http2]
Or use it directly:
from markcrawl import crawl
result = crawl("https://example.com", out_dir="output", concurrency=5)
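Activation based on an optional dependency is commonly done with an import check. The sketch below shows one such approach. It is an assumption about the general technique, not a description of markcrawl's internals, and `async_engine_available`, `choose_engine`, and the engine names are hypothetical.

```python
import importlib.util


def async_engine_available():
    """Return True when the optional httpx package can be imported."""
    return importlib.util.find_spec("httpx") is not None


def choose_engine():
    # Hypothetical engine names: prefer the async path when httpx is
    # present, otherwise fall back to the sequential one.
    return "async" if async_engine_available() else "sync"
```

`find_spec` only locates the package without importing it, so the check stays cheap even when httpx is installed.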
v0.1.1
What's changed
- Benchmarks split into separate repo: AIMLPM/llm-crawler-benchmarks
- README benchmark links now point to the new repo
- CLAUDE.md trimmed to crawler-only rules
- Makefile, CI, .gitignore scoped to crawler code only
- Dockerfile cleaned up (non-root user, no benchmark references)
- CONTRIBUTING.md links to benchmark repo
No code changes to the crawler itself.