Self-hosted Node.js package — unified web layer for AI agents. One API for search, browse, crawl, auth, and platform actions. 5,000 free searches/month via Gemini Grounded Search. Open source, MIT, npm installable.
Block detection for 15+ anti-bot patterns. Automatic fallback chain: Reddit via PullPush API, Amazon via Jina Reader. LinkedIn reading unsolved (needs residential proxy).
github.com/FayAndXan/spectrawl (public)
- npm: `spectrawl@0.6.2`
- Dockerfile: node:22-slim, port 3900
- Spectrawl systemd service: spectrawl.service, localhost:3900, auto-restart
- WorkingDirectory: /root/.openclaw/workspace-dijiclaw/projects/spectrawl
- GITHUB_TOKEN + GEMINI_API_KEY in service env
- Old proxy dead (204.252.81.197:46620). Proxy rotation system built but no working upstream.
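Since the rotation system exists but lacks a live upstream, the shape of such a rotator can be sketched as below. This is an illustrative round-robin picker with dead-upstream tracking, not Spectrawl's actual `src/proxy` code; class and method names are assumptions.

```javascript
// Hypothetical round-robin proxy rotator: cycle through upstreams,
// skipping ones previously marked dead (like the expired one above).
class ProxyRotator {
  constructor(upstreams) {
    this.upstreams = upstreams; // e.g. ["http://user:pass@host:port", ...]
    this.i = 0;
    this.dead = new Set();
  }
  next() {
    // Try each slot once per call; return null when every upstream is dead.
    for (let n = 0; n < this.upstreams.length; n++) {
      const proxy = this.upstreams[this.i % this.upstreams.length];
      this.i++;
      if (!this.dead.has(proxy)) return proxy;
    }
    return null;
  }
  markDead(proxy) {
    this.dead.add(proxy);
  }
}
```

A caller would `markDead()` an upstream on connection failure and fall back to direct browsing when `next()` returns null.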
| Site | Status | Method |
|---|---|---|
| GitHub, X, blogs | ✅ Works | Camoufox direct browse |
| Reddit | ✅ Solved | PullPush API fallback (free, no auth) |
| Amazon | ✅ Solved | Jina Reader fallback |
| LinkedIn | ❌ Blocked | Needs residential proxy, all free paths exhausted |
| Cloudflare sites | ❌ Blocked | Block flagged but no workaround |
Block detection covers 15+ patterns: Reddit, Amazon, LinkedIn, Cloudflare (inc. RFC 9457), Akamai, AWS WAF, Imperva, DataDome, PerimeterX, hCaptcha, reCAPTCHA, Google consent. Plus a content-quality heuristic (<100 chars of text from known-large sites).
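The detection step can be sketched as a marker scan plus the content-quality check. The marker regexes and helper names below are illustrative assumptions, not the actual pattern list.

```javascript
// Illustrative block detection: known anti-bot markers, then a content-quality
// heuristic (<100 chars of extracted text from a site known to serve large pages).
const BLOCK_MARKERS = [
  /just a moment/i,          // Cloudflare challenge page (assumed marker)
  /access denied.*akamai/is, // Akamai (assumed marker)
  /perimeterx|datadome/i,
  /hcaptcha|recaptcha/i,
];

const KNOWN_LARGE_SITES = new Set(["reddit.com", "amazon.com", "linkedin.com"]);

function detectBlock(host, html, extractedText) {
  for (const re of BLOCK_MARKERS) {
    if (re.test(html)) return { blocked: true, reason: "anti-bot marker" };
  }
  // A near-empty page from a known-large site usually means a soft block
  // or consent interstitial rather than real content.
  if (KNOWN_LARGE_SITES.has(host) && extractedText.trim().length < 100) {
    return { blocked: true, reason: "suspiciously short content" };
  }
  return { blocked: false };
}
```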
Pre-routes known-blocked sites through alternative APIs before wasting time on Playwright:
- Reddit: PullPush API — subreddit listings, threads + comments, search
- Amazon: Jina Reader — renders pages server-side, returns markdown
- Returns `blocked: true` with an actionable error message when no fallback works
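The pre-routing logic above can be sketched as a hostname dispatcher. The Jina Reader `r.jina.ai/<url>` prefix and the PullPush host are their public forms; the function shape and the exact endpoint chosen per query type are assumptions.

```javascript
// Sketch: route known-blocked hosts to alternative fetchers before spending
// a browser tab on them; surface blocked: true when no fallback exists.
function routeFallback(url) {
  const host = new URL(url).hostname.replace(/^www\./, "");
  if (host === "reddit.com" || host.endsWith(".reddit.com")) {
    // PullPush mirrors Reddit data; the real endpoint depends on query type.
    return { method: "pullpush", target: "https://api.pullpush.io/reddit/search/submission/" };
  }
  if (host === "amazon.com" || host.endsWith(".amazon.com")) {
    // Jina Reader renders the page server-side and returns markdown.
    return { method: "jina", target: "https://r.jina.ai/" + url };
  }
  if (host === "linkedin.com" || host.endsWith(".linkedin.com")) {
    // No working free fallback: fail fast with an actionable error.
    return { blocked: true, error: "LinkedIn requires a residential proxy" };
  }
  return null; // no pre-route: continue with the normal browse path
}
```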
- Camoufox-only, auto-parallel based on RAM (~250MB/tab)
- fastMode: 400ms wait + instant scroll
- Async jobs: POST with `async: true`, poll GET `/crawl/{jobId}`
- Sitemap crawling enabled by default
- Performance: 10 pages in 14s, ~200 pages in 3 min
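A minimal client for the async flow above, assuming Node 18+ for global `fetch`. The `jobId` and `status` response keys beyond what the notes state are assumptions about the job payload.

```javascript
// Sketch of the async crawl flow: POST /crawl with async: true,
// then poll GET /crawl/{jobId} until the job settles or times out.
const BASE = "http://localhost:3900";

function jobUrl(jobId) {
  return `${BASE}/crawl/${encodeURIComponent(jobId)}`;
}

async function crawlAsync(url, { pollMs = 2000, timeoutMs = 180000 } = {}) {
  const res = await fetch(`${BASE}/crawl`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ url, async: true }),
  });
  const { jobId } = await res.json(); // assumed response field
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const job = await (await fetch(jobUrl(jobId))).json();
    if (job.status === "done") return job;          // assumed status values
    if (job.status === "failed") throw new Error(job.error);
    await new Promise((r) => setTimeout(r, pollMs));
  }
  throw new Error("crawl job timed out");
}
```

With ~200 pages finishing in about 3 minutes, the default 2s poll interval keeps polling overhead negligible.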
- Structured extraction (`/extract`): schema-driven, Gemini Flash
- AI browser agent (`/agent`): simplified DOM, 100 element cap
- Network capture (XHR/fetch only)
- Sitemap crawling (enabled by default)
- Webhook notifications (fire-and-forget)
- BM25 relevance filtering
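The BM25 relevance filter can be illustrated with a toy scorer: rank crawled pages against the query and drop low scorers. Constants `k1`/`b` are the conventional defaults; the tokenizer is a simplification of whatever the real pipeline uses.

```javascript
// Toy BM25: score each document against the query terms.
function tokenize(s) {
  return s.toLowerCase().split(/\W+/).filter(Boolean);
}

function bm25Scores(query, docs, k1 = 1.2, b = 0.75) {
  const docTokens = docs.map(tokenize);
  const N = docs.length;
  const avgdl = docTokens.reduce((sum, t) => sum + t.length, 0) / N;
  return docTokens.map((tokens) => {
    let score = 0;
    for (const term of new Set(tokenize(query))) {
      const n = docTokens.filter((t) => t.includes(term)).length; // doc frequency
      if (n === 0) continue;
      const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
      const tf = tokens.filter((t) => t === term).length;
      score += (idf * tf * (k1 + 1)) /
               (tf + k1 * (1 - b + (b * tokens.length) / avgdl));
    }
    return score;
  });
}
```

Pages scoring zero against the query (no shared terms) are the natural candidates to filter out of crawl results.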
- Search: 8 engines, deep search, source ranking, scraping
- Browse: 3-tier stealth (Playwright → Camoufox → Remote) + site overrides
- Crawl: parallel, async jobs, RAM-aware, fast mode
- HTTP Server: /search, /browse, /crawl, /extract, /agent, /act, /status, /health
- MCP Server: stdio transport, 5 tools
- Auth: SQLite cookies, 5 accounts stored (Reddit, IndieHackers, etc.)
- CAPTCHA: stealth bypass → Gemini Vision → unsolvable
- Adapters: 24 total
- Rate limiter + dedup, Form filler
- Proxy rotation system (built, needs working upstream)
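The 3-tier stealth browse (Playwright → Camoufox → Remote) is, structurally, a fallback chain: try each tier in order and keep the first non-blocked result. A generic sketch, with tier functions as placeholders rather than the real engines:

```javascript
// Generic fallback chain: each tier is { name, fetchPage }; return the first
// page that is not blocked, or an aggregate error when every tier fails.
async function browseWithFallback(url, tiers) {
  const errors = [];
  for (const { name, fetchPage } of tiers) {
    try {
      const page = await fetchPage(url);
      if (!page.blocked) return { ...page, tier: name };
      errors.push(`${name}: blocked`);
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  return { blocked: true, error: errors.join("; ") };
}
```

Site overrides would slot in before the loop, pinning a host to a specific tier (or to a fallback API) instead of walking the whole chain.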
- `src/browse/index.js`: browse engine, block detection, site overrides
- `src/crawl.js`: crawl engine v2
- `src/search/index.js`: search engine, deepSearch
- `src/server.js`: HTTP server (port 3900)
- `src/extract.js`: structured extraction
- `src/agent.js`: AI browser agent
- `src/proxy/index.js`: rotating proxy gateway
- `src/act/adapters/*.js`: 24 platform adapters
- `3a9f986`: v0.6.2, Reddit PullPush API fallback
- `4376a39`: v0.6.1, block detection + Amazon Jina fallback
- `e5cbb9d`: v0.6.0, extract, agent, network capture, sitemap, webhook
- `53b46bf`: README comprehensive rewrite
- Residential proxy for LinkedIn + Cloudflare sites (Smartproxy $7/GB recommended)
- Post content to platforms (drafts written, zero published)
- Streaming/SSE for long operations
- Agent `_getSimplifiedDOM` optimization
- Process GitHub competitor research (stagehand, crawl4ai, masa-finance/crawler)