- `detectBlockPage()` in `src/browse/index.js`: 15+ patterns covering Reddit, Amazon, LinkedIn, Cloudflare, Akamai, AWS WAF, Imperva, DataDome, PerimeterX, hCaptcha, reCAPTCHA, Google consent
- Content quality heuristic: flags <100 chars from known-large sites list as `suspected-block`
- Runs post-browse on every page, sets `blocked: true` + `blockType` + `blockDetail` on result
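
A minimal sketch of how the pattern pass and the short-content heuristic could fit together. The pattern table, the site list, and the `(url, text)` signature are illustrative assumptions, not the actual contents of `src/browse/index.js`.

```javascript
// Illustrative subset only; the real table reportedly has 15+ entries.
const BLOCK_PATTERNS = [
  { type: "cloudflare", re: /just a moment\.\.\.|checking your browser/i },
  { type: "recaptcha", re: /g-recaptcha/i },
  { type: "hcaptcha", re: /hcaptcha\.com/i },
];

// Hypothetical "known-large" list: sites that never legitimately return tiny pages.
const KNOWN_LARGE_SITES = ["reddit.com", "amazon.com", "linkedin.com"];

// Assumed signature: the page URL plus the extracted page text.
function detectBlockPage(url, text) {
  // Pass 1: explicit block-page signatures.
  for (const { type, re } of BLOCK_PATTERNS) {
    const match = text.match(re);
    if (match) {
      return { blocked: true, blockType: type, blockDetail: `matched "${match[0]}"` };
    }
  }

  // Pass 2: content quality heuristic for sites that should be large.
  const host = new URL(url).hostname;
  const isLarge = KNOWN_LARGE_SITES.some(
    (site) => host === site || host.endsWith("." + site)
  );
  const length = text.trim().length;
  if (isLarge && length < 100) {
    return {
      blocked: true,
      blockType: "suspected-block",
      blockDetail: `only ${length} chars from ${host}`,
    };
  }

  return { blocked: false };
}
```

Running the explicit patterns first means a short CAPTCHA page is reported with its specific `blockType` rather than the generic `suspected-block`.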
- `_getSiteOverride(url)` returns site-specific fallback function
- Runs BEFORE Playwright attempt: if fallback has content, skip browser entirely
- If fallback confirms blocked, return immediately with actionable error (don't waste time on Playwright)
- Current overrides: Reddit (PullPush API), Amazon (Jina Reader)
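
The fallback-first ordering could look roughly like the sketch below. Only the name `_getSiteOverride` comes from the notes; `getSiteOverride` here is a standalone stand-in, and `browseWithOverrides`, the `overrides` array shape, and the injected `browse` function are hypothetical.

```javascript
// Hypothetical lookup: first override whose domain matches the URL's host.
function getSiteOverride(url, overrides) {
  const host = new URL(url).hostname;
  const hit = overrides.find(
    (o) => host === o.domain || host.endsWith("." + o.domain)
  );
  return hit ? hit.fallback : null;
}

// `browse` stands in for the Playwright-backed path; it is injected so the
// sketch stays runnable without a browser.
async function browseWithOverrides(url, { overrides, browse }) {
  const fallback = getSiteOverride(url, overrides);
  if (fallback) {
    const result = await fallback(url);
    // Fallback produced content: skip the browser entirely.
    if (result && result.content) return { ...result, source: "fallback" };
    // Fallback confirmed the block: fail fast with an actionable error.
    if (result && result.blocked) {
      return {
        blocked: true,
        source: "fallback",
        error: `Fallback for ${url} confirmed a block; skipping browser attempt`,
      };
    }
  }
  const page = await browse(url);
  return { ...page, source: "browser" };
}
```

If the fallback returns neither content nor a confirmed block, the sketch falls through to the browser, which is one reasonable reading of the notes rather than confirmed behavior.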
- `api.pullpush.io`: free Reddit archive, no auth, not IP-blocked
- Parses Reddit URLs: `/r/{sub}`, `/r/{sub}/comments/{id}`, search queries
- Returns formatted markdown: titles, scores, authors, selftext, comments
- Limitation: archive data, not real-time. Good enough for research.
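
As a rough illustration of the URL parsing step, the sketch below maps the three Reddit URL shapes to archive queries. The function names are hypothetical, and the PullPush endpoint paths and parameter names (`subreddit`, `q`, `size`, `link_id`) are assumptions based on its Pushshift-style API; check them against the PullPush documentation before relying on them.

```javascript
// Classify a Reddit URL as a thread, a search, or a subreddit listing.
function parseRedditUrl(url) {
  const u = new URL(url);
  const path = u.pathname;

  const thread = path.match(/^\/r\/([^/]+)\/comments\/([a-z0-9]+)/i);
  if (thread) return { kind: "thread", sub: thread[1], id: thread[2] };

  const query = u.searchParams.get("q");
  if (query) {
    const scoped = path.match(/^\/r\/([^/]+)\/search/i);
    return { kind: "search", query, sub: scoped ? scoped[1] : null };
  }

  const sub = path.match(/^\/r\/([^/]+)\/?$/i);
  if (sub) return { kind: "subreddit", sub: sub[1] };

  return null; // Not a URL shape this sketch handles.
}

// Build the archive request URL (assumed endpoint layout, see lead-in).
function toPullPushUrl(parsed, size = 25) {
  const base = "https://api.pullpush.io/reddit/search";
  const params = new URLSearchParams();
  if (parsed.kind === "thread") {
    params.set("link_id", parsed.id);
    return `${base}/comment/?${params}`;
  }
  if (parsed.sub) params.set("subreddit", parsed.sub);
  if (parsed.kind === "search") params.set("q", parsed.query);
  params.set("size", String(size));
  return `${base}/submission/?${params}`;
}
```

Keeping parsing separate from request building makes the URL handling testable without touching the network.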
- `r.jina.ai/{url}`: renders page server-side, returns markdown
- Works for product pages that block with CAPTCHA
- Fallback result is used only when content > 100 chars and doesn't contain block strings
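
The acceptance check described above might be sketched as follows. The block-string list and both function names are hypothetical; only the `r.jina.ai/{url}` prefix form and the 100-character threshold come from the notes.

```javascript
// Hypothetical block markers; the real list is not shown in these notes.
const BLOCK_STRINGS = ["captcha", "robot check", "access denied"];

// Jina Reader is addressed by prefixing the target URL.
function jinaReaderUrl(url) {
  return `https://r.jina.ai/${url}`;
}

// Accept reader output only if it is long enough and free of block markers.
function isUsableFallback(markdown) {
  if (typeof markdown !== "string") return false;
  if (markdown.trim().length <= 100) return false;
  const lower = markdown.toLowerCase();
  return !BLOCK_STRINGS.some((marker) => lower.includes(marker));
}
```

The length check alone is not enough here, because a rendered CAPTCHA page can easily exceed 100 characters.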
- LinkedIn: unsolved from datacenter IPs. Every path tested and failed:
  - Direct browse: HTTP 999
  - Voyager API with valid cookies: 401 (IP fingerprinting)
  - Facebook/Googlebot UA: 317K CSS shell, zero content
  - Jina Reader: empty
  - No public archive API exists
- Needs residential proxy. Smartproxy ($7/GB) recommended.
- `src/proxy/index.js`: rotating proxy gateway already built
- Supports round-robin/random/least-used strategies
- Old proxy URL dead. Need new upstream.
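
For reference, the three selection strategies can be sketched in a few lines. This is an illustration of the idea, not the code in `src/proxy/index.js`; `createRotator` and the proxy URLs in the usage note are hypothetical.

```javascript
// Returns an object whose next() picks an upstream proxy per the strategy.
function createRotator(proxies, strategy = "round-robin") {
  if (!Array.isArray(proxies) || proxies.length === 0) {
    throw new Error("at least one proxy is required");
  }
  const uses = new Map(proxies.map((p) => [p, 0]));
  let cursor = 0;

  function pick() {
    switch (strategy) {
      case "round-robin":
        // Cycle through the list in order.
        return proxies[cursor++ % proxies.length];
      case "random":
        return proxies[Math.floor(Math.random() * proxies.length)];
      case "least-used":
        // Lowest use count wins; ties go to the earliest entry.
        return proxies.reduce((best, p) =>
          uses.get(p) < uses.get(best) ? p : best
        );
      default:
        throw new Error(`unknown strategy: ${strategy}`);
    }
  }

  return {
    next() {
      const proxy = pick();
      uses.set(proxy, uses.get(proxy) + 1);
      return proxy;
    },
    uses,
  };
}
```

Usage would be along the lines of `createRotator(["http://p1.example:8000", "http://p2.example:8000"], "least-used").next()`, with real upstream URLs substituted once a new provider is chosen.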
- Don't invent fake problems to sell the tool
- Agents recognize block pages — they don't "summarize garbage as real content"
- The real value is fallbacks that GET THE CONTENT, not just detecting blocks
- Posts should be first-person honest ("I had this problem, I fixed it")