Mr0grog commented Dec 5, 2025

We hit a lot of Cloudflare challenges on ehp.niehs.nih.gov last month, which caused problems, and we occasionally hit them on other sites too. Like AWS WAF challenges, these are not legitimate captures of the page and should not be treated as changed content. This adds heuristics for them to the `maybe_bad_capture()` method so we don’t use them in changes.
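
For readers unfamiliar with what such a heuristic can look like, here is a rough sketch. It is not the code from this PR: the function name `looks_like_cloudflare_challenge` and the specific signals it checks (a `cf-mitigated: challenge` response header, a `Server: cloudflare` header with a 403 or 503 status, a "Just a moment" title, and a `/cdn-cgi/challenge-platform/` script path) are assumptions for illustration only.

```python
import re
from typing import Mapping

# Hypothetical markers, assumed for illustration. They are NOT taken from
# this PR's implementation of maybe_bad_capture().
CHALLENGE_TITLE = re.compile(r"<title>\s*Just a moment", re.IGNORECASE)
CHALLENGE_SCRIPT = "/cdn-cgi/challenge-platform/"


def looks_like_cloudflare_challenge(
    status: int, headers: Mapping[str, str], body: str
) -> bool:
    """Guess whether a captured response is a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalize before comparing.
    normalized = {key.lower(): value.lower() for key, value in headers.items()}

    # Strongest signal (assumed): an explicit challenge marker header.
    if normalized.get("cf-mitigated") == "challenge":
        return True

    # Otherwise require a Cloudflare-served error status plus body markers,
    # so ordinary pages that merely sit behind Cloudflare are not flagged.
    if "cloudflare" not in normalized.get("server", ""):
        return False
    if status not in (403, 503):
        return False
    return bool(CHALLENGE_TITLE.search(body)) or CHALLENGE_SCRIPT in body
```

Requiring several signals to agree before flagging a capture keeps false positives down, at the cost of missing challenge pages that use a different status code or markup.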

For more on the Cloudflare challenges, see:
- edgi-govdata-archiving/web-monitoring#189
- edgi-govdata-archiving/web-monitoring-crawler#32

Mr0grog merged commit c1936f5 into main Dec 15, 2025
3 checks passed
Mr0grog deleted the cloudflare-challenges-are-not-the-webpages-you-are-looking-for branch December 15, 2025 00:20
github-project-automation bot moved this from Inbox to Done in Web Monitoring Dec 15, 2025