Public Blacklist for malicious bots & scrapers

A curated list of more than 17,600 IP addresses identified as bots, scrapers, and malicious actors. Updated every week.
Start protecting your content!


About This Project

This blacklist is generated by a proprietary Bot Detection System that monitors and analyzes web traffic from a panel of major European publishers, processing approximately 500 million page views per month across multiple domains.

Why is the source code not public?

The Bot Detector source code, detection rules, and behavioral patterns are intentionally kept private.

If we published the detection algorithms and scoring rules, malicious bots could easily analyze them and implement countermeasures to evade detection. By keeping the detection logic confidential, we maintain the effectiveness of the system against sophisticated bot operators who continuously adapt their techniques.


What we share publicly:

  • The resulting blacklist of detected malicious IPs
  • Multiple formats for easy integration with your infrastructure
  • Statistics and metadata about detected threats

What remains private:

  • Detection algorithms and scoring logic
  • Behavioral analysis rules
  • Pattern matching configurations
  • Traffic analysis methodologies

Overview

The Bot Detector system analyzes traffic patterns and behaviors to identify:

Known Bots (Self-Identified)

These bots explicitly identify themselves via User-Agent:

| Category | Description |
| --- | --- |
| Known Bot (AI Crawler) | GPTBot, ClaudeBot, ChatGPT-User, PerplexityBot, Bytespider |
| Known Bot (SEO) | SemrushBot, AhrefsBot, MJ12bot, DotBot, DataForSeoBot |
| Known Bot (Scraper) | CCBot, Scrapy, Diffbot, news-please |
| Known Bot (Search Engine) | Yandex, Sogou, Baidu, PetalBot |
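
As an illustration of how self-identified bots can be recognized, here is a minimal sketch that maps a User-Agent string to one of the categories above. The token lists are copied from the table; the function name `classify_user_agent` and the case-insensitive substring matching are assumptions made for this example, not the project's private detection logic.

```python
from typing import Optional

# Tokens copied from the table above, grouped by category.
KNOWN_BOTS = {
    "Known Bot (AI Crawler)": ["GPTBot", "ClaudeBot", "ChatGPT-User", "PerplexityBot", "Bytespider"],
    "Known Bot (SEO)": ["SemrushBot", "AhrefsBot", "MJ12bot", "DotBot", "DataForSeoBot"],
    "Known Bot (Scraper)": ["CCBot", "Scrapy", "Diffbot", "news-please"],
    "Known Bot (Search Engine)": ["Yandex", "Sogou", "Baidu", "PetalBot"],
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the first category whose token appears in the User-Agent, else None."""
    ua = user_agent.lower()
    for category, tokens in KNOWN_BOTS.items():
        # Hypothetical matching rule: case-insensitive substring check.
        if any(token.lower() in ua for token in tokens):
            return category
    return None
```

A User-Agent is trivially spoofed, so a check like this only catches bots that choose to identify themselves.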

Behavior-Based Detection

These are detected through traffic analysis without explicit identification:

| Category | Description |
| --- | --- |
| Vulnerability Scanner | IPs probing for .env, wp-admin, phpinfo, credentials |
| Content Theft | Illegal scraping with burst patterns on pagination/archives |
| Archive Scraper | Systematic scraping of old articles via deep pagination |
| Image Scraper | High percentage of image requests without referrers |
| Aggressive Scanner | Very high request volume (>80 RPM) |
| DDoS Source | Extremely high request volume (>200 RPM) |
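
The two volume-based categories can be illustrated with a small sketch. Only the thresholds (>80 RPM and >200 RPM) come from the table above. The 60-second sliding window and the names `peak_rpm` and `volume_category` are assumptions for this example; the real scoring rules are private and may measure request rate differently.

```python
from typing import Iterable, Optional

AGGRESSIVE_RPM = 80  # table: "Aggressive Scanner" is >80 RPM
DDOS_RPM = 200       # table: "DDoS Source" is >200 RPM


def peak_rpm(timestamps: Iterable[float]) -> int:
    """Largest number of requests seen inside any 60-second sliding window.

    `timestamps` are request times in seconds for a single IP.
    """
    ordered = sorted(timestamps)
    best = 0
    start = 0
    for end, ts in enumerate(ordered):
        # Drop requests that are 60 seconds or more older than the current one.
        while ts - ordered[start] >= 60.0:
            start += 1
        best = max(best, end - start + 1)
    return best


def volume_category(timestamps: Iterable[float]) -> Optional[str]:
    """Map a single IP's request times to a volume category, or None."""
    rpm = peak_rpm(timestamps)
    if rpm > DDOS_RPM:
        return "DDoS Source"
    if rpm > AGGRESSIVE_RPM:
        return "Aggressive Scanner"
    return None
```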

Infrastructure-Based Detection

| Category | Description |
| --- | --- |
| Proxy/VPN Abuse | Traffic from detected proxy/VPN services |
| Hosting/Cloud Bot | Automated traffic from cloud/hosting infrastructure |

Available Formats

JSON (Complete Data)

Contains full metadata including score, country, organization, category, and reason.

curl -O https://raw.githubusercontent.com/lula73/bot-detector/master/blacklist.json

Nginx

# Download
curl -O https://raw.githubusercontent.com/lula73/bot-detector/master/nginx/deny.conf

# Include in your nginx.conf or site config
include /etc/nginx/deny.conf;

# Reload nginx
sudo nginx -t && sudo nginx -s reload

Apache

# Download
curl -O https://raw.githubusercontent.com/lula73/bot-detector/master/apache/.htaccess

# Include in your httpd.conf or use directly in web root
# Apache 2.4+ required

iptables

# Download and execute
curl -O https://raw.githubusercontent.com/lula73/bot-detector/master/iptables/rules.sh
chmod +x rules.sh
sudo ./rules.sh

Cloudflare

  1. Download cloudflare/ip_list.txt
  2. Go to Cloudflare Dashboard > Security > WAF > Tools
  3. Use "IP Access Rules" to bulk import

WordPress

// In your theme's functions.php or custom plugin
require_once('/path/to/blocked_ips.php');

Features of the WordPress integration:

  • Automatic IP detection (supports Cloudflare, proxies)
  • Returns 403 Forbidden with custom headers
  • Runs before any output (priority 1)

HAProxy

# In haproxy.cfg
frontend web_frontend
    acl blacklist src -f /etc/haproxy/blacklist.list
    http-request deny if blacklist
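
The HAProxy snippet reads addresses from `/etc/haproxy/blacklist.list`, but no download command for that file is given here. One way to produce it is to render it from the entries in `blacklist.json`. The sketch below is a hedged example: the function name `to_haproxy_list` is hypothetical, and it assumes each entry's `ip` field holds a single IPv4 or IPv6 address, so values that do not parse as one are skipped.

```python
import ipaddress
from typing import Iterable, List, Mapping


def to_haproxy_list(entries: Iterable[Mapping]) -> str:
    """Render blacklist entries as one address per line, as read by `src -f`."""
    seen = set()
    lines: List[str] = []
    for entry in entries:
        try:
            # Assumption: "ip" is a single address, not a CIDR range.
            addr = ipaddress.ip_address(entry["ip"])
        except (KeyError, ValueError):
            continue  # skip entries with a missing or malformed address
        if addr not in seen:
            seen.add(addr)
            lines.append(str(addr))
    return "\n".join(lines) + ("\n" if lines else "")
```

Writing the returned string to `/etc/haproxy/blacklist.list` and reloading HAProxy would then apply the list; `data['blacklist']` from the JSON file is the expected input.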

Automation

Cron Job (Update every 3 hours)

# Nginx example (-f stops curl from saving an HTTP error page; nginx -t validates before reloading)
0 */3 * * * curl -fsS https://raw.githubusercontent.com/lula73/bot-detector/master/nginx/deny.conf -o /etc/nginx/deny.conf && nginx -t && nginx -s reload

# iptables example
0 */3 * * * curl -fsS https://raw.githubusercontent.com/lula73/bot-detector/master/iptables/rules.sh | sudo bash

systemd Timer

Create /etc/systemd/system/update-blacklist.service:

[Unit]
Description=Update Bot Detector Blacklist

[Service]
Type=oneshot
ExecStart=/usr/bin/curl -fsS https://raw.githubusercontent.com/lula73/bot-detector/master/nginx/deny.conf -o /etc/nginx/deny.conf
ExecStartPost=/usr/sbin/nginx -t
ExecStartPost=/usr/sbin/nginx -s reload

Create /etc/systemd/system/update-blacklist.timer:

[Unit]
Description=Update blacklist every 3 hours

[Timer]
OnCalendar=*-*-* 0/3:00:00
Persistent=true

[Install]
WantedBy=timers.target

Enable: sudo systemctl enable --now update-blacklist.timer

Statistics

Check the stats/ directory for:

  • categories.json - Breakdown by bot category
  • countries.json - Geographic distribution

API Usage

You can also fetch the JSON directly in your application:

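import requests

URL = 'https://raw.githubusercontent.com/lula73/bot-detector/master/blacklist.json'

# Fail fast on network problems or HTTP errors instead of parsing an error page
response = requests.get(URL, timeout=30)
response.raise_for_status()
data = response.json()

# Entries live under the top-level "blacklist" key
for entry in data['blacklist']:
    print(f"{entry['ip']} - {entry['category']} - {entry['reason']}")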

Data Fields

Each entry in blacklist.json contains:

| Field | Type | Description |
| --- | --- | --- |
| ip | string | IP address |
| score | integer | Bot score (0-100) |
| country | string | Country code (ISO 3166-1 alpha-2) |
| organization | string | ISP/hosting provider |
| category | string | Bot category |
| reason | string | Human-readable block reason |
| first_seen | string | First detection date (YYYY-MM-DD) |
| last_seen | string | Last detection date (YYYY-MM-DD) |
| scan_count | integer | Number of scans/detections |
| is_permanent | boolean | Permanently blocked |
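
Because each entry carries a `score` and an `is_permanent` flag, a consumer can block only the entries it is most confident about. The following is a minimal sketch under that idea. The function name `select_ips` and the default threshold of 80 are hypothetical choices for this example, not values recommended by the project.

```python
from typing import Iterable, List, Mapping


def select_ips(
    entries: Iterable[Mapping],
    min_score: int = 80,          # hypothetical threshold, tune to your tolerance
    permanent_only: bool = False,
) -> List[str]:
    """Return the IPs of entries that meet the score and permanence filters."""
    selected: List[str] = []
    for entry in entries:
        # Entries without a score are treated as score 0 and therefore skipped.
        if entry.get("score", 0) < min_score:
            continue
        if permanent_only and not entry.get("is_permanent", False):
            continue
        selected.append(entry["ip"])
    return selected
```

For example, `select_ips(data['blacklist'], min_score=90, permanent_only=True)` would keep only high-scoring, permanently blocked addresses, which may reduce false positives at the cost of coverage.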

Contributing

Found a false positive? Please open an issue with:

  1. The IP address
  2. Your use case
  3. Any relevant logs

License

MIT License - See LICENSE file.

Disclaimer

This blacklist is provided as-is. False positives may occur. Always test in a staging environment before deploying to production.

Not affiliated with any of the bot operators mentioned.


Update Frequency: Every 3 hours
Source: Bot Detector v2.4
Last Generated: Check metadata.generated_at in blacklist.json
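
Since `metadata.generated_at` records when the list was last generated, an automated job can refuse to apply a copy that looks out of date. The sketch below assumes the value is an ISO 8601 timestamp, which is not confirmed here. The function name `is_stale` and the six-hour default are likewise hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    generated_at: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=6),  # hypothetical tolerance
) -> bool:
    """Return True when the list is older than `max_age`.

    `now` must be timezone-aware, e.g. datetime.now(timezone.utc).
    """
    # Assumption: ISO 8601 input. Older Python versions reject a trailing "Z",
    # so it is rewritten as an explicit UTC offset before parsing.
    parsed = datetime.fromisoformat(generated_at.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return now - parsed > max_age
```

A typical call would be `is_stale(data['metadata']['generated_at'], datetime.now(timezone.utc))`, skipping the update when it returns True.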
