How to Detect and Block Abusive Web Crawlers

By Sam Bazzi, technologist

Ever wondered how to identify the IPs with the most hits on your web server? Perhaps you want to find the most active human users of your website(s), or abusive web robots. In either case, the answer lies in your web server's access log file! Here are the Linux/Unix commands I use periodically to detect digital culprits or enthusiastic users:

> cat <access-log-filename> | cut -d" " -f1 | sort -n | uniq -c | sort -rn | head -n 10

These piped Linux/Unix commands give me a list of the ten IPs with the most hits in the access log file, sorted by hit count. This assumes the client IP is the first space-separated field of each log line, as it is in the common Apache and nginx log formats. I can then run the whois command on each IP to determine whether it belongs to a legitimate visitor (e.g., Google’s robots) or not (unwanted crawlers). To block an offending IP, you can use the iptables command (as root):

> iptables -I INPUT -j DROP -s <ip-address>
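
If you run this check often, the two steps above can be wrapped in a couple of small shell functions. What follows is only a sketch, not a polished tool: the function names (count_hits, top_ips, print_block_cmds) and the hit threshold are my own illustrative choices, and it again assumes the client IP is the first field of each log line. Note that it only prints the iptables commands rather than running them, so you can still check each IP with whois before blocking anything.

```shell
# count_hits LOGFILE
# Prints one "hits ip" line per client IP, busiest first.
# Assumption: the client IP is the first whitespace-separated field.
count_hits() {
    awk 'NF { hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' "$1" |
        sort -rn
}

# top_ips LOGFILE [COUNT]
# Prints the COUNT busiest IPs (default 10).
top_ips() {
    count_hits "$1" | head -n "${2:-10}"
}

# print_block_cmds LOGFILE THRESHOLD
# Prints (does NOT execute) an iptables command for every IP with more
# than THRESHOLD hits. The threshold is a hypothetical cut-off; pick one
# that suits your traffic.
print_block_cmds() {
    count_hits "$1" |
        awk -v min="$2" '$1 > min { print "iptables -I INPUT -j DROP -s " $2 }'
}
```

For example, after loading these functions into your shell, top_ips <access-log-filename> 5 lists the five busiest IPs, and print_block_cmds <access-log-filename> 1000 prints a block command for every IP with more than 1000 hits. Review the output, then run the commands you agree with as root.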

Enjoy!