How to Detect and Block Abusive Web Crawlers

By Sam Bazzi, technologist

Ever wondered how to identify the IPs with the most hits on your web server (in other words, your website)? Perhaps you want to identify the most active human users of your website(s), or spot abusive web robots. In either case, the answer, of course, lies in your web server's access log file! Here are the Linux/Unix commands I have been using to periodically detect digital culprits and enthusiastic users:

> cat <access-log-filename> | cut -d" " -f1 | sort -n | uniq -c | sort -rn | head -n 10
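
If you run this check regularly, it can be handy to wrap the one-liner in a tiny script. A minimal sketch, assuming an Apache-style combined log where the client IP is the first whitespace-separated field (the script name and log path below are just examples):

#!/bin/sh
# top-ips.sh -- print the ten busiest client IPs from an access log
# Usage: sh top-ips.sh /var/log/apache2/access.log
LOG="$1"

# Field 1 of a combined-format access log is the client IP;
# count occurrences and show the ten most frequent.
cut -d" " -f1 "$LOG" | sort | uniq -c | sort -rn | head -n 10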

These piped Linux/Unix commands provide me with a sorted list of IPs with the most hits as registered in the access log file. I can then run the whois command on the IPs to determine whether they are legitimate visitors (e.g., Google’s robots) or not (unwanted crawlers). To block offending IPs, you can use the iptables command:

> iptables -I INPUT -j DROP -s <ip-address>
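
To make the manual whois review quicker, the two steps can be strung together: pull the top IPs, then look up who owns each one before deciding what to DROP. A rough sketch, assuming the same example log path as above and that your whois output includes OrgName/netname/descr lines (field names vary by regional registry):

#!/bin/sh
# review-top-ips.sh -- show heavy hitters with their whois ownership
# so you can decide which ones to block with iptables.
LOG="/var/log/apache2/access.log"   # assumed path; adjust for your server

cut -d" " -f1 "$LOG" | sort | uniq -c | sort -rn | head -n 10 |
while read COUNT IP; do
    echo "== $IP ($COUNT hits) =="
    # These fields usually identify the network owner, but the exact
    # output format differs between registries.
    whois "$IP" | grep -iE "orgname|netname|descr" | head -n 3
done

Once you are confident an IP is abusive, drop it with the iptables rule above. Keep in mind that rules added this way do not persist across reboots unless you save them (for example with iptables-save).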

Enjoy!

  • Nice post! I recently set up a small blog, checked the access log, and was shocked at the amount of traffic hammering the site trying to guess the password. Your one-liner was perfect, just what I needed to sort the log and quickly banish the malbot! Thanks!

  • Drake

    Thank you, nice and simple. I did need to adjust one thing: on my Debian servers I had to change the quotes from double to single, i.e., cut -d" " to cut -d' '.