Is there a blacklist of misbehaving web robots?
October 2, 2004 7:36 AM

Is there a blacklist of misbehaving web robots? [MI]

A site I'm responsible for has recently been visited by robots that are either badly written (requesting files from my server that are clearly a list of files from someone else's) or ignoring my robots.txt file. So far it's been only minor annoyances, but I'm hoping there's a way to ban known misbehavers pre-emptively before something more major happens.
posted by gimonca to Computers & Internet (4 answers total)
 
Well, here's something I might try.

Add a script to your robots.txt and disallow it. Call it "/dont/load/this/up/or/be/banned/forever.php". Maybe put it at the top of your robots.txt. No well-behaved robot will ever request a disallowed path, so have the script insert a line into your firewall blocking any IP address that calls it. Problem solved, crisis averted. :-)

In fact, I might just do that for my servers. Hmmm...
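Something along these lines, perhaps (a rough, untested PHP sketch; the ban-list path and the cron-job hookup are placeholders for whatever fits your setup):

    <?php
    // forever.php -- the trap script. List its path as Disallowed in
    // robots.txt, e.g.:
    //
    //   User-agent: *
    //   Disallow: /dont/load/this/up/or/be/banned/forever.php
    //
    // No well-behaved robot ever requests a disallowed path, so anything
    // that does is fair game for a ban.

    $ip = $_SERVER['REMOTE_ADDR'];

    // Append the offender to a ban list; a root cron job can read this
    // file and feed it to the firewall (iptables, ipfw, whatever you run).
    $fh = fopen('/var/www/data/banned_ips.txt', 'a');
    if ($fh) {
        fwrite($fh, $ip . ' ' . date('r') . "\n");
        fclose($fh);
    }

    header('HTTP/1.0 403 Forbidden');
    echo 'Goodbye.';
    ?>
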
posted by shepd at 8:06 AM on October 2, 2004


Here are a couple of helpful links that got me on the right track (although I've been bad about keeping the list up to date):

1. diveintomark.org: How to block spambots, ban spybots, and tell unwanted robots to go to hell

2. That article in turn points to Webmasterworld: A Close to perfect .htaccess ban list (in the last post on page 25, there are links to two newer threads on the subject)

Hope that helps! I certainly feel your pain and frustration.
posted by misterioso at 8:52 AM on October 2, 2004


Expanding on shepd's comment, it might be advantageous to use an .htaccess redirect to send any request for robots.txt directly to the script that updates the blocked-robots list. It's probably easier to turn this into a whitelist and selectively unban any robots you want to let through. Unfortunately, as you've noticed, a great number of them ignore robots.txt or never bother requesting it at all.
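For instance (a sketch of the idea only; the filenames and whitelist entries are just examples):

    <?php
    // robots.php -- served in place of robots.txt via mod_rewrite in
    // .htaccess, e.g.:
    //
    //   RewriteEngine On
    //   RewriteRule ^robots\.txt$ /robots.php [L]
    //
    // Whitelisted robots get a normal robots.txt; anyone else gets logged
    // for the ban script to pick up.

    $whitelist = array('Googlebot', 'Slurp', 'msnbot');  // robots to let through

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $trusted = false;
    foreach ($whitelist as $good) {
        if (stristr($ua, $good) !== false) {
            $trusted = true;
            break;
        }
    }

    if (!$trusted) {
        // Unknown requester: record IP and UserAgent for later banning.
        error_log($_SERVER['REMOTE_ADDR'] . ' ' . $ua . "\n", 3,
                  '/var/www/data/robot_log.txt');
    }

    // Serve a plain robots.txt either way so legit crawlers keep working.
    header('Content-Type: text/plain');
    echo "User-agent: *\n";
    echo "Disallow: /cgi-bin/\n";
    ?>
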

For the past few years, I've been maintaining a list of robots and/or questionable UserAgents I've seen hit my domains. This list is categorized by type (search engines, e-mail harvesters, frequently abusive dev tools, etc.) and fed into a PHP script which I include at the top of every page (or more accurately, at the top of the template which drives every page). If the UserAgent requesting a page is on the list, the script prevents any output whatsoever; the robot sees nothing but a blank page.
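A stripped-down sketch of that approach (this is not the actual script, and the UserAgent entries are only examples of the sort of thing on the list):

    <?php
    // blockua.php -- include()'d at the top of every page template.
    $badAgents = array(
        // e-mail harvesters
        'EmailSiphon', 'EmailWolf', 'ExtractorPro',
        // offline downloaders and frequently abused dev tools
        'HTTrack', 'WebZIP', 'WebCopier',
    );

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    foreach ($badAgents as $bad) {
        if (stristr($ua, $bad) !== false) {
            exit;  // matched a known-bad agent: send nothing at all
        }
    }
    ?>

The template then does include 'blockua.php'; before producing any other output, so the exit happens before a single byte is sent.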

Not only does this reduce bandwidth usage, but it also prevents any scrapeable data from being entered into their databases. I've seen zero performance hit from including said script in my pages.

I've been meaning to clean it up and release the script to the public. If you have any interest and can wait a few hours, lemme know.
posted by Danelope at 6:18 PM on October 2, 2004


Well, one way or another, said script (clientblock) can be downloaded here.
posted by Danelope at 9:48 PM on October 2, 2004

