Regular expressions to describe incoming search engine URLs?
January 5, 2006 9:56 AM
HtAccessFilter: I’m seeking regular expressions that match the URLs of search engines’ image and video search services, but not the main search engines themselves, so that I can block those services using “SetEnvIfNoCase Referer” in .htaccess. I’m also seeking to block all incoming requests for one particular URL that is popular with these video and image search engines but no longer exists, without serving them the normal 404.
To be specific, I’m seeking to block most image and video search engines from my website, while not excluding the main search engines themselves (e.g., I want Google Images and all its regional variations blocked, but not Google itself; Yahoo Video Search, but not Yahoo; etc.). I managed to get a good variation for Google Images:
SetEnvIfNoCase Referer "^https?://(www\.)?images\.google\.(ae|at|be|ca|ch|cl|co\.hu|co\.il|co\.in|co\.jp|co\.kr|co\.nz|co\.th|co\.uk|co\.za|com|com\.ar|com\.au|com\.br|com\.fn|com\.gr|com\.hk|com\.mx|com\.my|com\.ph|com\.pr|com\.ru|com\.sg|com\.tr|com\.tw|com\.ua|de|dk|fi|fr|gr|ie|it|lv|nl|pl|pt|ro|se|sk)" DumbSearchEngine=1
However, I did that not through my own very bad knowledge of regular expressions, but by emulating what I saw elsewhere. I just don't have the know-how or the adequate reference to stave off the other search engines, really.
I’m now hoping others have worked out similar ways of describing and/or blocking things like AltaVista Video, Yahoo Video Search, and so on, without blocking Google, AltaVista, and Yahoo themselves. I also do not want to go too wide by blocking anything with ‘images’ or ‘video’ in the name itself, for example.
I then can direct them to a Forbidden error, so that they never even hit my StatCounter, which is the ultimate goal. I'm not a heavy multimedia website -- for some reason, if I am linking to an MP3 or AVI on another website, Yahoo, AltaVista, Google are all deciding that I'M hosting the image and linking their image- or video-search to me. Bastards.
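For reference, a minimal sketch of the "direct them to a Forbidden error" step, assuming Apache 2.2-style access directives are allowed in this .htaccess (the host must permit AllowOverride Limit). It relies only on the DumbSearchEngine variable set by the SetEnvIfNoCase line above; newer Apache versions use a different Require-based syntax, so treat this as an illustration rather than a drop-in:

```apache
# Assumption: Apache 2.2-style access control (Order/Allow/Deny).
# Allow everyone by default, then deny any request that was
# flagged with DumbSearchEngine by a SetEnvIfNoCase Referer rule.
Order Allow,Deny
Allow from all
Deny from env=DumbSearchEngine
```

With Order Allow,Deny, a request matching both an Allow and a Deny directive is denied, so flagged referrers get a 403 Forbidden while everyone else passes.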
Thanks.
posted by WCityMike to computers & internet (6 comments total)
You should use the proper method for this, which is robots.txt. More info here. All the major search engines are well-behaved and will follow robots.txt exclusions, so that is the proper way to keep spiders out. And since (for Google at least) the main spider and the images spider have different user-agents, it's the perfect way to discriminate.
BTW the error page that you send to a spider should be completely irrelevant: all it sees is the 404, and it doesn't care what the body of the page has.
posted by Rhomboid at 10:13 AM on January 5, 2006
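On the question's second part (a removed URL that should not get the normal 404), one possible sketch uses mod_alias to answer with 410 Gone instead. The path /old-media-page.html is a hypothetical placeholder, not the poster's actual URL, and this assumes mod_alias is enabled and usable from .htaccess:

```apache
# Hypothetical path: replace with the URL that no longer exists.
# "gone" makes Apache return 410 Gone, so no target URL is given.
Redirect gone /old-media-page.html
```

A 410 tells crawlers the resource was removed on purpose, which is a different signal from a plain 404.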