Regular expressions to describe incoming search engine URLs?
January 5, 2006 9:56 AM   Subscribe

HtAccessFilter: I’m seeking regular expressions to describe URLs for the search engines’ image and video search services, but not the main engines themselves, so that I can block said image and video searches using “SetEnvIfNoCase Referer” in .htaccess. I’m also seeking to block all incoming requests for one particular URL which is popular with these video and image search engines but which no longer exists – but I don’t want to serve them the normal 404.
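For that second part, a status other than 404 can be sent directly from .htaccess. A minimal sketch using Apache's Redirect directive with the `gone` keyword, which answers with 410 Gone instead of the usual 404 (the path below is a placeholder, not the actual dead URL):

```apache
# Hypothetical path -- substitute the real URL that no longer exists.
# "gone" makes Apache answer 410 Gone rather than the normal 404.
Redirect gone /media/old-video.avi
```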

To be specific, I’m seeking to block most image and video search engines from my website, while not excluding the main search engines themselves (e.g., I want Google Images and all its regional variations blocked, but not Google itself; Yahoo Video Search, but not Yahoo; etc.). I managed to get a good variation for Google Images:

SetEnvIfNoCase Referer "^https?://(www\.)?images\.google\.(ae|at|be|ca|ch|cl|co\.hu|co\.il|co\.in|co\.jp|co\.kr|co\.nz|co\.th|co\.uk|co\.za|com|com\.ar|com\.au|com\.br|com\.fn|com\.gr|com\.hk|com\.mx|com\.my|com\.ph|com\.pr|com\.ru|com\.sg|com\.tr|com\.tw|com\.ua|de|dk|fi|fr|gr|ie|it|lv|nl|pl|pt|ro|se|sk)" DumbSearchEngine=1

However, I did that not through my own very bad knowledge of regular expressions, but by emulating what I saw elsewhere. I just don't have the know-how, or an adequate reference, to work out patterns for the other search engines.

I’m now hoping others have worked out similar ways of describing and/or blocking things like AltaVista Video, Yahoo Video Search, and so on, without blocking Google, AltaVista, and Yahoo themselves. I also do not want to go too wide by blocking anything with ‘images’ or ‘video’ in the name itself, for example.

I then can direct them to a Forbidden error, so that they never even hit my StatCounter, which is the ultimate goal. I'm not a heavy multimedia website -- for some reason, if I am linking to an MP3 or AVI on another website, Yahoo, AltaVista, Google are all deciding that I'M hosting the image and linking their image- or video-search to me. Bastards.
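To actually return that Forbidden response once the environment variable is set, the SetEnvIfNoCase line can be paired with Apache's deny-by-environment-variable mechanism. A minimal sketch using the Apache 2.2-era Order/Deny syntax that matches the .htaccess idioms in this thread:

```apache
# Flag requests whose referrer is a Google Images property.
SetEnvIfNoCase Referer "^https?://(www\.)?images\.google\." DumbSearchEngine=1

# Deny any request flagged above; everyone else is allowed through.
Order Allow,Deny
Allow from all
Deny from env=DumbSearchEngine
```

The denied visitor gets Apache's bare 403 Forbidden page, which never loads the StatCounter snippet.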

Thanks.
posted by WCityMike to Computers & Internet (6 answers total)
 
Checking the "referer" field isn't going to do a thing for stopping spiders. They don't have a referer. What it would stop is someone clicking on a link found in the results of a search spider, but it won't stop the spider itself.

You should use the proper method for this, which is robots.txt. More info here. All the major search engines are well-behaved and will follow robots.txt exclusions, so that is the proper way to keep spiders out. And since (for Google at least) the main spider and the images spider have different user-agents, it's the perfect way to discriminate.
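As a sketch of what that looks like: Google's image spider identifies itself as Googlebot-Image, so a robots.txt along these lines shuts out image indexing while leaving the main Googlebot alone (other engines' image/video crawler user-agents would need to be looked up individually):

```
# Block Google's image crawler entirely...
User-agent: Googlebot-Image
Disallow: /

# ...but let the main crawler index everything.
User-agent: Googlebot
Disallow:
```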

BTW, the error page that you send to a spider should be completely irrelevant -- all it sees is the 404; it doesn't care what the body of the page has.
posted by Rhomboid at 10:13 AM on January 5, 2006


[a bit crossly] Yes, but that's not what I asked.

I'm handling robots.txt fine. The problem is that a lot of search engines don't seem to update their spiders' behavior for weeks or months at a time. But I have addressed that core problem, yes, and we'll see if the respective search engines ever pay attention to the requests I already made through their proper channels.

In the meantime, I WANT to stop someone clicking on a link found in the results of a search spider, and in the manner described, because it throws them to an Apache Forbidden page that never loads my StatCounter, thus preventing StatCounter from counting the hit or registering the referral in its logs.

Continuing on ... anyone?
posted by WCityMike at 10:29 AM on January 5, 2006


Why do you need all the suffixes? Isn't images.google.* equally valid and easier/shorter?
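Something like this, perhaps (a sketch -- since SetEnvIfNoCase takes a regex and the match is unanchored on the right, escaping the dots and dropping the country-code list entirely should cover every regional variant):

```apache
# Matches images.google.com, images.google.co.uk, images.google.de, etc.
SetEnvIfNoCase Referer "^https?://(www\.)?images\.google\." DumbSearchEngine=1
```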

To block my images from coming up on other sites (myspace is a scourge to my bandwidth, as is google image search, because I have album covers for CDs we review), I do the opposite of what you're doing, but you should be able to interpret the idea...:
RewriteEngine On

RewriteCond %{REQUEST_FILENAME} .*jpg$|.*gif$|.*png$ [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !openingbands\.com [NC]
RewriteCond %{HTTP_REFERER} !www\.openingbands\.com [NC]
RewriteCond %{HTTP_REFERER} !66.225.227.18 [NC]

RewriteRule (.*) /nohotlink.gif
The ! before the condition means NOT, so you'd want to omit it: instead of "if the referrer is NOT me," you'd say "if the referrer IS images.google.*," and then rewrite.


This of course assumes that you have mod_rewrite installed.
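Adapted to what the asker wants, the inverted version might look like this sketch: match image/video search referrers and answer with a hard 403 via the [F] flag. (The hostname patterns here are guesses at the engines' image/video domains and would need checking against real referrer logs.)

```apache
RewriteEngine On

# If the referrer is one of these image/video search properties...
# (hostnames are assumptions -- verify against your own logs)
RewriteCond %{HTTP_REFERER} images\.google\. [NC,OR]
RewriteCond %{HTTP_REFERER} video\.search\.yahoo\.com [NC,OR]
RewriteCond %{HTTP_REFERER} altavista\.com [NC]

# ...refuse the request outright with 403 Forbidden.
RewriteRule .* - [F]
```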
posted by twiggy at 11:13 AM on January 5, 2006


Apologies. I should add that the second condition line basically means "if the file being requested is a jpg, gif, or png file"... because I'm just trying to block remote image hotlinking. You don't need that line.
posted by twiggy at 11:23 AM on January 5, 2006


Just so you know, twiggy, that first condition could be rewritten as:

RewriteCond %{REQUEST_FILENAME} \.(jpg|gif|png)$ [NC]

I don't know if it's faster or not, but it's certainly easier on the pattern matcher without all those (.*)'s.
posted by sbutler at 11:54 AM on January 5, 2006


Good call sbutler.. thanks... When I needed it, I just copied/pasted code in a fit of laziness (found via google), and even though I understand it (programmer here...) I didn't really bother to consider ways of improving it :-)
posted by twiggy at 11:54 PM on January 5, 2006

