How do I block Googlebot and other bots from indexing a certain part of a webpage?
June 9, 2008 7:04 PM


I run an online music magazine. On every article, show review, and feature, we use one column of the page to list 15-20 upcoming shows with venue information. Unfortunately, because that text repeats across pages, Google inevitably decides each page is about the 15-20 listed shows (particularly the repeated venue information) rather than about the article, show review, or feature.

We're now using other techniques to emphasize the content in the articles (title tags, header tags, meta description/keywords, bolding) but I'd really like it if I could exclude the column with the upcoming shows and venue information from being indexed. If that could be achieved, Google might instead focus on the true content and the keywords inside!

Any ideas?
posted by jrholt to Computers & Internet (16 answers total)
 
robots.txt
posted by knave at 7:12 PM on June 9, 2008


Apparently, you can just add a META tag (I didn't know they were relevant) in your <HEAD> section:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Another way, if your host supports it, is through .htaccess
posted by spiderskull at 7:12 PM on June 9, 2008


Response by poster: Robots.txt and .htaccess seem valid only if you want to block bots from indexing your *entire* page. I want to exclude only a certain part of the page.

Am I missing something?
posted by jrholt at 7:20 PM on June 9, 2008


You can't do that with just robots.txt or the .htaccess file. The only way to do it is to serve two versions of your page. In the page's server-side code, you inspect the request's User-Agent header, and if it identifies itself as Googlebot, you don't output certain parts of the page. This is called cloaking.
posted by McSly at 7:27 PM on June 9, 2008


" I want to exclude only a certain part of the page. Am I missing something?"

Don't serve that part of the page to Google. The address ranges and User-Agent headers are easily recognizable; just serve up that section of the page conditionally.
posted by majick at 7:27 PM on June 9, 2008
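The conditional-serving idea above can be sketched as a small User-Agent check. This is a minimal sketch, not a definitive implementation: the function name and bot patterns are illustrative assumptions, and (as another answer in this thread warns) Google treats cloaking harshly, so use with care.

```javascript
// Hypothetical sketch: decide server-side whether to render the
// shows sidebar, based on the request's User-Agent string.
// The bot patterns below are illustrative, not an exhaustive list.
function isSearchBot(userAgent) {
  const botPatterns = [/Googlebot/i, /msnbot/i, /Slurp/i];
  return botPatterns.some((re) => re.test(userAgent || ""));
}

// In a request handler you might then do something like:
// if (!isSearchBot(req.headers["user-agent"])) { renderShowsSidebar(); }
```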


Just wondering: would using "nofollow" on the pages with the show lists (or rel=nofollow on the show links) reduce the effect those links have on the page's overall rank?
posted by Lazlo at 7:33 PM on June 9, 2008


Oh I see, I misunderstood the question. Does anyone here know if the Google bots run Javascript at all? Because if they don't, you can look into adding the show information via a client-side include.

Scroll down about midway on this page, which shows how you can include other files. Then make the included file blocked via .htaccess or robots.txt.
posted by spiderskull at 7:39 PM on June 9, 2008


use javascript to write the non-indexable info to the page on the fly! the googlebot won't get that.
posted by soma lkzx at 8:15 PM on June 9, 2008
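The JavaScript-injection approach might look like the sketch below. This is a minimal sketch under assumptions: `renderShows`, the data shape, and the element id are all illustrative, and the show data would be fetched from a URL that robots.txt disallows.

```javascript
// Hypothetical sketch of a client-side include: build the sidebar
// markup in the browser, so crawlers that don't execute JavaScript
// never see it in the page source.
function renderShows(shows) {
  return shows
    .map((s) => "<li>" + s.artist + " at " + s.venue + "</li>")
    .join("");
}

// In the page, after fetching `data` (e.g. via XMLHttpRequest from a
// robots.txt-disallowed URL):
// document.getElementById("shows").innerHTML = renderShows(data);
```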


Best answer: I'm pretty sure that there's no way to accomplish this that will ultimately make Google happy with you. You might find a method that'll work for a time, but the bot is ever-evolving, and when Google finds that you're trying to hide things from Googlebot, I gather they go a bit nuclear on your pagerank under the assumption that you have ulterior motives. You don't want that to happen, obviously. Better if you continue the white hat methods you're currently using.

Googlebot makes some assumptions about content importance based on its position within the document. Closer to the top = more topical, closer to the bottom = not so much. So, if you can make the column of gigs one of the very last things in your html, and then position it where you want it via CSS, that might help to reduce its importance in the eyes of Googlebot.
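The source-order trick might look something like this sketch; the ids, widths, and positioning values are illustrative assumptions, not a prescribed layout.

```html
<!-- Article first in source order, shows column last; CSS moves the
     column to the right visually. All names and sizes are illustrative. -->
<div id="wrapper" style="position: relative;">
  <div id="article" style="margin-right: 220px;">Article text…</div>
  <div id="shows" style="position: absolute; top: 0; right: 0; width: 200px;">
    Upcoming shows…
  </div>
</div>
```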

If you're not already, get hooked up with Google Webmaster Tools and start submitting site maps, to help the bot find its way around. Also read the Google Webmaster Central Blog.

When I was running Google Ads, there was a way to assign page areas different weights, to help the AdSense engine determine what your page was most about, so that it could more closely target the ads. I'm not sure if Googlebot pays any attention to that or not, but it might be worth investigating.

If all else fails, you might consider presenting the textual upcoming show content in a non-text format that the bot can't parse. Graphic, Flash, etc. It's certainly not ideal, but you could develop something to automate image generation server-side.
posted by mumkin at 8:48 PM on June 9, 2008


Here's what my robots.txt file looks like:

User-Agent: *
Disallow: /

User-Agent: Googlebot
Disallow: /magicdir1/
Disallow: /magicdir2/
Allow: /

User-Agent: MSNBot
Disallow: /magicdir1/
Disallow: /magicdir2/
Allow: /

User-Agent: AskJeeves
Disallow: /magicdir1/
Disallow: /magicdir2/
Allow: /

The first two lines tell all compliant bots to stay out; the groups that follow override that for the three bots I explicitly permit. ("AskJeeves" is Ask.com.) You can have as many Disallow lines as you want, followed by the "Allow: /", and the bot in question will avoid all the disallows and crawl everything else. (Allow isn't part of the original robots.txt standard, but the major crawlers honor it.)

The path is from the web root. On my Linux server, that's "/home/groups/home/web". So if the above were interpreted literally, it would exclude /home/groups/home/web/magicdir1/ and /home/groups/home/web/magicdir2/ but permit Googlebot to see everything else in the web directory.

It isn't a perfect solution. Many bots ignore the robots.txt file entirely. Oddly, some of them read it, and then ignore it. But my experience is that those three are well behaved, because I've stopped getting search hits on the directories I excluded this way from all three of them.
posted by Class Goat at 9:03 PM on June 9, 2008 [1 favorite]


By the way, the trailing slash on the disallow lines is essential.
posted by Class Goat at 9:04 PM on June 9, 2008


"I'm pretty sure that there's no way to accomplish this that will ultimately make Google happy with you."

Not so. Check soma lkzx's answer, which will work flawlessly until googlebot starts executing Javascript (don't count on it). That's the method I'd recommend.
posted by toomuchpete at 9:16 PM on June 9, 2008


Use CSS magic. You can float content to any part of the page, so the HTML could physically have the article content first and the show listings second, while the listings still appear in the right-hand column. Nothing an average web designer can't hack together for you in an afternoon.
posted by bprater at 9:19 PM on June 9, 2008


Best answer: Another solution would be to do the non-indexable stuff as an IFrame and load the IFrame from a URL that is excluded from Googlebot.
posted by mmascolino at 9:49 PM on June 9, 2008
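A sketch of the iframe approach, with illustrative paths (the directory and file names are assumptions): the shows column lives at its own URL, the article page embeds it, and robots.txt blocks the embedded URL.

```html
<!-- In the article page; the src path is illustrative. -->
<iframe src="/sidebar/upcoming-shows.html" width="200" height="600"
        title="Upcoming shows"></iframe>
```

And the matching robots.txt entry, so the embedded URL itself is never crawled:

```
User-agent: *
Disallow: /sidebar/
```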


Seconding mmascolino's suggestion. This is the cleanest, in my opinion.
posted by zippy at 12:41 AM on June 10, 2008


Thirding the iframe answer, because it's the simplest answer that degrades gracefully, without getting into server-side IP-sniffing wizardry.
posted by Smoosh Faced Lion at 1:27 PM on June 10, 2008

