How do I block Googlebot and other bots from indexing a certain part of a webpage?
June 9, 2008 7:04 PM
I run an online music magazine. On every article, show review and feature, we use one column on the webpage to list 15-20 upcoming shows and venue information. Unfortunately, due to the repetitive text, Google inevitably thinks each webpage is about the 15-20 shows listed (particularly the repeated venue information) and not about the article, show review or feature.
We're now using other techniques to emphasize the content in the articles (title tags, header tags, meta description/keywords, bolding) but I'd really like it if I could exclude the column with the upcoming shows and venue information from being indexed. If that could be achieved, Google might instead focus on the true content and the keywords inside!
Any ideas?
Apparently, you can just add a META tag (I didn't know they were relevant) in your <HEAD> section:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Another way, if your host supports it, is through .htaccess.
posted by spiderskull at 7:12 PM on June 9, 2008
Robots.txt and .htaccess seem valid only if you want to block bots from indexing your *entire* page. I want to exclude only a certain part of the page.
Am I missing something?
posted by jrholt at 7:20 PM on June 9, 2008
You can't do that with just robots.txt or the .htaccess file. The only way to do it is to have two versions of your page. In the code of the page, you detect the requesting user agent, and if it identifies itself as Googlebot, you don't display certain parts of the page. It's called cloaking.
posted by McSly at 7:27 PM on June 9, 2008
" I want to exclude only a certain part of the page. Am I missing something?"
Don't serve that part of the page to Google. The address ranges and User-Agent headers are easily recognizable, just serve up that section of the page conditionally.
posted by majick at 7:27 PM on June 9, 2008
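For anyone curious what that conditional serving looks like in practice, here's a rough sketch in Python (the bot pattern, helper names, and HTML fragment are all hypothetical, not anyone's actual site code). Fair warning: Google considers serving crawlers different content "cloaking" and may penalize sites for it.

```python
# Sketch of user-agent-based conditional serving. Hypothetical helper
# names and HTML fragment. Caution: Google calls this "cloaking" and
# may penalize sites that do it.
import re

# A few common crawler signatures; extend as needed.
BOT_PATTERN = re.compile(r"googlebot|msnbot|slurp|teoma", re.IGNORECASE)

def is_crawler(user_agent):
    """True if the User-Agent header looks like a search-engine bot."""
    return bool(BOT_PATTERN.search(user_agent or ""))

def render_shows_column(user_agent):
    """Return the sidebar HTML for humans, or nothing for crawlers."""
    if is_crawler(user_agent):
        return ""
    return '<div id="upcoming-shows"><!-- 15-20 shows listed here --></div>'

print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_crawler("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"))  # False
```

Matching on address ranges as majick suggests is more robust than the User-Agent string alone, since anyone can spoof a UA header.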
Just wondering: would using "nofollow" on the pages with the show lists (or rel=nofollow on the show links) reduce the effect those links have on the page's overall rank?
posted by Lazlo at 7:33 PM on June 9, 2008
Oh I see, I misunderstood the question. Does anyone here know if the Google bots run Javascript at all? Because if they don't, you can look into adding the show information via a client-side include.
Scroll down about midway on this page, which shows how you can include other files. Then make the included file blocked via .htaccess or robots.txt.
posted by spiderskull at 7:39 PM on June 9, 2008
use javascript to write the non-indexable info to the page on the fly! the googlebot won't get that.
posted by soma lkzx at 8:15 PM on June 9, 2008
I'm pretty sure that there's no way to accomplish this that will ultimately make Google happy with you. You might find a method that'll work for a time, but the bot is ever-evolving, and when Google finds that you're trying to hide things from Googlebot, I gather they go a bit nuclear on your pagerank under the assumption that you have ulterior motives. You don't want that to happen, obviously. Better if you continue the white hat methods you're currently using.
Googlebot makes some assumptions about content importance based on its position within the document. Closer to the top = more topical, closer to the bottom = not so much. So, if you can make the column of gigs one of the very last things in your html, and then position it where you want it via CSS, that might help to reduce its importance in the eyes of Googlebot.
If you're not already, get hooked up with Google Webmaster Tools and start submitting site maps, to help the bot find its way around. Also read the Google Webmaster Central Blog.
When I was running Google Ads, there was a way to assign page areas different weights, to help the AdSense engine determine what your page was most about, so that it could more closely target the ads. I'm not sure if Googlebot pays any attention to that or not, but it might be worth investigating.
If all else fails, you might consider presenting the textual upcoming show content in a non-text format that the bot can't parse. Graphic, Flash, etc. It's certainly not ideal, but you could develop something to automate image generation server-side.
posted by mumkin at 8:48 PM on June 9, 2008
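As a sketch of that last idea, the show list could be rendered to an image server-side with the Pillow imaging library (the helper name and show strings below are made up for illustration; a real version would need font, caching, and accessibility decisions):

```python
# Sketch of server-side image generation for the shows column, using
# the Pillow imaging library. Helper name and show strings are
# hypothetical examples.
from PIL import Image, ImageDraw

def render_shows_image(shows, width=300, line_height=18):
    """Draw one show per line onto a white image that bots can't read as text."""
    height = line_height * len(shows) + 10
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, show in enumerate(shows):
        draw.text((5, 5 + i * line_height), show, fill="black")
    return img

shows = ["6/12  Band A at Venue X", "6/14  Band B at Venue Y"]
render_shows_image(shows).save("upcoming_shows.png")
```

The obvious downside: with no real text (or alt text), the listings become invisible to screen readers and to visitors searching for those shows.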
Here's what my robots.txt file looks like:
User-Agent: *
Disallow: /
User-Agent: Googlebot
Disallow: /magicdir1/
Disallow: /magicdir2/
Allow: /
User-Agent: MSNBot
Disallow: /magicdir1/
Disallow: /magicdir2/
Allow: /
User-Agent: AskJeeves
Disallow: /magicdir1/
Disallow: /magicdir2/
Allow: /
The first two lines block all compliant bots except the three I explicitly permit. ("AskJeeves" is Ask.com's crawler.) You can have as many Disallow lines as you want, followed by the "Allow: /" line, and the bot in question will avoid all the disallowed paths and crawl everything else.
The path is from the web root. On my linux server, that's "/home/groups/home/web". So if the above was interpreted literally, it would exclude /home/groups/home/web/magicdir1/ and /home/groups/home/web/magicdir2/ but permit the Googlebot to see everything else in the web directory.
It isn't a perfect solution. Many bots ignore the robots.txt file entirely. Oddly, some of them read it, and then ignore it. But my experience is that those three are well behaved, because I've stopped getting search hits on the directories I excluded this way from all three of them.
posted by Class Goat at 9:03 PM on June 9, 2008
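Rules like those can be sanity-checked locally with Python's standard-library robots.txt parser, which honors the same Allow/Disallow lines (directory names copied from the example above):

```python
# Sanity-check the robots.txt rules shown above using the standard
# library's parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: *
Disallow: /

User-Agent: Googlebot
Disallow: /magicdir1/
Disallow: /magicdir2/
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot may fetch ordinary pages but not the excluded directories.
print(rp.can_fetch("Googlebot", "/articles/some-review.html"))  # True
print(rp.can_fetch("Googlebot", "/magicdir1/shows.html"))       # False
# Bots with no entry of their own fall through to the "*" block.
print(rp.can_fetch("RandomBot", "/articles/some-review.html"))  # False
```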
By the way, the trailing slash on the disallow lines is essential.
posted by Class Goat at 9:04 PM on June 9, 2008
"I'm pretty sure that there's no way to accomplish this that will ultimately make Google happy with you."
Not so. Check soma lkzx's answer, which will work flawlessly until googlebot starts executing Javascript (don't count on it). That's the method I'd recommend.
posted by toomuchpete at 9:16 PM on June 9, 2008
Use CSS magic. You can float content to any part of the page. So the HTML would physically have the article content first and the shows column second, but the column would still appear on the right side of the page. Nothing an average web designer can't hack together for you in an afternoon.
posted by bprater at 9:19 PM on June 9, 2008
Another solution would be to do the non-indexable stuff as an IFrame and load the IFrame from a URL that is excluded from Googlebot.
posted by mmascolino at 9:49 PM on June 9, 2008
Seconding mmascolino's suggestion. This is the cleanest, in my opinion.
posted by zippy at 12:41 AM on June 10, 2008
thirding the iframe answer, because it's the simplest answer that degrades gracefully, without getting into server-side IP-sniffing wizardry.
posted by Smoosh Faced Lion at 1:27 PM on June 10, 2008
This thread is closed to new comments.
posted by knave at 7:12 PM on June 9, 2008