how do i make my blog invisible?
December 8, 2005 11:43 PM

how do I make my blog be invisible to search engines?

hi,

i want my blog to not be something people pull up with google and the like. i put up a robots.txt file but it didn't seem to help. here it is:

# Robots.txt file from http://www.searchengineworld.com
#
# Bans all robots from spidering the domain

User-agent: *
Disallow: /

any ideas? thanks.
posted by aussicht to Computers & Internet (20 answers total)
 
Robots.txt is the correct approach. How long ago did you put up your robots.txt file? It won't take effect until Google's spider comes back and re-crawls your site (same for other search engines). I'd wait at least a couple weeks before deciding it hasn't worked.
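
One way to confirm that the rules themselves say what you intend, independent of when a crawler next visits, is to feed them to a parser. Below is a minimal sketch using Python's standard `urllib.robotparser`; the `example.com` URLs and the `is_blocked` helper are placeholders made up for illustration. It only shows what a compliant crawler would conclude from the file, not what any particular search engine has actually done.

```python
from urllib.robotparser import RobotFileParser

# The rules from the question, as a list of lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
""".splitlines()


def is_blocked(user_agent, url, robots_lines=ROBOTS_TXT):
    """Return True if the given robots.txt lines forbid user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory lines; no network access
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not the asker's blog.
    print(is_blocked("Googlebot", "http://example.com/"))
    print(is_blocked("Googlebot", "http://example.com/2005/12/some-post/"))
```

If both checks report the URLs as blocked, the file is doing its job and the remaining delay is on the search engine's side.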
posted by kindall at 11:50 PM on December 8, 2005


it has been weeks maybe a month or so and it still pulls me up if you search for toddhu
posted by aussicht at 11:59 PM on December 8, 2005


You can also add <meta name="robots" content="noindex,nofollow" /> to your HTML header. It serves much the same purpose as robots.txt, and it also accepts one other value, "noarchive" (which can't be added to robots.txt), that tells sites like Google and the Internet Archive not to cache the pages.
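
To double-check that a page really carries the tag after a template edit, a rough sketch like the following can list the directives found in a page's HTML. It uses only the standard `html.parser` module; the `RobotsMetaParser` class, the `robots_directives` helper, and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) also reach this method.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page markup, not the asker's actual template.
SAMPLE = (
    '<html><head>'
    '<meta name="robots" content="noindex,nofollow,noarchive" />'
    '</head><body></body></html>'
)

if __name__ == "__main__":
    print(sorted(robots_directives(SAMPLE)))
```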

However, I've seen things in my site logs that look like robots and spiders either not hitting or ignoring robots.txt, so if you're looking for complete invisibility, I wouldn't assume this is an infallible method, even though it will work for the major search engines.

If you want your site to be really inaccessible to accidental or uninvited visitors, I would suggest putting it on a page that's not the default site index, with no links to it from anywhere, and/or using password protection.
posted by camcgee at 12:10 AM on December 9, 2005


Meta tags will also help, but note that while the major search engines (Google, Yahoo, MSN, etc.) will probably respect your robots.txt and meta tags, nothing requires a search engine or other spider to do so. The only sure way is to limit access to it by password, IP, etc.
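
As a rough illustration of what limiting access by password or IP involves, here is a sketch of the two checks a server could run before serving a page. In practice the web server's own configuration usually handles this; the function names, the credentials, and the network range below are all made-up placeholders.

```python
import base64
import hmac
import ipaddress

# Placeholder values for illustration only.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # reserved documentation range
EXPECTED_USER = "reader"
EXPECTED_PASSWORD = "change-me"


def ip_allowed(remote_addr):
    """Return True if the client address falls inside an allowed network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in ALLOWED_NETWORKS)


def basic_auth_ok(authorization_header):
    """Return True if an HTTP Basic Authorization header carries the expected login."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(authorization_header[6:], validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and undecodable bytes
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    pass_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and pass_ok
```

Unlike robots.txt, these checks do not depend on the visitor's cooperation: a spider that fails them never receives the page.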
posted by gramcracker at 12:12 AM on December 9, 2005


Take "toddhu" out of the address. It looks like Google is honoring your robots.txt, as it provides no information about the URL, but it knows the URL exists and the query is matching the URL.

Note that if a billion sites on the web make a link to your blog with an anchor called "toddhu", Google may still pull you up. Again, it's honoring your request not to be spidered, but it knows a little about you anyway.
posted by trevyn at 12:13 AM on December 9, 2005


aussicht: it has been weeks maybe a month or so and it still pulls me up if you search for toddhu

It looks like Google spidered the page before you had the robots.txt page. Unfortunately, I'm not sure there's any way to get Google to "forget" links once they're in the index. (I can search and get results for my old sites that are long dead, though they only return the domain, just like it's doing with your search.)

One option would be to move the WP install to a new directory, so the Google link would go to 404.
posted by camcgee at 12:15 AM on December 9, 2005


Looks like Google has indeed stopped indexing your site. However, if the search term appears in the actual URL, sometimes it will show up in Google's index anyway (although with no description or title). I see this on and off for pages that I've excluded via robots.txt, and I'm not exactly sure why.

Google has a manual removal tool that you can try. You can also contact Google directly; they're often more helpful than you'd guess.

You should consider changing the URL of your blog to something that doesn't contain the search term you're trying to avoid. This is a good idea because otherwise people will still be able to easily find your site if anyone mentions you in *their* blog: "toddhu says that..."
posted by helios at 12:16 AM on December 9, 2005


helios -- thanks for the link to that manual removal tool. I had no idea that was available.
posted by camcgee at 12:20 AM on December 9, 2005


givin it a try! thanks so much.
posted by aussicht at 12:36 AM on December 9, 2005


I don't think you've waited long enough. You still have a number of pages in their index. These should go away on the next refresh of the index.
posted by Rhomboid at 1:06 AM on December 9, 2005


Hey aussicht, once you make changes to your page, it can be a while before Google picks them up. It took what seemed like forever for Google to realize my page didn't have my real name on it anymore. But sooner or later, it will.
posted by ThePinkSuperhero at 4:41 AM on December 9, 2005 [1 favorite]


There are poison pill tactics out there for robots who ignore robots.txt. The administrator at my local LUG uses them.

Essentially, he creates a page that adds the viewer to a blocked list of IPs. The only place that the page is referenced is a disallow in robots.txt. Any robot that reads it, then goes there anyways gets automatically banned.
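
The logic of such a trap can be sketched in a few lines. This is only an illustration of the idea, not the LUG administrator's actual setup: the `TrapBanList` class and the `/trap.html` path are hypothetical, and a real deployment would hook into the web server or firewall rather than keep the list in memory.

```python
TRAP_PATH = "/trap.html"  # hypothetical path, listed only as a Disallow in robots.txt


class TrapBanList:
    """Bans any client address that requests the trap path."""

    def __init__(self, trap_path=TRAP_PATH):
        self.trap_path = trap_path
        self.banned = set()

    def handle(self, ip, path):
        """Return the HTTP status the server would send for this request."""
        if ip in self.banned:
            return 403  # already banned: deny everything
        if path == self.trap_path:
            self.banned.add(ip)  # visited the disallowed page: ban from now on
            return 403
        return 200


if __name__ == "__main__":
    bans = TrapBanList()
    print(bans.handle("198.51.100.7", "/"))           # ordinary request
    print(bans.handle("198.51.100.7", "/trap.html"))  # robot ignores the Disallow
    print(bans.handle("198.51.100.7", "/"))           # same address, now denied
```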
posted by unixrat at 6:28 AM on December 9, 2005


unixrat, can you link to a write-up of this solution? i definitely have a use for it.
posted by camworld at 7:55 AM on December 9, 2005


I don't know about a write-up, but here's the robots.txt.

You'll see:
[...]
Disallow: /presentations/
Disallow: /webmail/
Disallow: /guestbook.shtml

And if you visit http://norlug.org/guestbook.shtml, the rest of the site will be denied to you.
posted by unixrat at 8:30 AM on December 9, 2005


If you'd like the admin's address, drop me a line.
posted by unixrat at 8:32 AM on December 9, 2005


Google will remove your page if you ask them, but it will take a while.
posted by craniac at 9:13 AM on December 9, 2005


Craniac has it right. If you ask them in e-mail, they'll take it out, but there is a slow bureaucratic period where nothing happens. I did it myself for one of my sites.

Note that not every search engine is going to honor robots.txt OR your requests, no matter how hard you try. It's almost impossible to disappear from the Web.
posted by Hildago at 10:44 AM on December 9, 2005


unixrat scribbled "The only place that the page is referenced is a disallow in robots.txt. Any robot that reads it, then goes there anyways gets automatically banned."

Note that this only works for robots that read robots.txt to glean links and then ignore the instructions within. You can do similar things with links on your main no-spider page that aren't visible to users but can be seen by robots. A well-behaved robot won't index the page because it is in your robots.txt, and users won't see the link because it is hidden in some way, but spiders that completely ignore robots.txt will follow the link and bannation will occur. The risk here is that web optimizers and screen readers will sometimes see the link or automatically follow it.
posted by Mitheral at 12:09 PM on December 9, 2005


If you're trying to make your blog basically only accessible to people that know the address, could you *not* have an index page? If you tell people "go to www.website.com/blogpage.html", I believe that will pretty much prevent any search engine from cataloging your site...unless someone else is linking to your site, in which case the bots will get in that way.
posted by edjusted at 7:48 PM on December 10, 2005


other people are linking to my site...
posted by aussicht at 10:36 AM on December 11, 2005

