What's the best practice for authenticating Google's crawlers?
September 17, 2008 10:27 AM

I manage a website. Some content requires authentication by password or IP. I want Google to crawl that content but not cache it, so that Google users can find it in searches but can't access it without authentication. What's the best way to do this? In 2006 Matt Cutts recommended doing a reverse DNS lookup to verify that a bot's name is in the googlebot.com domain, and then a forward DNS->IP lookup using that googlebot.com name (to thwart spoofers). Is that still the best solution? How do other people manage this?
posted by futility closet to Computers & Internet (9 answers total)
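For reference, the reverse-then-forward DNS check Matt Cutts described can be sketched as follows. This is a minimal Python sketch, not Google-published code; the accepted domain suffixes and the error handling are assumptions:

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Verify a claimed Googlebot IP: reverse DNS, domain check, forward DNS."""
    try:
        # Step 1: reverse lookup the IP to get a hostname.
        host, _, _ = socket.gethostbyaddr(ip)
        # Step 2: the hostname must be in the googlebot.com (or google.com) domain.
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 3: forward-resolve that hostname; the original IP must appear
        # among its addresses, which thwarts spoofed reverse DNS records.
        _, _, addrs = socket.gethostbyname_ex(host)
        return ip in addrs
    except OSError:  # covers failed reverse or forward lookups
        return False
```

Requests that fail either lookup, or whose hostname falls outside those domains, are treated as spoofers and denied.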
Doing this is explicitly against Google policy.
posted by dmd at 10:48 AM on September 17, 2008

Most people seem to use the user-agent string (like "Googlebot/2.1"), but some also factor in the IP address. Then they add the "noarchive" meta tag so Google won't cache the page. Serving the bot different content than ordinary users see is called "cloaking." It is rumored that Google has stealthy bots which are not labeled as such, just looking for people who infringe. At that point, it's an arms race between you (and the SEO community) and Google.

First off, this practice is sometimes used by hacked pharma link-farm sites to generate a high PageRank for their Viagra spam. When Google looks at the site, it ranks for the Viagra spam. When anyone else (like the owner of the victim machine) looks, they don't see that their site has been owned.

The second use is to lure people into a toll site. Frankly, if site owners continue this practice, I would like to propose the addition of a "badnetizen" meta tag, just so I can make a setting in Google to avoid all of those sites on general principle, as I loathe being lured into a paywalled site.

If Google finds you have been cloaking, you will get hammered for it. Badly. Also, people on Metafilter might object. Some folks specifically exclude sites which engage in this practice.
posted by adipocere at 10:58 AM on September 17, 2008
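The user-agent check and noarchive tag mentioned above amount to something like this (a naive sketch; the function name is illustrative, and the check is trivially spoofable, which is why the DNS verification in the question is the safer test):

```python
# The robots meta tag that asks Google not to keep a cached copy of the page.
NOARCHIVE_TAG = '<meta name="robots" content="noarchive">'

def looks_like_googlebot(user_agent: str) -> bool:
    """Naive User-Agent sniff: trusts whatever string the client sends."""
    return "Googlebot" in user_agent
```

Anyone can send a Googlebot user-agent string, so a site relying on this alone is easy to fool.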

Yeah, that'd be cloaking, and Google frowns upon it but, in some special cases, has let it go for sites that are big enough (think of the New York Times, which requires registration to reach archived articles).
posted by wangarific at 11:24 AM on September 17, 2008

I should clarify -- this is for a scientific magazine, essentially following the NYT model. Nothing nefarious, we just want people to be able to find our archived content.
posted by futility closet at 11:32 AM on September 17, 2008

Google supports something like this via their First Click Free feature.
posted by Good Brain at 11:55 AM on September 17, 2008

What you're trying to do is cloaking and is against the guidelines:
If the file that Googlebot sees is not identical to the file that a typical user sees, then you're in a high-risk category.
-- Official google webmaster blog
The fact that NYT seems to get away with it because they're large does not change the fact that it's still a very hostile and user-unfriendly thing to do.
posted by Rhomboid at 11:55 AM on September 17, 2008

Actually, First Click Free is not so much a feature as it is a policy.

If you serve pages to google that aren't visible to the general public without authentication, you won't fall afoul of their policy against cloaking as long as you provide the whole page to users without authentication when they visit via a link on Google's search results page.

I believe that this is essentially what the NYT is doing now: no matter where the inbound link comes from, you can read the whole article once, but if you return, you have to sign in.
posted by Good Brain at 12:03 PM on September 17, 2008
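The policy Good Brain describes boils down to a referrer check. Assuming the rule is "full page for visitors arriving from a Google results page, otherwise require sign-in," a sketch (this simplification skips the "first click only" tracking a real implementation would need):

```python
from urllib.parse import urlparse

def allow_full_article(referrer: str, authenticated: bool) -> bool:
    """First-click-free sketch: authenticated users always get the page;
    unauthenticated visitors get it only when they arrive from Google."""
    if authenticated:
        return True
    # Grant the free view when the Referer header points at a Google host.
    host = urlparse(referrer).netloc.lower()
    return host == "google.com" or host.endswith(".google.com")
```

A returning visitor who navigates directly (empty or non-Google referrer) would then hit the sign-in wall.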

I just can't stop.

Given your focus, you might see if you can get listed in Google Scholar's index. It looks like their policy is a little different. You may be able to have them index the full text of your article, while only displaying an abstract to users. Plus, it looks like items from the Google Scholar index may be included in the general search results.
posted by Good Brain at 12:08 PM on September 17, 2008

Sounds like First Click Free or Google Scholar is what I want. Thanks, everyone.
posted by futility closet at 1:01 PM on September 17, 2008

This thread is closed to new comments.