Verity documents visible on Google?
January 12, 2006 6:24 AM   Subscribe

How can we make documents stored on our password-protected Verity website visible to Google?

Looking for pointers / solutions / products on the following:

We have a large body of PDF documents in a Verity database which is on the web, but password-protected.

The solution we're looking for would make those PDFs visible to Google and other search engines, so that searches on suitably specific keywords produce links to our website. Those with valid password cookies could click on the Google link and get the PDF directly. Those without would be redirected to our login page. "No cache" instructions would prevent Google from serving the PDF (or an HTML mirror of it) directly to those without privileges.
posted by MattD to Computers & Internet (6 answers total)
 
Assuming you're using Apache, you could check whether the user-agent string contains "Googlebot/2.1" (this would work for Google; I'm sure Yahoo etc. have similar unique user agents).

This would also allow access to anyone who has faked their user agent, but only a small percentage of web users do that.
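A minimal sketch of that check in Python (the crawler substrings and function names here are illustrative, not from the thread; a real setup might do this in Apache config instead):

```python
# Serve the PDF to known crawler user agents and to visitors with a
# valid session cookie; send everyone else to the login page.
# Substring matching on the user agent is easily spoofed -- see the
# later answers about verifying crawler IP addresses.

CRAWLER_SUBSTRINGS = ("Googlebot", "Slurp", "msnbot")

def route_request(user_agent, has_valid_cookie):
    """Decide which response to send for a protected PDF request."""
    if has_valid_cookie:
        return "serve_pdf"          # authenticated visitor
    if any(bot in user_agent for bot in CRAWLER_SUBSTRINGS):
        return "serve_pdf"          # let the crawler index the document
    return "redirect_to_login"      # ordinary visitor without a session

print(route_request("Mozilla/5.0 (compatible; Googlebot/2.1)", False))
```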
posted by null terminated at 6:37 AM on January 12, 2006


Wouldn't creating an open, regular HTML page for each document, containing an abstract for instance, solve the problem? If the person has the cookie, a script can redirect the request directly, the same way you describe. If not, the person sees the abstract page (with a nice login box on the side).
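This approach has the advantage of showing the same page to crawlers and to logged-out users, so it isn't cloaking. A rough sketch of the routing (paths and names are hypothetical):

```python
def handle_document_request(has_valid_cookie, doc_id, abstracts):
    """Cookie holders get the PDF; everyone else, crawlers included,
    sees a public abstract page with a login box.  `abstracts` maps
    document IDs to their public abstract text."""
    if has_valid_cookie:
        return ("pdf", "/docs/%s.pdf" % doc_id)
    return ("abstract", abstracts.get(doc_id, "No abstract available"))
```

Google then indexes the abstract pages, which is honest about what a logged-out searcher will actually see when they click through.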
posted by nkyad at 6:46 AM on January 12, 2006


Can't you get blacklisted by Google for serving them different information than you serve average web users? This might have changed, though. According to a blog post about Google and cloaking, G's FAQ used to say:
The term "cloaking" is used to describe a website that returns altered webpages to search engines crawling the site. In other words, the webserver is programmed to return different content to Google than it returns to regular users, usually in an attempt to distort search engine rankings. This can mislead users about what they'll find when they click on a search result. To preserve the accuracy and quality of our search results, Google may permanently ban from our index any sites or site authors that engage in cloaking to distort their search rankings.
but I can't find that information in their current FAQs. Right now, the only thing I can find is in their guidelines:
Make pages for users, not for search engines. Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."
posted by revgeorge at 6:55 AM on January 12, 2006


Previous questions on the subject:

How does Google collect content that is subscription-only?

I have been noticing more and more websites using Google to index their restricted content.

Some good answers in there. The way to do it seems to be checking the UserAgent string together with the known Google IP addresses.
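One robust way to tie the user agent to a real Google IP is a reverse-then-forward DNS check: resolve the connecting IP to a hostname, confirm it is under googlebot.com or google.com, then resolve that hostname back and confirm it matches. A sketch (the resolver arguments are injectable only so the function can be exercised without network access; by default it uses the socket module):

```python
import socket

def is_verified_googlebot(ip, reverse_dns=None, forward_dns=None):
    """Verify a claimed Googlebot IP via reverse-then-forward DNS.
    Returns True only if the IP reverse-resolves to a Google hostname
    and that hostname forward-resolves back to the same IP."""
    reverse_dns = reverse_dns or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_dns = forward_dns or socket.gethostbyname
    try:
        host = reverse_dns(ip)
    except OSError:
        return False                      # no reverse record: not Google
    if not host.endswith((".googlebot.com", ".google.com")):
        return False                      # reverse record isn't Google's
    try:
        return forward_dns(host) == ip    # confirm the round trip
    except OSError:
        return False
```

An attacker can fake a user-agent header but cannot make Google's DNS vouch for their IP, which is why this is stronger than the string check alone.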

Also note that this may violate Google's guidelines and could result in you getting banned from the index.
posted by blag at 7:30 AM on January 12, 2006


revgeorge: try this page:
However, certain actions such as cloaking, writing text that can be seen by search engines but not by users, or setting up pages/links with the sole purpose of fooling search engines may result in permanent removal from our index.
posted by blag at 7:33 AM on January 12, 2006


It should also be noted that this kind of thing is done all the time, and by technology heavyweights (see the ACM).
posted by null terminated at 8:05 AM on January 12, 2006


This thread is closed to new comments.