How does Google snag restricted pdf content?
December 30, 2005 11:13 PM

How does Google collect content that is subscription-only? I've found that I can often read the cached html versions of pdf files from subscription-only sites even when I can't get the pdf itself. How do they do that?
posted by shoos to Computers & Internet (12 answers total)
Basically, robots.txt. If you told the site you were a web spider, instead of a browser, you could probably see the content.

It's how they get people to feel there's something there to buy a subscription to.
posted by dhartung at 11:43 PM on December 30, 2005
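
For reference, robots.txt is purely advisory: it asks well-behaved crawlers to stay away from certain paths, and it cannot grant anyone access to restricted content. A minimal example (paths hypothetical) that excludes everyone but Googlebot:

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /articles/
```

An empty Disallow line means "nothing is off limits" for that crawler; the second group asks all other crawlers to skip /articles/.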

Robots.txt can only keep spiders out; it cannot permit them to view content that would otherwise be forbidden to a human surfer. So that is really not relevant here.

In this case the server has to examine the user-agent. Many sites that have restricted content will bypass those restrictions if the user-agent is "Googlebot/2.1" or similar. If you use Firefox there is an extension (appropriately named User Agent Switcher) that lets you set this value in your browser on the fly, so you can test this phenomenon if you encounter such a site.

However, this little tidbit has been known for quite some time, and I wouldn't be surprised if some sites out there long ago stopped blindly trusting the user-agent string -- since after all it is user-supplied data and can be set to anything. Since the netblocks that Google's spider uses are relatively easy to determine from looking at the logs, I wouldn't be surprised at all if some sites now require both the Googlebot user-agent and the Googlebot netblock.
posted by Rhomboid at 12:10 AM on December 31, 2005
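
A server-side check like the one Rhomboid describes might look something like this (a hypothetical sketch, not any particular site's code):

```python
def serve_article(user_agent: str, is_subscriber: bool) -> str:
    """Hypothetical handler: grant full text to anything claiming to be
    Googlebot, otherwise enforce the subscription check."""
    # The User-Agent header is entirely client-supplied, so this check
    # is trivially spoofable (e.g. with Firefox's User Agent Switcher).
    if user_agent.startswith("Googlebot"):
        return "full article text"
    if is_subscriber:
        return "full article text"
    return "please subscribe"

print(serve_article("Googlebot/2.1 (+http://www.google.com/bot.html)", False))
# A normal browser UA without a subscription gets the paywall page instead.
print(serve_article("Mozilla/5.0 (Windows; ...)", False))
```

This is exactly why setting your browser's user-agent to "Googlebot/2.1" can open up such a site: the server has no other way to tell the two apart.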

And just so everyone is clear... it's not Google itself that is doing anything; it is that the sites themselves are trying to recognise the Google spider when it indexes the site and grant it the desired access. So the question should not be "how does Google do this" but "how (and why) do the sites in question do this".
posted by Rhomboid at 12:12 AM on December 31, 2005

In the same vein, since it is simple to change your user-agent, said sites may have just switched to looking at the remote host and checking whether its address falls within one of Google's IP blocks. It's rather well known which IP addresses are used by the bots, and IPs are much harder to spoof than user agents.

Just blind conjecture, based on the other comments here.
posted by disillusioned at 3:26 AM on December 31, 2005
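
The netblock check disillusioned is guessing at could be sketched like this (a hypothetical example; the range below is one often associated with Google's crawlers, used here purely for illustration):

```python
import ipaddress

# Hypothetical list of crawler netblocks; a site would have to harvest
# these from its own server logs.
CRAWLER_NETBLOCKS = [ipaddress.ip_network("66.249.64.0/19")]

def ip_in_crawler_netblock(remote_addr: str) -> bool:
    """True if the connecting IP falls inside a known crawler netblock.
    Unlike the User-Agent string, the source IP of a TCP connection is
    hard to spoof, since the server's replies have to reach it."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in CRAWLER_NETBLOCKS)

print(ip_in_crawler_netblock("66.249.66.1"))  # inside the example block
print(ip_in_crawler_netblock("192.0.2.10"))   # outside it
```

A site combining this with the user-agent test would only unlock content when both checks pass, which defeats the browser-extension trick.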

This is called cloaking. You, as a content provider, try to find a way to offer the Googlebot the full text of your article but leave the rest of it behind a paywall for humans. You can see this in action in this page of Google search results where Nature explicitly says that it provides the full text of the article to the Googlebot. This technique is generally thought of as an uncool thing to do, and the Google FAQ says this:

"Make pages for users, not for search engines. Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."

and also this in the "why was my page removed?" section.

"However, certain actions such as cloaking, writing text that can be seen by search engines but not by users, or setting up pages/links with the sole purpose of fooling search engines may result in permanent removal from our index."

I'm not sure if it's okay to do this if you explicitly say that you do, a la Nature. You can see another example where this set of search results gets you to this page where, if you view the source code, you'll see excerpts of the article (especially ones with the keywords in them) hidden in over 150 invisible divs. This is sort of a brute force way of doing this, but it's still presenting one form of the site to the Googlebot and one form to the human viewer of the site and is frowned upon.
posted by jessamyn at 5:28 AM on December 31, 2005
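
The invisible-div variant jessamyn describes at the end keeps the excerpts in the markup, where a crawler will index them, but hides them from human eyes, along these lines (hypothetical markup):

```
<!-- Visible to everyone -->
<p>Abstract: ... <a href="/subscribe">Subscribe to read the full text</a></p>

<!-- In the page source for the crawler, but never rendered for the reader -->
<div style="display: none">keyword-rich excerpt from the article ...</div>
<div style="display: none">another excerpt containing the search terms ...</div>
```

Since everyone receives the same HTML, this isn't user-agent sniffing at all, but the effect is the same: the index sees text the visitor never does, which is why Google's guidelines lump it in with cloaking.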

I'm not clear on why it's considered uncool. I mean, sure, ideally everybody could read everything, but it's silly to expect journals that charge huge amounts for subscriptions to put everything up for free on the internet; isn't it better to at least be able to get glimpses via Google and have some idea of what's there (so you can find the journal in the library if need be) than to have it completely off the radar of the internet? What's the thinking on Google's part (or on others who feel the same way)?
posted by languagehat at 7:16 AM on December 31, 2005

languagehat writes "What's the thinking on Google's part"

If another search engine figures out a way to automatically remove cloaked content and gets known for that ability, they could have an advantage over Google. I know it pisses me off to no end to click thru to a page that has nothing to do with what the search results indicated. I see this kind of stuff out of link farms much more so than out of subscription sites.
posted by Mitheral at 7:36 AM on December 31, 2005

I think some of the scholarly fulltext that Google indexes is the result of this CrossRef-Google pilot program. Nature is listed as one of the participants -- so it's all actually sanctioned by Google.
posted by Carol O at 7:40 AM on December 31, 2005

I think it's also part of the search engine optimization industry vs. Google issue. Cloaking fee-based content is one issue [and that issue is mostly a usability one: I don't want high-ranking results in Google pointing to content that I can't see/obtain, mostly because my take on Google is that it's indexing content on the web, not stuff that's behind a pay wall. If Google changes their tune about that, I could accept it, but for now I expect results to point to things I can access.] but search engine optimization is another, more nefarious way that sites end-run Google's search algorithms. When I search for, say, Bethel Vermont where I live, most of the links I find via Google are just to those crummy sites that have some sort of atlas-based list of everyplace USA. "Find a florist in Bethel Vermont!" even though there are no florists here, not one. Since Google is in the business of selling ads, they want to keep people on their site, and people doing crafty things to appear high in "relevance" rankings when they're not at all relevant is unuseful to Google long-term.
posted by jessamyn at 7:48 AM on December 31, 2005

It's an uncool thing to do because it's mixed in with open, free content on the Web and wastes my time discovering there's a wall there. It would be okay if Google marked the subscription content in some way (hell, Google News does it for registration-based viewing), so that I could decide on the spot without wasting any time.
posted by rolypolyman at 9:01 AM on December 31, 2005 [1 favorite]

Also a related post.
posted by rolypolyman at 9:04 AM on December 31, 2005

Some sites use a Javascript-based authorization process - the Javascript looks for your subscriber cookie, and if it doesn't exist you're redirected to the login page.

Of course, that makes it very simple to bypass - just turn off Javascript in your browser. Knowing that can save serious dollars (one site I use this trick on normally costs CAD 90 to subscribe to).
posted by lowlife at 12:42 PM on December 31, 2005
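
The check lowlife describes runs in client-side JavaScript, which is why simply disabling JavaScript defeats it. The logic itself is trivial; here it is sketched in Python (cookie name and value are hypothetical). Done on the server like this, against the raw Cookie header, it could not be bypassed by turning scripts off:

```python
def is_subscriber(cookie_header: str) -> bool:
    """Look for a subscriber cookie in a raw Cookie header string.
    The sites lowlife mentions do this check in the browser with
    JavaScript, so a client that never runs the script is never
    redirected to the login page."""
    cookies = dict(
        pair.strip().split("=", 1)
        for pair in cookie_header.split(";")
        if "=" in pair
    )
    return cookies.get("subscriber") == "1"  # hypothetical name/value

print(is_subscriber("session=abc123; subscriber=1"))  # True
print(is_subscriber("session=abc123"))                # False
```

The design lesson is the same one running through this whole thread: any access check that trusts what the client sends (a user-agent string) or runs on the client (a JavaScript cookie test) can be sidestepped by a client that lies or refuses to play along.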
