Retrieving web content from Google cache?
November 24, 2005 4:20 AM

Is it likely that Google would only cache one page out of a multi-page article?

I've been searching for online reference material for a post I'm writing. I've discovered what looks to be an interesting article dealing with the subject I'm writing about, and I would love to read it in its entirety. Unfortunately, it appears the article is no longer available on the site in question, and Google only returns a cache of one page from the article (which is spread over 8 pages). I've tried a couple of different techniques suggested on various websites for explicitly requesting other pages from the article from Google's cache, without any luck (and also no luck from archive.org). So, I'm wondering: is Google's cache a little haphazard in this way? Are there any other web cache repositories out there that might be worth trying, or do I have to accept that the majority of the article is simply out of my reach?
posted by planetthoughtful to Computers & Internet (9 answers total)
 
i doubt, cool as google is, that it knows about "articles". it probably works with clustering statistics that will generally group things like articles together, but, unlike a human, it's not got the intelligence to really read and understand the links and so be sure that a group of pages are related in the way we understand by "an article".

so it probably works in a way that tends to index/cache whole articles, but i doubt that it can guarantee it (unless it's hand-tuned for a few particular sites).

i guess you've tried looking for the cache of any "printer friendly" page? and the wayback machine (is that the same as archive.org?)? maybe if you posted the url other people could try too?
posted by andrew cooke at 4:30 AM on November 24, 2005
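
For anyone who wants to script the Wayback Machine check suggested above, here is a minimal sketch. It assumes the Internet Archive's availability endpoint at archive.org/wayback/available, and it assumes the article's pages follow a predictable URL pattern. The pattern used below is hypothetical, since the real URL isn't given in this thread, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def candidate_pages(url_template, page_count):
    """Expand a (hypothetical) URL pattern into one URL per article page."""
    return [url_template.format(page=n) for n in range(1, page_count + 1)]


def availability_url(page_url, timestamp=None):
    """Build the query URL for the Wayback Machine availability endpoint."""
    params = {"url": page_url}
    if timestamp:
        # YYYYMMDD: ask for the snapshot closest to this date
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url, timestamp=None):
    """Query the endpoint over the network for a single page."""
    with urlopen(availability_url(page_url, timestamp)) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling lookup() on each URL from candidate_pages("http://example.com/article/page{page}.html", 8) would show which of the 8 pages, if any, have an archived copy. A None result only means no snapshot was found for that exact URL.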


in case you don't succeed fishing the cache of your article: a method that has helped retrieve articles no longer on the net is to find the author (google) and ask (email) for a copy.
posted by mirileh at 5:03 AM on November 24, 2005


Response by poster: The link to the cached page itself is here.

As you can see, if you visit the link, there are 7 other pages in the article, but I haven't been able to find any way of retrieving them.

Many thanks to anyone who gives it some thought.
posted by planetthoughtful at 9:14 AM on November 24, 2005


It depends on a lot of things I'd imagine:

- the robots.txt of the site
- presence of rel="nofollow" on links, or robots meta tags that tell spiders to go away
- how well the site is internally linked
- needing an account/cookies
- captchas
- etc

If a webmaster wanted to deliberately make it such that google only cached the first page then it would be nearly trivial.

But in this case, from looking at the link, it's clear that this is an excerpt from an academic journal, and the full-text articles of those are usually not freely available. So I would not be surprised that you can't find the whole thing. The page that you linked looks like it was a copy saved by someone who was using a PC in a university library that had a subscription, and then they used furl to save it. In other words, it's a fluke.
posted by Rhomboid at 9:59 AM on November 24, 2005
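
To illustrate the robots.txt point above, here is a rough sketch of how a site could let crawlers see only the first page of an article. The robots.txt contents, the URL layout, and the "ExampleBot" user agent are all hypothetical. The check uses Python's standard urllib.robotparser, which applies rules in file order; real crawlers may resolve Allow/Disallow conflicts differently, though for this file the outcome should be the same.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: allow page 1 of the article, block everything else
# under /article/.
ROBOTS_TXT = """\
User-agent: *
Allow: /article/page1.html
Disallow: /article/
"""


def allowed_pages(robots_txt, urls, agent="ExampleBot"):
    """Return the subset of urls that the given robots.txt lets agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(agent, u)]


# Eight pages of a made-up article on a made-up site.
ARTICLE_URLS = [
    f"http://example.com/article/page{n}.html" for n in range(1, 9)
]
```

With these rules, allowed_pages(ROBOTS_TXT, ARTICLE_URLS) leaves only page1.html, so a well-behaved spider would never fetch pages 2 through 8.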


Response by poster: The page that you linked looks like it was a copy saved by someone who was using a PC in a university library that had a subscription, and then they used furl to save it. In other words, it's a fluke.

Actually, I think that's a LookSmart feature offered on every page you view in LS. That is, you can save a link to the page / article in furl while viewing pages on one of the LS sites.

What threw me was the fact that it's the 2nd page of the article. So, I assumed that many of the caveats you have above wouldn't be applicable, unless they've got a very odd cache policy / approach.

But thanks for the observations all the same.
posted by planetthoughtful at 11:17 AM on November 24, 2005


The homepage of Elizabeth Tucker, who wrote the article you’re looking for. As mirileh suggests, you can email Dr. Tucker to ask for a copy.
posted by cgc373 at 11:27 AM on November 24, 2005


Searching in Google for inurl:ai_n13637915 also brings up page 5 (these are the only pages that Google has in its index). Due to some bug at Google, clicking on the cached link doesn't seem to work at the moment.
posted by Sharcho at 1:23 PM on November 24, 2005
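
The inurl: search above can also be built as a plain URL, which is handy if you want to try several article identifiers. This is only a sketch: the helper name is made up, and it assumes Google's usual /search?q= query format.

```python
from urllib.parse import urlencode


def google_search_url(term, operator=None):
    """Build a Google search URL, optionally prefixing a search operator."""
    query = f"{operator}:{term}" if operator else term
    return "https://www.google.com/search?" + urlencode({"q": query})
```

For example, google_search_url("ai_n13637915", "inurl") produces the same query as typing inurl:ai_n13637915 into the search box.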


The original article is in here, but it can only be accessed from one of the subscribed institutions.
I have access, and I can e-mail you the PDF if you want. It's 256 KB.
posted by easternblot at 1:53 PM on November 24, 2005


You can download it here via the Coralized link.
posted by Sharcho at 5:50 PM on November 24, 2005

