How to scrape the Google cache
May 18, 2012 8:52 AM

Had a plan to scrape a website, but now it's down indefinitely. Google has the site cached, but this makes things kind of complicated. Newbie questions about scraping websites and using the Google cache inside.

I've read this question - should I be trying to use Warrick?

I'm starting a small side project that involves scraping a site and doing some analysis/visualization of the resulting dataset. I'm planning on using Python and Beautiful Soup to do the scraping - the site is laid out very consistently and looks like it'd be easy to scrape/parse and would make for a good learning exercise (I'm new to this, so apologies for any incorrect terminology).

Unfortunately, the timing of this idea coincided with the very recent (and permanent) shutdown of the site's servers. I still want to follow through with this, but this additional complication has thrown a bit of a wrench into the works. Google has the site cached, and I'd like to see if I can piece together the data using what's archived.

The site is basically structured like a blog with many many posts. The first page shows the 20 latest posts, with older posts pushed back to subsequent pages. The URL structure is static and follows this format:
If post 1 is the most recent post, at any given time, "website.com/page/2/" shows posts 21-40, "website.com/page/3/" shows posts 41-60, and so on. When a new post is submitted, it becomes post 1, and everything is pushed back one.
Posting frequency is probably ~30 posts/day. My problem is that not all of the pages were last crawled on the same date, which results in some overlaps and some gaps, because the page content shifts between archive dates (rough arithmetic below). I'm not concerned about overlaps or about having the most recent crawl, but in an ideal world I'd have a dataset covering every post made over the past year or so, with no gaps in dates. I don't think that's possible, though.
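To make the shifting concrete, here's the back-of-the-envelope arithmetic I'm working from (a rough sketch assuming exactly 20 posts per page and a steady 30 posts/day, which the real site won't match perfectly):

POSTS_PER_PAGE = 20
POSTS_PER_DAY = 30

def page_for_post(post_number):
    # Which archive page post number N sits on at a given moment.
    return (post_number - 1) // POSTS_PER_PAGE + 1

# Post #25 sits on page 2 today; three days and ~90 new posts later it has
# drifted to page 6, which is why pages cached on different dates overlap
# or leave gaps relative to each other.
print(page_for_post(25))                       # 2
print(page_for_post(25 + 3 * POSTS_PER_DAY))   # 6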

1) Can I scrape the Google cache the same way I'd scrape the original site - i.e., use urllib and point the script at "http://webcache.googleusercontent.com/search?q=cache:http://website.com/page/2/" and so on for every existing page (roughly as sketched below)?
According to this, it looks like crawling the Google cache violates their Terms of Use. What's the next best alternative? Is using --wait=seconds kosher?
(The Wayback Machine doesn't have nearly as much archived as Google does.)
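For concreteness, here's roughly the loop I have in mind - an untested Python 2-style sketch, where the page range and the 10-second pause (my stand-in for wget's --wait) are just placeholders:

import time
import urllib

CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

def fetch_cached_page(page_number):
    # Fetch the Google-cached copy of one archive page and return the HTML.
    url = CACHE_PREFIX + "http://website.com/page/%d/" % page_number
    return urllib.urlopen(url).read()

for page in range(2, 7):                  # pages 2-6, just as an example
    html = fetch_cached_page(page)
    open("page-%03d.html" % page, "wb").write(html)
    time.sleep(10)                        # pause between requests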

2) Is there any way to access earlier cached versions of the same URL, or are they overwritten? My reasoning is that I could perhaps eliminate some of the gaps if I could scrape all versions of "website.com/page/3/" and just get rid of duplicate entries (because they'd inevitably end up on page 4 or 5 in a later crawl).
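The de-duplication step itself seems simple enough - something like this, assuming each scraped post comes out as a dict with a permalink (or some other unique value) I can key on:

def merge_snapshots(snapshots):
    # snapshots: one list of post dicts per cached copy of a page.
    # Keeps the first copy of each post seen, keyed on its permalink.
    merged = {}
    for snapshot in snapshots:
        for post in snapshot:
            merged.setdefault(post["permalink"], post)
    return list(merged.values())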

Like I said, I'm new to this whole area, so I'd also like to use this post as a sanity check - is anything I'm saying here wrong/impossible/etc.? Any other advice?

Thanks for your help!
posted by hot soup to Computers & Internet (4 answers total) 3 users marked this as a favorite
 
You may want to check out archive.org's Wayback Machine as well, which could help cover question 2) above.
posted by samsara at 9:27 AM on May 18, 2012


I just tried curl and Python's urllib2, and got a 403 Forbidden error with both when trying to retrieve from the Google cache. You'll need to change/spoof the User-Agent.
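For example, setting the header with urllib2 looks something like this (a sketch - any browser-like string should do, and I haven't checked which ones Google is happy with):

import urllib2

url = ("http://webcache.googleusercontent.com/search?q=cache:"
       "http://website.com/page/2/")
req = urllib2.Request(url)
# This particular User-Agent string is just an example.
req.add_header("User-Agent",
               "Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20100101 Firefox/12.0")
html = urllib2.urlopen(req).read()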

Depending on how the page was originally structured, you'll need to do some URL rewriting when you extract data from the cached post-listing page. If the original page was coded with relative links, you'll need to prepend both the Google cache preamble and the site URL; if the links are absolute, you'll need to prepend just the Google cache prefix. Note that Google adds a <base> element in the <head> pointing to the original URL of the content, which means that in a browser, relative URLs will try to go to the site instead of the Google cache. But if you're scraping, you probably won't see that behavior, because things like BeautifulSoup don't interpret such tags, as far as I know.
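A rough sketch of that rewriting step, assuming BeautifulSoup 4 (the file name and ORIGINAL_PAGE are placeholders for wherever you saved the cached copy and which page it came from):

import urlparse                      # urllib.parse on Python 3
from bs4 import BeautifulSoup        # assuming BeautifulSoup 4

CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"
ORIGINAL_PAGE = "http://website.com/page/2/"   # the page this cached copy came from

soup = BeautifulSoup(open("page-002.html").read())
cached_links = []
for a in soup.find_all("a", href=True):
    # urljoin resolves relative links against the original site and leaves
    # absolute links alone, which covers both cases at once...
    absolute = urlparse.urljoin(ORIGINAL_PAGE, a["href"])
    # ...and then the cache prefix goes on the front either way.
    cached_links.append(CACHE_PREFIX + absolute)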

If Google does internally store multiple revisions of the same page, they don't make them available in any way, AFAIK, so what you see is what you get.
posted by Rhomboid at 10:47 AM on May 18, 2012


Are you sure that that's the only archive URL scheme that's available? For instance, if the site were published using tumblr, there would also be canonical monthly archives like "website.com/archive/2012/1"...this would at least simplify the issue with duplicate posts.
posted by bcwinters at 10:53 AM on May 18, 2012


Warrick's gotten a bit of an overhaul - it won't work with the Google cache, I think, but it can use Memento interfaces with a variety of archives, not just the Wayback Machine.

It's probably best to let Warrick do its job, then let your software work on local files.
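Once the pages are on disk, the parsing half is just a loop over local files - a rough sketch, where the directory, tag, and class names are placeholders you'd swap for the real markup:

import glob
from bs4 import BeautifulSoup        # assuming BeautifulSoup 4

posts = []
for path in glob.glob("recovered-site/*.html"):   # wherever Warrick saved things
    soup = BeautifulSoup(open(path).read())
    # These tag/class names are made up - inspect the actual markup first.
    for div in soup.find_all("div", class_="post"):
        title = div.find("h2")
        posts.append({
            "file": path,
            "title": title.get_text(strip=True) if title else None,
        })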
posted by Pronoiac at 12:18 AM on May 19, 2012

