Had a plan to scrape a website, but now it's down indefinitely. Google has the site cached, but this makes things kind of complicated. Newbie questions about scraping websites and using the Google cache inside.
I've read this question - should I be trying to use Warrick?
I'm starting a small side project that involves scraping a site and doing some analysis/visualization of the resulting dataset. I'm planning on using Python and Beautiful Soup to do the scraping - the site is laid out very consistently and looks like it'd be easy to scrape/parse and would make for a good learning exercise (I'm new to this, so apologies for any incorrect terminology).
Unfortunately, the timing of this idea coincided with the very recent (and permanent) shutdown of the site's servers. I still want to follow through with this, but the complication has thrown a bit of a wrench into the works. Google has the site cached, and I'd like to see if I can piece the data together from what's archived.
The site is basically structured like a blog with many many posts. The first page shows the 20 latest posts, with older posts pushed back to subsequent pages. The URL structure is static and follows this format:
If post 1 is the most recent post, at any given time, "website.com/page/2/" shows posts 21-40, "website.com/page/3/" shows posts 41-60, and so on. When a new post is submitted, it becomes post 1, and everything is pushed back one.
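The pagination scheme above can be sketched as a quick helper - here's a minimal example, assuming 20 posts per page as described and using "website.com" as a stand-in for the real domain:

```python
# Sketch of the pagination scheme: with 20 posts per page, post number n
# (where 1 = most recent) lives on page ceil(n / 20). "website.com" is a
# placeholder for the actual site.

POSTS_PER_PAGE = 20

def page_url_for_post(n, base="http://website.com"):
    """Return the listing-page URL that holds post number n."""
    page = (n - 1) // POSTS_PER_PAGE + 1
    # Page 1 is the site's front page; later pages use /page/N/.
    return base + "/" if page == 1 else f"{base}/page/{page}/"

print(page_url_for_post(21))  # first post on page 2
print(page_url_for_post(41))  # first post on page 3
```

Keep in mind this mapping only holds for a single crawl date - as new posts arrive, every post's page number shifts, which is exactly the overlap/gap problem described below.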
Posting frequency is probably ~30 posts/day. My problem is that not all of the pages were last crawled on the same date, which leaves some overlaps and some gaps in the data because the page content shifts between archive dates. I'm not concerned with overlaps or with having the most recent crawl, but in an ideal world I'd have a dataset that covers every post made over a span of ~1 year, with no gaps. I don't think this is possible, though.
1) Can I scrape the Google cache the same way I'd scrape the original site - i.e., use urllib and point the script at "http://webcache.googleusercontent.com/search?q=cache:http://website.com/page/2/" for each existing page?
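For what it's worth, here is roughly what that would look like with the standard library - a sketch, not a tested recipe, since Google may rate-limit, CAPTCHA, or outright refuse automated requests to the cache, and often rejects requests that lack a browser-like User-Agent:

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

def cache_url(original_url):
    """Build the Google cache lookup URL for an original page URL."""
    # Leave ':' and '/' unescaped so the embedded URL stays readable.
    return CACHE_PREFIX + quote(original_url, safe=":/")

def fetch_cached(original_url):
    """Fetch the cached copy of a page.

    NOTE: a browser-like User-Agent is an assumption on my part; Google
    may still block or rate-limit scripted access, so be gentle and add
    delays between requests.
    """
    req = Request(cache_url(original_url),
                  headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return resp.read()

print(cache_url("http://website.com/page/2/"))
```

Once you have the HTML bytes back, feeding them to Beautiful Soup should work the same as with a live page, though the cache wraps the page in a small banner you'd want to ignore when parsing.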
According to this, the Wayback Machine doesn't have nearly as much of the site archived as Google does.
2) Is there any way to access earlier cached versions of the same URL, or are they overwritten? My reasoning is that I could perhaps eliminate some of the gaps if I could scrape all versions of "website.com/page/3/" and just discard the duplicate entries (since the same posts would inevitably end up on page 4 or 5 in a later crawl).
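Even with only one cached version per URL, the "discard duplicates" step is straightforward if each post has some stable identifier. A minimal sketch - the "permalink" key here is hypothetical, so substitute whatever uniquely identifies a post on the actual site (permalink, title + date, etc.):

```python
# Deduplicate posts collected from overlapping cached pages. The same
# post can appear on multiple pages because content shifts between
# crawl dates; keying on a stable field collapses those repeats.
# The "permalink"/"title" fields are hypothetical placeholders.

def dedupe_posts(posts):
    """Keep the first occurrence of each post, keyed by permalink."""
    seen = set()
    unique = []
    for post in posts:
        key = post["permalink"]
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique

crawled = [
    {"permalink": "/posts/a", "title": "A"},
    {"permalink": "/posts/b", "title": "B"},
    {"permalink": "/posts/a", "title": "A"},  # same post, later page
]
print(len(dedupe_posts(crawled)))  # 2
```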
Like I said, I'm new to this whole area, so I'd also like to use this post as a sanity check - is anything I'm saying here wrong/impossible/etc.? Any other advice?
Thanks for your help!