I need to download google!
June 14, 2006 8:04 PM

How do I download Google's entire cache of a website that has 137,000 hits?

OK, so em411.com got redesigned, and em/admin (the site administrator) dumped the database. There were some damn good conversations on that site over the last 6 years, and there is no way I am going to remember what each and every one of them was about, so I need to find a way to get the entire cache of this website. Hope me please.

If I do a site:em411.com search, I get 137,000 hits, but only 1,000 of them are accessible via Google.

I'm thinking I could get this done via some combination of wget following only the cache links (how do I do that?) and varying the keyword in the search (not sure which ones are best to pick).

Oh, and the Internet Archive has basically nothing of this site.
posted by bigmusic to Computers & Internet (11 answers total)
 
What the crap? I used to visit em411 all the time; that's really lame.

Can't you do:

wget -r "http://www.google.com/search?hl=en&q=site%3Aem411.com&btnG=Google+Search"
posted by xmutex at 8:35 PM on June 14, 2006


Response by poster: I could do that, but I'm afraid wget would get caught in all the ads and not do what I want it to do.
posted by bigmusic at 8:42 PM on June 14, 2006


i did something similar to this a while ago. i used perl and LWP.

essentially, i believe the script went through each one of the 1000 cached results that google would list, and then made google requests for 'cached:(url)' for each link found in the cached results. i grepped these urls out of the response with simple regular expressions. i also maintained a simple database to prevent the crawler from collecting the same thing twice, i think i just md5 hashed each url i requested, and stored it. as i recall, this did a pretty good job in terms of coverage.

a word of warning though, go slow and steady with it -- i ended up getting myself "banned" from google for a few hours. :) oh, also set the user agent of whatever you use to something that looks like a legitimate browser.
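
in rough outline it looked something like this (reconstructed from memory and completely untested -- the regex and query parameters are guesses you'll have to adjust):

#!/usr/bin/perl
# rough sketch -- assumes LWP::UserAgent, Digest::MD5 and URI::Escape are
# installed; google's result markup and parameters will need adjusting.
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI::Escape qw(uri_escape);

# pretend to be a normal browser so google doesn't reject the requests
my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)');

my %seen;   # md5 of every url already fetched, so nothing is collected twice

# walk the (at most 1000) results google will show for site:em411.com
for (my $start = 0; $start < 1000; $start += 10) {
    my $results = $ua->get("http://www.google.com/search?q=site%3Aem411.com&num=10&start=$start");
    next unless $results->is_success;

    # grep the em411 urls out of the results page with a simple regex
    my @urls = $results->content =~ m{(http://(?:www\.)?em411\.com/[^"&<\s]+)}g;

    for my $url (@urls) {
        my $key = md5_hex($url);
        next if $seen{$key}++;

        # ask google for its cached copy of that page
        my $cached = $ua->get('http://www.google.com/search?q=' . uri_escape("cache:$url"));
        next unless $cached->is_success;

        open my $fh, '>', "$key.html" or die "can't write $key.html: $!";
        print {$fh} $cached->content;
        close $fh;

        sleep 5;   # slow and steady, or google bans you for a while
    }
}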
posted by (lambda (x) x) at 8:52 PM on June 14, 2006


er, that should be 'cache:(url)', or more specifically, for example:

http://www.google.com/search?q=cache%3Aask.metafilter.com
posted by (lambda (x) x) at 8:53 PM on June 14, 2006


Rather than archive it yourself, maybe you could rely on the Wayback Machine's copy.
posted by xulu at 8:57 PM on June 14, 2006


Response by poster: Xulu, as I pointed out in my original post, the Wayback Machine doesn't have all the pages.
posted by bigmusic at 9:09 PM on June 14, 2006


Note that the Wayback Machine is slow to publish the most recent snapshots: The latest version visible now is from April 2005, but snapshots taken between then and now will probably appear in the coming months.
posted by mbrubeck at 9:11 PM on June 14, 2006


Looking at the format of the page URLs, they are all of the form em411.com/forum/xxxxx/yyyy,

where xxxxx is the thread number and yyyy is the page number.

To get just the pages you want, one by one, search with this format:

site:em411.com/forum/xxxxx/yyyy

I didn't figure out the numbering rules or patterns, but you could just start at around xxxxx = 25000 and yyyy = 1 and go through the whole lot until the end.
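
If you can find someone to run a quick script, something along these lines might do it -- untested, borrowing the cache: trick from above, and the thread range, page cap and "no match" check are all guesses:

#!/usr/bin/perl
# untested sketch: walk the thread/page numbers and save Google's cached
# copy of each one; assumes LWP::UserAgent and URI::Escape are installed.
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw(uri_escape);

my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)');

for my $thread (25000 .. 26000) {      # adjust the range once the real bounds are known
    for my $page (1 .. 50) {           # give up on a thread after 50 pages
        my $url = "em411.com/forum/$thread/$page";
        my $res = $ua->get('http://www.google.com/search?q=' . uri_escape("cache:$url"));

        # stop paging this thread once Google has no cached copy of it
        last unless $res->is_success
                 && $res->content !~ /did not match any documents/;

        open my $fh, '>', "em411-$thread-$page.html"
            or die "can't write em411-$thread-$page.html: $!";
        print {$fh} $res->content;
        close $fh;

        sleep 5;                       # be gentle, or Google will block you for a while
    }
}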
posted by MetaMonkey at 9:39 PM on June 14, 2006


Sorry I missed that part.
posted by xulu at 9:44 PM on June 14, 2006


If you do get it to work, could you zip it and yousendit here as a followup?
posted by StickyCarpet at 7:16 AM on June 15, 2006


Response by poster: I don't know how to code anything, so I don't think I'd be able to script anything together. I was hoping that there was an app for this.
posted by bigmusic at 7:49 AM on June 15, 2006


This thread is closed to new comments.