I need to download google!
How do I download google's entire cache of a website that has 137,000 hits?

Ok, so em411.com got redesigned, and em/admin (site administrator) dumped the database. There were some damn good conversations that happened over the last 6 years on that site and there is no way that I am going to be able to remember what each and every one of them was about, so I need to find a way to get the entire cache of this website. Hope me please.

If I search for site:em411.com I get 137,000 hits, but only 1,000 of them are accessible via Google.

I'm thinking that I could get this done via some combination of wget following only the cache links (how do I do that?) and varying the keyword in the search (not sure which ones are best to pick).

Oh, and the internet archive has basically nothing of this site.
What the crap? I used to visit em411 all the time- that's really lame.

Can't you do:

wget -r "http://www.google.com/search?hl=en&q=site%3Aem411.com&btnG=Google+Search"
I could do that, but I'm afraid wget would get caught in all the ads and not do what I want it to do.
i did something similar to this a while ago. i used perl and LWP.

essentially, i believe the script went through each one of the 1000 cached results that google would list, and then made google requests for 'cached:(url)' for each link found in the cached results. i grepped these urls out of the response with simple regular expressions. i also maintained a simple database to prevent the crawler from collecting the same thing twice, i think i just md5 hashed each url i requested, and stored it. as i recall, this did a pretty good job in terms of coverage.

a word of warning though, go slow and steady with it -- i ended up getting myself "banned" from google for a few hours. :) oh, also set the user agent of whatever you use to something that looks like a legitimate browser.
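
For anyone who wants to try the approach described above, here is a minimal Python sketch of the same idea (the original was Perl and LWP, which is not reproduced here). It asks Google for 'cache:(url)', greps em411.com links out of the response with a regular expression, and keeps MD5 hashes of requested URLs so nothing is fetched twice. The search URL layout, the User-Agent string, the delay, and every function name are assumptions made for illustration; relative links and Google's actual page markup are not handled.

```python
import hashlib
import re
import time
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: any realistic browser User-Agent string would do here.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"
# Simple regular expression for absolute links that point back at em411.com.
LINK_RE = re.compile(r'href="(https?://(?:www\.)?em411\.com/[^"#]*)"')


def cache_query(url):
    """Build a search URL for 'cache:(url)'. The URL layout is an assumption."""
    return "http://www.google.com/search?q=" + quote("cache:" + url, safe="")


def url_key(url):
    """MD5 hex digest of a URL, used as the 'already requested' key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def extract_links(html):
    """Return every em411.com link found in a page."""
    return LINK_RE.findall(html)


def fetch(url):
    """Fetch a page while presenting a browser-like User-Agent."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed_urls, fetch_page=fetch, delay=10.0, limit=1000):
    """Breadth-first walk over cached pages, slow and steady."""
    seen = set()   # MD5 keys of URLs already requested
    pages = {}     # original URL -> cached HTML
    queue = list(seed_urls)
    while queue and len(pages) < limit:
        url = queue.pop(0)
        key = url_key(url)
        if key in seen:
            continue
        seen.add(key)
        try:
            html = fetch_page(cache_query(url))
        except OSError:  # covers urllib's URLError and HTTPError
            continue
        pages[url] = html
        queue.extend(extract_links(html))
        time.sleep(delay)  # pause between requests to avoid a temporary ban
    return pages
```

The seed list would be whatever URLs you can collect from the 1,000 visible results; `crawl(seeds)` then returns a dictionary you can write to disk. Passing a different `fetch_page` function makes it possible to try the logic without hitting Google at all.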
er, that should be 'cache:(url)', or more specifically, for example:

Rather than archive it yourself, maybe you could rely on the Wayback Machine's copy.
Xulu, as I pointed out in my original post, the Wayback Machine doesn't have all the pages.
Note that the Wayback Machine is slow to publish the most recent snapshots: The latest version visible now is from April 2005, but snapshots taken between then and now will probably appear in the coming months.
The page URLs all have the form em411.com/forum/xxxxx/yyyy

Where xxxxx is the thread and yyyy is the page number.

In order to get just the pages you want, one by one, search with this format,


I didn't figure out the numbering rules or patterns, but you could just start at x~25000 and y=1 and go through the whole lot until the end.
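
A rough Python sketch of that enumeration, in case it helps whoever scripts this: it just generates candidate page addresses in the em411.com/forum/xxxxx/yyyy form, thread by thread and page by page. The function name and the bounds passed to it are hypothetical, since the real numbering rules were not worked out; each generated address would still have to be looked up one at a time.

```python
def forum_urls(first_thread, last_thread, max_pages):
    """Yield candidate forum page addresses.

    All three bounds are guesses supplied by the caller; the real
    thread and page numbering is unknown.
    """
    for thread in range(first_thread, last_thread + 1):
        for page in range(1, max_pages + 1):
            yield f"em411.com/forum/{thread}/{page}"


if __name__ == "__main__":
    # Hypothetical bounds: two threads starting near 25000, three pages each.
    for address in forum_urls(25000, 25001, 3):
        print(address)
```

Most of the generated addresses will probably not exist, so expect a lot of misses and keep the request rate low.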
Sorry I missed that part.
If you do get it to work, could you zip it and yousendit here as a followup?
I don't know how to code anything, so I don't think I'd be able to script anything together. I was hoping that there was an app for this.
