<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: I need to download google!</title>
	<link>http://ask.metafilter.com/40203/I-need-to-download-google/</link>
	<description>Comments on Ask MetaFilter post I need to download google!</description>
	<pubDate>Wed, 14 Jun 2006 20:35:59 -0800</pubDate>
	<lastBuildDate>Wed, 14 Jun 2006 20:35:59 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: I need to download google!</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google</link>	
		<description>How do I download google&apos;s entire cache of a website that has 137,000 hits? &lt;br /&gt;&lt;br /&gt; Ok, so &lt;a href=&quot;http://www.em411.com&quot;&gt;em411.com&lt;/a&gt; got redesigned, and em/admin (site administrator) dumped the database. There were some damn good conversations that happened over the last 6 years on that site and there is no way that I am going to be able to remember what each and every one of them was about, so I need to find a way to get the entire cache of this website. Hope me please.&lt;br&gt;
&lt;br&gt;
If I do a &lt;a href=&quot;http://www.google.com/search?hl=en&amp;q=site%3Aem411.com&amp;btnG=Google+Search&quot;&gt;site:em411.com&lt;/a&gt; I get 137,000 hits, but only 1000 of them are accessible via google. &lt;br&gt;
&lt;br&gt;
I&apos;m thinking that I could get this done via some combination of wget following only the cache links (how do I do that?) and varying the keyword in the search (not sure which keywords are best to pick).&lt;br&gt;
&lt;br&gt;
Oh, and the internet &lt;a href=&quot;http://web.archive.org/web/*sa_/http://www.em411.com&quot;&gt;archive&lt;/a&gt; has basically nothing of this site.</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2006:site.40203</guid>
		<pubDate>Wed, 14 Jun 2006 20:04:04 -0800</pubDate>
		<dc:creator>bigmusic</dc:creator>
		
			<category>google</category>
		
			<category>spider</category>
		
			<category>wget</category>
		
			<category>howto</category>
		
			<category>help</category>
		
	</item> <item>
		<title>By: xmutex</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619545</link>	
		<description>What the crap? I used to visit em411 all the time- that&apos;s really lame.&lt;br&gt;
&lt;br&gt;
Can&apos;t you do:&lt;br&gt;
&lt;br&gt;
wget -r &apos;http://www.google.com/search?hl=en&amp;amp;q=site%3Aem411.com&amp;amp;btnG=Google+Search&apos;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619545</guid>
		<pubDate>Wed, 14 Jun 2006 20:35:59 -0800</pubDate>
		<dc:creator>xmutex</dc:creator>
	</item><item>
		<title>By: bigmusic</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619548</link>	
		<description>I could do that, but I&apos;m afraid wget would get caught in all the ads and not do what I want it to do.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619548</guid>
		<pubDate>Wed, 14 Jun 2006 20:42:14 -0800</pubDate>
		<dc:creator>bigmusic</dc:creator>
	</item><item>
		<title>By: (lambda (x) x)</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619557</link>	
		<description>i did something similar to this a while ago. i used perl and &lt;a href=&quot;http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP.pm&quot;&gt;LWP&lt;/a&gt;.&lt;br&gt;
&lt;br&gt;
essentially, i believe the script went through each one of the 1000 cached results that google would list, and then made google requests for &apos;cached:(url)&apos; for each link found in the cached results. i grepped these urls out of the response with simple regular expressions. i also maintained a simple database to prevent the crawler from collecting the same thing twice, i think i just md5 hashed each url i requested, and stored it. as i recall, this did a pretty good job in terms of coverage.&lt;br&gt;
&lt;br&gt;
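A minimal sketch of the bookkeeping described above, written in Python rather than the original Perl/LWP script, which is not shown in this thread. It only builds query URLs, pulls links out of a page, and tracks what has already been requested; nothing is fetched. The names `cache_query_url`, `extract_links` and `SeenUrls` are hypothetical, and the href pattern is an assumption about how the result pages are marked up.

```python
import hashlib
import re
from urllib.parse import quote

# Assumption: links appear as plain double-quoted href attributes.
HREF_RE = re.compile(r'href="(http[^"]+)"')


def cache_query_url(url):
    """Build a Google 'cache:' query URL for one page (format from the thread)."""
    return "http://www.google.com/search?q=" + quote("cache:" + url, safe="")


def extract_links(html, domain):
    """Return every absolute link in html that mentions the given domain."""
    return [link for link in HREF_RE.findall(html) if domain in link]


class SeenUrls:
    """Remember MD5 hashes of requested URLs so nothing is fetched twice."""

    def __init__(self):
        self._hashes = set()

    def add_if_new(self, url):
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True
```

&lt;br&gt;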
a word of warning though, go slow and steady with it --  i ended up getting myself &quot;banned&quot; from google for a few hours. :) oh, also set the user agent of whatever you use to something that looks like a legitimate browser.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619557</guid>
		<pubDate>Wed, 14 Jun 2006 20:52:26 -0800</pubDate>
		<dc:creator>(lambda (x) x)</dc:creator>
	</item><item>
		<title>By: (lambda (x) x)</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619558</link>	
		<description>er, that should be &apos;cache:(url)&apos;, or more specifically, for example:&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
http://www.google.com/search?q=cache%3Aask.metafilter.com</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619558</guid>
		<pubDate>Wed, 14 Jun 2006 20:53:31 -0800</pubDate>
		<dc:creator>(lambda (x) x)</dc:creator>
	</item><item>
		<title>By: xulu</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619564</link>	
		<description>Rather than archive it yourself, maybe you could rely on the &lt;a href=&quot;http://web.archive.org/web/*/http://www.em411.com/&quot;&gt;Wayback Machine&apos;s copy&lt;/a&gt;.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619564</guid>
		<pubDate>Wed, 14 Jun 2006 20:57:45 -0800</pubDate>
		<dc:creator>xulu</dc:creator>
	</item><item>
		<title>By: bigmusic</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619572</link>	
		<description>Xulu, as I pointed out in my original post, the Wayback Machine doesn&apos;t have all the pages.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619572</guid>
		<pubDate>Wed, 14 Jun 2006 21:09:39 -0800</pubDate>
		<dc:creator>bigmusic</dc:creator>
	</item><item>
		<title>By: mbrubeck</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619573</link>	
		<description>Note that the Wayback Machine is slow to publish its most recent snapshots: the latest version visible now is from April 2005, but snapshots taken since then will probably appear in the coming months.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619573</guid>
		<pubDate>Wed, 14 Jun 2006 21:11:41 -0800</pubDate>
		<dc:creator>mbrubeck</dc:creator>
	</item><item>
		<title>By: MetaMonkey</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619586</link>	
		<description>Looking at the format of the page URLs, they are all of the form em411.com/forum/xxxxx/yyyy&lt;br&gt;
&lt;br&gt;
Where xxxxx is the thread and yyyy is the page number.&lt;br&gt;
&lt;br&gt;
In order to get just the pages you want, one by one, search with this format,&lt;br&gt;
&lt;br&gt;
&lt;code&gt;site:em411.com/forum/xxxxx/yyyy&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
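A minimal Python sketch of generating those queries one by one. The thread and page ranges are placeholders, since the real numbering is unknown, and `site_query` and `site_queries` are hypothetical helper names. Only the query strings are built; nothing is sent to Google.

```python
from itertools import product


def site_query(thread, page):
    """Format one search query for a single forum page."""
    return "site:em411.com/forum/{}/{}".format(thread, page)


def site_queries(threads, pages):
    """Yield queries for every thread/page combination, thread by thread."""
    for thread, page in product(threads, pages):
        yield site_query(thread, page)


# Placeholder ranges: the real thread and page numbers are not known.
for query in site_queries(range(25000, 25002), range(1, 3)):
    print(query)
```

&lt;br&gt;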
I didn&apos;t figure out the numbering rules or patterns, but you could just start at x~25000 and y=1 and go through the whole lot until the end.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619586</guid>
		<pubDate>Wed, 14 Jun 2006 21:39:48 -0800</pubDate>
		<dc:creator>MetaMonkey</dc:creator>
	</item><item>
		<title>By: xulu</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619589</link>	
		<description>Sorry I missed that part.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619589</guid>
		<pubDate>Wed, 14 Jun 2006 21:44:27 -0800</pubDate>
		<dc:creator>xulu</dc:creator>
	</item><item>
		<title>By: StickyCarpet</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619791</link>	
		<description>If you do get it to work, could you zip it and yousendit here as a followup?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619791</guid>
		<pubDate>Thu, 15 Jun 2006 07:16:58 -0800</pubDate>
		<dc:creator>StickyCarpet</dc:creator>
	</item><item>
		<title>By: bigmusic</title>
		<link>http://ask.metafilter.com/40203/I-need-to-download-google#619829</link>	
		<description>I don&apos;t know how to code anything, so I don&apos;t think I&apos;d be able to script anything together. I was hoping that there was an app for this.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.40203-619829</guid>
		<pubDate>Thu, 15 Jun 2006 07:49:21 -0800</pubDate>
		<dc:creator>bigmusic</dc:creator>
	</item>
	</channel>
</rss>
