<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: Is there an easy script-way to download 15,000 pages of a website with incremental URLs?</title>
	<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs/</link>
	<description>Comments on Ask MetaFilter post Is there an easy script-way to download 15,000 pages of a website with incremental URLs?</description>
	<pubDate>Thu, 10 Jun 2004 13:11:19 -0800</pubDate>
	<lastBuildDate>Thu, 10 Jun 2004 13:11:19 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: Is there an easy script-way to download 15,000 pages of a website with incremental URLs?</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs</link>	
		<description>I need to mass download a chunk of a website.   It&apos;s about 15000 pages, with identical urls except for one portion with a sequential numerical indicator for each page.  I don&apos;t need to spider any links, I just need it to work through the list of pages.  I know there&apos;s got to be an easy script-y way of doing this, but you know, I&apos;m a lawyer.  Please help!</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2004:site.7859</guid>
		<pubDate>Thu, 10 Jun 2004 12:57:40 -0800</pubDate>
		<dc:creator>monju_bosatsu</dc:creator>
		
			<category>software</category>
		
			<category>download</category>
		
			<category>script</category>
		
			<category>wget</category>
		
			<category>curl</category>
		
	</item> <item>
		<title>By: mmcg</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154507</link>	
		<description>Try using wget. You can find a version for Windows &lt;a href=&quot;http://www.interlog.com/~tcharron/wgetwin.html&quot;&gt;here&lt;/a&gt;; Linux usually comes with it preinstalled, and the Mac port is on VersionTracker.&lt;br&gt;
&lt;br&gt;
wget works most painlessly if the site has a central HTML file with links to everything you want to download. If it does not, you could set the program to mirror the whole site and sort through the output when you&apos;re done.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154507</guid>
		<pubDate>Thu, 10 Jun 2004 13:11:19 -0800</pubDate>
		<dc:creator>mmcg</dc:creator>
	</item><item>
		<title>By: mnology</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154508</link>	
		<description>Use the fusk command in &lt;a href=&quot;http://urltoys.com&quot;&gt;URLToys&lt;/a&gt; to create your list of URLs to download. Then get.&lt;br&gt;
&lt;br&gt;
Example: fusk http://www.metafilter.com/mefi/[10000-20000] &lt;br&gt;
&lt;br&gt;
That would create a list of URLs for a chunk of MeFi threads.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154508</guid>
		<pubDate>Thu, 10 Jun 2004 13:12:22 -0800</pubDate>
		<dc:creator>mnology</dc:creator>
	</item><item>
		<title>By: skynxnex</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154512</link>	
		<description>&lt;a href=&quot;http://curl.haxx.se/&quot;&gt;Curl&lt;/a&gt;, which runs under Unix and Windows, supports this directly; run this on a command line of your choice (under Windows you may have to use double quotes or drop the quotes):&lt;br&gt;
&lt;br&gt;
  curl &apos;http://whatever.com/something/[00001-15000].html&apos; -o &apos;#1.html&apos;&lt;br&gt;
&lt;br&gt;
will grab every file from 00001.html to 15000.html; drop the leading zeros if the site&apos;s file names don&apos;t use them. You can also use a simple&lt;br&gt;
&lt;br&gt;
&lt;a href=&quot;http://curl.haxx.se/docs/manpage.html&quot;&gt;The curl man page&lt;/a&gt; has the documentation on this. Look near the beginning and then under the -o option.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154512</guid>
		<pubDate>Thu, 10 Jun 2004 13:14:26 -0800</pubDate>
		<dc:creator>skynxnex</dc:creator>
	</item><item>
		<title>By: nakedcodemonkey</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154529</link>	
		<description>Before downloading a whole site of that size, it would be nice if you talked to the webmaster first.  Many sites forbid this kind of activity in their TOS, because they have to pay the bandwidth bill for your scraping.  If it&apos;s a small operator, you could be putting the hurt on.  If your need is legitimate (and you&apos;re not doing this to sue them), they may be willing to help you get the data in a considerably more resource-effective manner (e.g. a *.zip of their files).  Shooting 15,000 requests at someone&apos;s server shouldn&apos;t normally be Plan A.  It is, at a minimum, rude.  And if they have decent security/throttling measures in place, there&apos;s a chance your IP will get banned before the scrape completes.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154529</guid>
		<pubDate>Thu, 10 Jun 2004 13:35:53 -0800</pubDate>
		<dc:creator>nakedcodemonkey</dc:creator>
	</item><item>
		<title>By: monju_bosatsu</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154531</link>	
		<description>It&apos;s Yahoo, so I&apos;m not sure they&apos;ll mind.  That&apos;s a good point, though.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154531</guid>
		<pubDate>Thu, 10 Jun 2004 13:41:54 -0800</pubDate>
		<dc:creator>monju_bosatsu</dc:creator>
	</item><item>
		<title>By: littlegreenlights</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154541</link>	
		<description>I used to use a program called Black Widow for this.  That was years ago, but &lt;a href=&quot;http://sbl.net/Frames.html?f1=Banner.html&amp;f2=BlackWidow/index.html&quot;&gt;this&lt;/a&gt; seems to be the place.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154541</guid>
		<pubDate>Thu, 10 Jun 2004 13:48:05 -0800</pubDate>
		<dc:creator>littlegreenlights</dc:creator>
	</item><item>
		<title>By: fvw</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154556</link>	
		<description>&lt;code&gt;for i in $(seq -w 1 15000); do wget http://foo/bar/$i.html; done&lt;/code&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154556</guid>
		<pubDate>Thu, 10 Jun 2004 14:14:47 -0800</pubDate>
		<dc:creator>fvw</dc:creator>
	</item><item>
		<title>By: monju_bosatsu</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154560</link>	
		<description>Got it to work on small test batches with fusker and wget.  Tried curl, but kept getting error messages.  Thanks all!</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154560</guid>
		<pubDate>Thu, 10 Jun 2004 14:18:54 -0800</pubDate>
		<dc:creator>monju_bosatsu</dc:creator>
	</item><item>
		<title>By: jeb</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154575</link>	
		<description>Yahoo has security or throttling measures in place.  If you download too much stuff from Yahoo, they will ban your IP for a while (a few days in my experience).   I&apos;m not sure what the limits are, but my IP got banned when I was making about 70 requests per minute, and not when I was making about 5.  I didn&apos;t try numbers in between.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154575</guid>
		<pubDate>Thu, 10 Jun 2004 14:58:32 -0800</pubDate>
		<dc:creator>jeb</dc:creator>
	</item><item>
		<title>By: mrbill</title>
		<link>http://ask.metafilter.com/7859/Is-there-an-easy-scriptway-to-download-15000-pages-of-a-website-with-incremental-URLs#154649</link>	
		<description>wget -np -m http://base.url.here.com</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2004:site.7859-154649</guid>
		<pubDate>Thu, 10 Jun 2004 19:25:05 -0800</pubDate>
		<dc:creator>mrbill</dc:creator>
	</item>
	</channel>
</rss>
