Comments on: Is there an easy script-way to download 15,000 pages of a website with incremental URLs?

Question: Is there an easy script-way to download 15,000 pages of a website with incremental URLs?

monju_bosatsu — Thu, 10 Jun 2004 12:57:40 -0800

I need to mass download a chunk of a website. It's about 15000 pages, with identical urls except for one portion with a sequential numerical indicator for each page. I don't need to spider any links, I just need it to work through the list of pages. I know there's got to be an easy script-y way of doing this, but you know, I'm a lawyer. Please help!

By: mmcg

mmcg — Thu, 10 Jun 2004 13:11:19 -0800

Try using wget. You can find a version for Windows here; Linux usually comes with it preinstalled, and the Mac port is on VersionTracker.

For wget to work as painlessly as possible, it would be best if the site contains some central HTML file with links to everything you want to download, but if it does not you could basically set the program to mirror the whole site and sort through the output when you're done.

By: mnology

mnology — Thu, 10 Jun 2004 13:12:22 -0800

Use the fusk command in URLToys to create your list of URLs to download. Then get.

Example: fusk http://www.metafilter.com/mefi/[10000-20000]

This would create a list of URLs for a chunk of MeFi threads.
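
If URLToys isn't available, the same kind of list can be built with a few lines of shell. This is only a sketch: gen_urls and urls.txt are made-up names, and it assumes the page number is the last part of the URL.

```shell
# gen_urls BASE FIRST LAST (hypothetical helper, not part of URLToys)
# Prints BASE followed by each number from FIRST to LAST, one URL per line.
gen_urls() {
  base=$1
  i=$2
  last=$3
  while [ "$i" -le "$last" ]; do
    printf '%s%s\n' "$base" "$i"
    i=$((i + 1))
  done
}

# Example use (commented out so nothing is downloaded by accident):
# gen_urls 'http://www.metafilter.com/mefi/' 10000 20000 > urls.txt
# wget -i urls.txt   # -i reads the URLs to fetch from a file
```

Writing the list to a file first makes it easy to check a few URLs in a browser before fetching thousands of them.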

By: skynxnex

skynxnex — Thu, 10 Jun 2004 13:14:26 -0800

Curl, which runs under Unix variants and Windows, supports this directly; run this on a command line of choice (you may have to drop the quotes under Windows):

curl 'http://whatever.com/something/[00001-15000].html' -o '#1.html'

will grab every file in the range from 00001.html to 15000.html; drop the leading zeros if your file names don't use them. You can also use a simple

The curl man page has the documentation on this. Look near the beginning and then under the -o option.

By: nakedcodemonkey

nakedcodemonkey — Thu, 10 Jun 2004 13:35:53 -0800

Before downloading a whole site of that size, it would be nice if you talked to the webmaster first. Many sites forbid this kind of activity in their TOS, because they have to pay the bandwidth bill for your scraping. If it's a small operator, you could be putting the hurt on. If your need is legitimate (and you're not doing this to sue them), they may be willing to help you get the data in a considerably more resource-effective manner (i.e. a *.zip of their files). Shooting 15,000 requests at someone's server shouldn't normally be Plan A. It is, at a minimum, rude. And if they have decent security/throttling measures in place, there's a chance your IP will get banned before the scrape completes.

By: monju_bosatsu

monju_bosatsu — Thu, 10 Jun 2004 13:41:54 -0800

It's Yahoo, so I'm not sure they'll mind. That's a good point, though.

By: littlegreenlights

littlegreenlights — Thu, 10 Jun 2004 13:48:05 -0800

I used to use a program called Black Widow for this. That was years ago, but this seems to be the place.

By: fvw

fvw — Thu, 10 Jun 2004 14:14:47 -0800

for i in $(seq -w 1 15000); do wget http://foo/bar/$i.html; done
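
For anyone reading along: seq -w pads every number to the same width, so the loop above asks for 00001.html through 15000.html. Where seq isn't installed, a counter plus printf can do the padding. A sketch only; pad5 is a made-up name and http://foo/bar/ is the placeholder from the loop above.

```shell
# pad5 N (hypothetical helper): print N zero-padded to five digits.
# Pass plain decimal numbers; a leading zero would be read as octal.
pad5() {
  printf '%05d\n' "$1"
}

# Same loop as above without seq (commented out; it would hit the network):
# i=1
# while [ "$i" -le 15000 ]; do
#   wget "http://foo/bar/$(pad5 "$i").html"
#   i=$((i + 1))
# done
```

If the real page names have no leading zeros, skip the padding and use the counter directly.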

By: monju_bosatsu

monju_bosatsu — Thu, 10 Jun 2004 14:18:54 -0800

Got it to work on small test batches with fusker and wget. Tried curl, but kept getting error messages. Thanks all!

By: jeb

jeb — Thu, 10 Jun 2004 14:58:32 -0800

Yahoo has security or throttling measures in place. If you download too much stuff from Yahoo they will ban your IP for a while (a few days in my experience). I'm not sure what the limits are, but my IP got banned when doing like 70 requests per minute, and not when doing like 5. I didn't try numbers in between.
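
Going by those numbers, staying near 5 requests per minute means pausing about 12 seconds between pages. A rough sketch of that arithmetic; delay_for_rate and urls.txt are made-up names, and the limit Yahoo actually enforces is unknown.

```shell
# delay_for_rate PER_MINUTE (hypothetical helper)
# Prints the whole number of seconds to wait between requests so that
# no more than PER_MINUTE requests go out per minute (rounds up).
delay_for_rate() {
  per_minute=$1
  echo $(( (60 + per_minute - 1) / per_minute ))
}

# Example use (commented out; it would hit the network):
# wget --wait="$(delay_for_rate 5)" -i urls.txt   # waits 12 seconds between fetches
```

At that pace 15,000 pages take roughly 50 hours, so it may be worth running the job in batches.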

By: mrbill

mrbill — Thu, 10 Jun 2004 19:25:05 -0800

wget -np -m http://base.url.here.com