Is there an easy script-way to download 15,000 pages of a website with incremental URLs?
June 10, 2004 12:57 PM
I need to mass download a chunk of a website. It's about 15,000 pages, with identical URLs except for one portion with a sequential numerical indicator for each page. I don't need to spider any links; I just need it to work through the list of pages. I know there's got to be an easy script-y way of doing this, but you know, I'm a lawyer. Please help!
Use the fusk command in URLToys to create your list of URLs to download. Then get.
Example: fusk http://www.metafilter.com/mefi/[10000-20000]
would create a list of URLs for a chunk of MeFi threads.
posted by mnology at 1:12 PM on June 10, 2004
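For anyone without URLToys handy, the same build-a-list-then-fetch workflow can be roughly approximated with standard shell tools. This is only a sketch, not URLToys itself, and assumes wget is installed; the MetaFilter range is just the example above:
# Build the list of sequential URLs (roughly what fusk produces), then
# hand the whole list to wget, which reads URLs from a file with -i.
for n in $(seq 10000 20000); do
  echo "http://www.metafilter.com/mefi/$n"
done > urls.txt
wget -i urls.txt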
Curl, which runs under Unix and Windows, supports this directly; run this on a command line of your choice (you may have to drop the quotes under Windows):
curl 'http://whatever.com/something/[00001-15000].html' -o '#1.html'
will grab every file in the range from 00001.html to 15000.html; drop the leading zeros if your file names don't use them. You can also use a simple …
The curl man page has the documentation on this; look near the beginning and then under the -o option.
posted by skynxnex at 1:14 PM on June 10, 2004
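Before firing off the full range, it may be worth confirming the pattern on a handful of pages first. A small, hedged test along the same lines (whatever.com/something is still just the placeholder from the comment above):
# Fetch only the first ten pages; curl substitutes each value of the
# [00001-00010] range for #1 in the output name, writing 00001.html
# through 00010.html into the current directory.
curl 'http://whatever.com/something/[00001-00010].html' -o '#1.html'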
Before downloading a whole site of that size, it would be nice if you talked to the webmaster first. Many sites forbid this kind of activity in their TOS, because they have to pay the bandwidth bill for your scraping. If it's a small operator, you could be putting the hurt on them. If your need is legitimate (and you're not doing this to sue them), they may be willing to help you get the data in a considerably more resource-efficient manner (e.g. a *.zip of their files). Shooting 15,000 requests at someone's server shouldn't normally be Plan A. It is, at a minimum, rude. And if they have decent security or throttling measures in place, there's a chance your IP will get banned before the scrape completes.
posted by nakedcodemonkey at 1:35 PM on June 10, 2004
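If the download does go ahead, the tools above can at least be told to pace themselves. A hedged sketch using wget's standard politeness options, where urls.txt is the hypothetical list file from the sketch further up:
# --wait pauses between successive downloads and --limit-rate caps the
# transfer speed, so 15,000 requests arrive as a trickle rather than a flood.
wget --wait=2 --limit-rate=50k -i urls.txt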
Response by poster: It's Yahoo, so I'm not sure they'll mind. That's a good point, though.
posted by monju_bosatsu at 1:41 PM on June 10, 2004
I used to use a program called Black Widow for this. That was years ago, but this seems to be the place.
posted by littlegreenlights at 1:48 PM on June 10, 2004
for i in $(seq -w 1 15000); do wget http://foo/bar/$i.html; done
posted by fvw at 2:14 PM on June 10, 2004
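One possible refinement, in case a 15,000-file run gets interrupted partway through: wget's -nc (no-clobber) flag skips files that already exist locally, so rerunning the same loop picks up where it stopped. The foo/bar URL is still the placeholder from the comment above:
for i in $(seq -w 1 15000); do
  wget -nc http://foo/bar/$i.html   # -nc: don't re-download files already saved
done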
Response by poster: Got it to work on small test batches with fusker and wget. Tried curl, but kept getting error messages. Thanks all!
posted by monju_bosatsu at 2:18 PM on June 10, 2004
Yahoo has security or throttling measures in place. If you download too much stuff from Yahoo they will ban your IP for a while (a few days, in my experience). I'm not sure what the limits are, but my IP got banned at around 70 requests per minute, and not at around 5; I didn't try numbers in between.
posted by jeb at 2:58 PM on June 10, 2004
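Staying well under that threshold is straightforward to script. A rough sketch that holds to roughly 5 requests per minute, the rate jeb reports as safe (the URL is a placeholder):
# One request every 12 seconds is about 5 per minute, far below the
# ~70/minute pace that reportedly triggered a ban.
for i in $(seq -w 1 15000); do
  wget http://example.com/pages/$i.html
  sleep 12
done
At that pace the full 15,000-page run takes roughly two days, so it may be worth splitting it across sessions.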
For wget to work as painlessly as possible, it would be best if the site contains some central HTML file with links to everything you want to download, but if it does not, you could basically set the program to mirror the whole site and sort through the output when you're done.
posted by mmcg at 1:11 PM on June 10, 2004
This thread is closed to new comments.