Is it possible to use wget, httrack, or something similar to download an archived directory at archive.org?
September 17, 2004 1:58 PM

I want to download an archived directory at Webarchive.org using an automated tool. I've tried wget, httrack, and various browsers that save web sites, but they download very little. Has anyone had any success with this?
posted by Mo Nickels to Computers & Internet (3 answers total)
 
Some of the directories listed in the index haven't been saved by webarchive.org. If that's not your problem, you're going to have to post more information (error messages, a link to the status log, etc.).
posted by fvw at 2:05 PM on September 17, 2004


There are no error messages and nothing useful in the wget or httrack transcript. All I'm getting is the root. I think it has something to do with the redirects that Webarchive.org does. httrack allows you to ignore robots.txt, so I don't think that's the problem (since, I believe, Webarchive ordinarily prevents spidering of its content).

The wget string I was using (one of many; the rest are gone with the close of the terminal) was:

wget -r -nc -x -l 5 -H -np http://www.....
posted by Mo Nickels at 3:50 PM on September 17, 2004


Ah yes; Webarchive.org is specifying a BASE HREF in the HEAD section of the HTML, which is probably overriding your spidering tool's idea of where relative URLs should be retrieved from (it does that for wget, anyway). Either find a tool that ignores this or that lets you override it, or, if you can't, you'll have to hack something together that does (Perl's LWP comes to mind).
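
To make the BASE HREF effect concrete, here is a rough Python sketch (not Perl LWP, and not a tested downloader) of a link resolver that can either honor or ignore a page's BASE HREF. The names LinkCollector and resolve_links, and the example.com / archive.example URLs, are made up for illustration and do not reflect the archive's real URL layout.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href/src values and remember any <base href> without applying it."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base_href = attrs["href"]
        elif tag in ("a", "link") and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("img", "script") and attrs.get("src"):
            self.links.append(attrs["src"])


def resolve_links(html, page_url, honor_base=False):
    """Return absolute URLs for every link found in html.

    With honor_base=False, relative links are resolved against the URL the
    page was fetched from, so the crawl stays on the host that served it.
    With honor_base=True, a <base href> in the page takes over, which is
    the behavior described above for wget.
    """
    parser = LinkCollector()
    parser.feed(html)
    base = page_url
    if honor_base and parser.base_href:
        base = urljoin(page_url, parser.base_href)
    return [urljoin(base, link) for link in parser.links]


if __name__ == "__main__":
    # Hypothetical page and URLs, for illustration only.
    page = "http://archive.example/snapshots/2004/www.example.com/dir/index.html"
    html = (
        '<html><head><base href="http://www.example.com/dir/"></head>'
        '<body><a href="page2.html">next</a></body></html>'
    )
    print(resolve_links(html, page, honor_base=True))
    print(resolve_links(html, page, honor_base=False))
```

With the base honored, the relative link points back at the original site rather than the archived copy, which would explain a spider fetching only the root. Ignoring it keeps the links under the archive's own URL, assuming the archived pages live at paths parallel to the one you fetched.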
posted by fvw at 4:56 PM on September 17, 2004

