I just want to save my web site locally
January 24, 2008 8:41 PM

I'm willing to pay for a really good tool to save a copy of a web site locally, downloading through HTTP. HTTrack isn't quite cutting it, and I need CSS background images.

I also used a free trial of WebCopy, and did a bit of research before downloading that. It does a lot, but (as far as I know, and it's been a few months) it doesn't detect and save background-image properties specified in a (separate) CSS file.

Can anyone recommend a really good web site downloader/copier that just works well and magically? I want it to recurse through the entire site, saving all images, CSS files, JS files, images referenced in CSS files, and make all site-internal links work locally, such that I can put the entire site onto a CD/DVD for reference.

This is for backing up my work, created in a dynamically-served CMS. Links tend to look like http://domainname.com/section/subsection/ (with or without the trailing slash, not necessarily a .html extension). I'd like the tool to be able to handle CGI query strings too, though.

Many thanks if this works. My budget isn't huge, I'm thinking $30-$40, but am interested in the state of the art.

And if Firefox has an add-on that does this, I'll eat my hat. I did look.
posted by amtho to Computers & Internet (18 answers total) 8 users marked this as a favorite
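The URL shapes described in the question (extensionless paths, with or without a trailing slash, plus CGI query strings) are the part that makes "links work locally" tricky, because each one needs a real file name on disk. A minimal sketch of one possible mapping, in Python. The naming scheme here is an assumption for illustration, not what HTTrack, WebCopy, or any other tool actually does, and the `cgi-bin/view.cgi` URL in the usage lines is hypothetical.

```python
import posixpath
from urllib.parse import quote, urlsplit


def local_path_for(url: str) -> str:
    """Map a page URL to a relative file path for an offline copy.

    Hypothetical naming scheme: extensionless pages become
    <path>/index.html, and a query string is folded into the file
    name so that distinct queries land in distinct files.
    """
    parts = urlsplit(url)
    path = parts.path.strip("/")  # "/section/subsection/" -> "section/subsection"
    last_segment = posixpath.basename(path)

    if "." not in last_segment:
        # Directory-style URL, with or without a trailing slash.
        path = posixpath.join(path, "index.html")

    if parts.query:
        # Percent-encode characters such as "=" and "&" for the file name.
        root, ext = posixpath.splitext(path)
        path = f"{root}@{quote(parts.query, safe='')}{ext}"

    return posixpath.join(parts.netloc, path)


# Both spellings of the same page map to the same local file.
print(local_path_for("http://domainname.com/section/subsection/"))
print(local_path_for("http://domainname.com/section/subsection"))
# Hypothetical CGI URL, to show the query-string handling.
print(local_path_for("http://domainname.com/cgi-bin/view.cgi?id=3&lang=en"))
```

A real copier also has to rewrite every link in the saved HTML to point at these paths, which this sketch does not attempt.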
I recommend Teleport Pro. $40, gets a 5-star editor's rating from download.com.
posted by blahtsk at 8:50 PM on January 24, 2008

GNU wget. Googling for wget gives me the GNU page and the Windows port of the same utility. Once installed, you'll want to do something like:

wget -r -k http://www.mysite.com

from the command line, then look in the directory www.mysite.com for your website's files.

Free software, you can pay me a $30 consultation fee if you like ;-) Unless there's something appalling about your website's construction (JavaScript links, mainly), this should work fine.
posted by singingfish at 8:53 PM on January 24, 2008

probably wget -r -k -p http://mysite.com
posted by singingfish at 8:55 PM on January 24, 2008
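For anyone scripting this rather than typing it, here is a small Python sketch that spells the short flags from the two commands above as wget's long option names. The function names `build_wget_command` and `run_wget` are made up for this example, and it assumes wget is installed and on the PATH.

```python
import subprocess
from typing import List


def build_wget_command(url: str) -> List[str]:
    """Return the wget invocation from the commands above, with long option names."""
    return [
        "wget",
        "--recursive",        # -r: follow links and fetch the pages they point to
        "--convert-links",    # -k: rewrite links in saved files so they work locally
        "--page-requisites",  # -p: also fetch images, CSS, etc. needed to display each page
        url,
    ]


def run_wget(url: str) -> None:
    """Run the command; raises CalledProcessError if wget exits non-zero."""
    # Assumption: wget is installed and reachable on PATH.
    subprocess.run(build_wget_command(url), check=True)
```

Calling `run_wget("http://mysite.com")` should leave the downloaded files in a directory named after the host, as described above.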

Last time I checked (which was a few years ago) wget didn't parse CSS files for background images.
posted by alan at 9:00 PM on January 24, 2008
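To make the gap concrete: a downloader that only reads HTML never sees the `url(...)` references inside a stylesheet, so someone has to pull them out and resolve them. A rough Python sketch of that step follows. It is a regex-based approximation, not how wget or any other tool implements it, and it ignores CSS comments and `@import "file.css"` rules written without `url()`. The function name and the sample stylesheet are invented for the example.

```python
import re
from typing import List
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def css_asset_urls(css_text: str, stylesheet_url: str) -> List[str]:
    """Return absolute URLs for every url(...) reference in a stylesheet.

    Relative references in CSS resolve against the stylesheet's own URL,
    not against the page that links to it.
    """
    found: List[str] = []
    for match in CSS_URL_RE.finditer(css_text):
        target = match.group(2).strip()
        if not target or target.startswith("data:"):
            continue  # skip empty references and inline data URIs
        absolute = urljoin(stylesheet_url, target)
        if absolute not in found:  # keep first occurrence, drop duplicates
            found.append(absolute)
    return found


# Made-up stylesheet contents for illustration.
sample_css = """
body { background-image: url(../images/bg.png); }
#nav { background: #fff url("img/nav.gif") no-repeat; }
.logo { background: url( '/static/logo.jpg' ); }
"""

for asset in css_asset_urls(sample_css, "http://example.com/css/site.css"):
    print(asset)
```

The resulting list is what would then need to be downloaded alongside the pages, with the `url(...)` values in the saved CSS rewritten to local paths.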

I actually have messed with wget a bit, but it's been a while. Truthfully, I'd just as soon be able to run this on Windows (I know I could probably get wget to run under Windows, but argh, I'm tired of configuring things at the moment). Don't get me wrong, I'm willing to be convinced, and I did configure wget years ago on my FreeBSD server, but I don't want to mess with it now unless it's really the only choice. Not because it's bad, but because I'm tired and in a hurry.

The whole reason I'm even looking at this stuff right now is that my once old reliable web hosting company seems to be infested with gremlins lately and I'm just tired of dealing with things that aren't as easy as they should be.

Will Teleport Pro, or some hitherto-unknown-to-me Windows GUI version of wget, for sure handle the referenced background-images in my referenced .css files?
posted by amtho at 9:01 PM on January 24, 2008

I've used Teleport Pro and it is very configurable to follow every link and reference on a site. I saved a guy's MT site when he borked the MySQL database beyond repair.

The trial version is fully functional, so you can see what you will get, but I think it only follows 10 links at a time until you pay.
posted by Argyle at 9:06 PM on January 24, 2008

wget is really simple to install - just drop wget.exe into C:\WINDOWS. Try this one.
posted by fleeba at 9:37 PM on January 24, 2008

I assume that you're lacking FTP access or something... do you have any way to turn on directory listing? Because then you could run HTTrack against the directory listings and you ought to get every file.
posted by XMLicious at 9:40 PM on January 24, 2008

wget -k -m -r http://example.com should get images too (the -m part mirrors).
posted by fleeba at 9:47 PM on January 24, 2008

XMLicious - I do have FTP access, but the HTML that a browser loads, for these sites, is generated by various server scripts - if I did an FTP download, I'd just get Perl code, not web pages.

Thanks, fleeba - if Teleport Pro doesn't work out, I may try to sift through the wget documentation to see if it can be configured to download CSS background-images automatically.

I'm trying out Teleport Pro; it looks promising, but I'm running into an issue where only one page is downloaded. Yes, I already tried (briefly) turning off my firewall, and increasing the time between downloads. I'll probably write their tech support people tomorrow; they seem pretty together, as far as I can tell. Now I have got to go to sleep (it's nearly 1 AM here).

Thanks for the responses so far! If anyone is particularly interested, the site I'm trying to make a local copy of, at the moment, is here: http://elfoundation.org/ (although I need this tool to work for other sites, too).
posted by amtho at 9:52 PM on January 24, 2008

If you have FTP access, can't you just download the site, then copy the HTTrack version on top of it? That ought to give you any missing files but flat HTML instead of perl code.
posted by XMLicious at 10:07 PM on January 24, 2008

XMLicious - I've done something like that before, but I'd prefer a solution that doesn't include me hand-editing a bunch of files to correct references to various linked files. As I said, this would be used for multiple sites in the future, and I'd like to be able to let it run, not worry about it, and not have to check each file looking for what's missing (and risk overlooking something).
posted by amtho at 10:33 PM on January 24, 2008

Have a look at WinHTTrack Website Copier - very nice, simple free tool.
posted by worldshift at 5:49 AM on January 25, 2008

Thanks, worldshift, but I actually have tried HTTrack/WinHTTrack (mentioned in original question, but no worries, I miss things like that all the time).
posted by amtho at 7:00 AM on January 25, 2008

There's a patch that lets wget parse CSS files, solving this very problem. Of course, it's just a patch so far (and the bug report saying they're "working on including it" is from the middle of last year!), so it doesn't fit your criteria of running on Windows without a lot of messing around to get it to work. But if nothing else is working for you... (If you do get it patched and compiled on Windows, could you send me a copy? Thanks!)
posted by anaelith at 7:21 AM on January 25, 2008

amtho, I believe you that it didn't work, of course, but it surprises me. If HTTrack isn't detecting the image paths in the CSS to rewrite or download them, I wouldn't expect it to alter them at all - I would think you could put all the files in the same relative place they're in on the server and things would work. You might end up with duplicates, but I would expect any problems due to files missed by HTTrack to be fixed.

So does this maybe mean that the problems are more extensive than you suspect?
posted by XMLicious at 2:05 PM on January 25, 2008

amtho, I was able to download all the background images with wget -m -r -k http://elfoundation.org/. I compiled 1.10.1 on my Mac - nothing extra is in it. The problem with that site is that it's inaccessible to js-disabled devices like wget (and search engines too), so grabbing the interior pages isn't possible unless you use direct URLs. However, none of the links on that page actually work, so if you just want to mirror the index page, then wget is fine (and free).
posted by fleeba at 5:33 PM on January 25, 2008

Just a note on this very old topic: my CSS parsing patches landed on Wget's trunk a few months ago, and will be included in the Wget 1.12 release. You should be able to grab the latest source, build it, and use it just fine.
posted by Ted Mielczarek at 8:58 AM on December 22, 2008
