Help me find a good website ripper/offline browser!
February 14, 2008 8:50 AM

What is the best offline web browser?

Ok... On occasion, I find that I want to grab all of the files of a specific filename pattern off a website, and ONLY those files. Just as an example, say a site that had multiple bit rates of mp3 files available for download with specific naming conventions for the formats.

So, if I wanted all the 192 kbps versions of each file, I would want to specify *192*.mp3 and ignore all other mp3 files.

I'm looking for an offline web browser/file downloader/etc that lets me specify a site, and an optional login/password if necessary, and the specific file type that I want, while letting me disregard all other file types.

So far, I've only tried HTTrack, which is wonderful and free, but it won't let me put in *192*.mp3 without also listing every other file format I want it to disregard, and that's not really useful to me. It's open source, too, so I could theoretically change it myself, but I can't get it to compile on Windows, nor can I find an easy walkthrough on how to do it. I'd hate to have to go through all the offline browsers/website rippers/etc. to find out which ones have the features I need, so I was hoping someone on here had experience with these kinds of apps.

This is for Windows, as well, so no Mac products please.
posted by antifuse to Computers & Internet (11 answers total)
 
wget -r -A "*192*.mp3" (-A is the accept list; -X only excludes directories)
posted by rhizome at 9:16 AM on February 14, 2008


If you're using Firefox, you might give DownThemAll a try.
posted by box at 9:22 AM on February 14, 2008


Best answer: The HTTrack FAQ has an example just like that:

+www.someweb.com/*blue*.jpg
posted by Zed_Lopez at 9:53 AM on February 14, 2008


A simple perl script + curl could easily filter out all the links you want, then download each of the linked files (with curl again).
posted by mphuie at 9:56 AM on February 14, 2008
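
The link-filtering step suggested above could look roughly like the sketch below, written in Python rather than Perl. It only picks out the links whose file name matches a glob pattern; the page content, the site address, and the helper name `matching_links` are all made up for illustration, and login handling is left out.

```python
from fnmatch import fnmatch
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def matching_links(html, base_url, pattern):
    """Return absolute URLs whose file name matches the glob pattern."""
    collector = LinkCollector()
    collector.feed(html)
    matches = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        # Match against the file name only, not the whole URL.
        filename = urlsplit(url).path.rsplit("/", 1)[-1]
        if fnmatch(filename, pattern):
            matches.append(url)
    return matches


# Hypothetical page content and site address, for illustration only.
page = """
<a href="/music/song_96.mp3">96 kbps</a>
<a href="/music/song_192.mp3">192 kbps</a>
<a href="/music/song_320.mp3">320 kbps</a>
"""
wanted = matching_links(page, "http://www.example.com/", "*192*.mp3")
print(wanted)
```

Each URL in `wanted` could then be handed to curl or to `urllib.request.urlretrieve` for the actual download.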


I imagine Teleport Pro could accomplish this.
posted by phaded at 10:46 AM on February 14, 2008


Best answer: This might be a dumb question, especially since you mention editing the source code, but are you sure that HTTrack won't do what you want?

After reading this page, it looks to me like setting a couple of scan rules, along the lines of 'file names with extension mp3' and 'file names containing 192,' would accomplish exactly what you want to do.
posted by box at 10:51 AM on February 14, 2008


Response by poster: The thing about HTTrack is, the scan rules aren't quite so smart. The inclusion rules are kind of dumb, actually, because unless you specifically exclude something (say, *96*.mp3), it will download it anyway. So really, the only way you can add rules that are useful is if you have specific things that you want excluded. I've tried just putting in "+*192*.mp3" as an inclusion rule, but then it still goes and downloads every other type of mp3 file as well.

I always forget about wget... maybe I'll see if I can find a good tutorial on the web for it somewhere.
posted by antifuse at 11:13 AM on February 14, 2008


Best answer: Although, looking at httrack's advanced rules pages, it looks like the order of rules might make a difference. So perhaps that was my problem? Maybe if I do "-*.mp3 +*192*.mp3" it'll work?
posted by antifuse at 11:16 AM on February 14, 2008
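
To see why rule order could matter, here is a toy model in Python. It assumes that rules are checked first to last and that the last matching rule decides, and that anything no rule matches is downloaded by default; both are assumptions based on the behaviour described in this thread, not a copy of HTTrack's real matcher. The function name `is_allowed` and the example URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(url, rules, default=True):
    """Apply +/- glob rules in order; the last rule that matches decides."""
    allowed = default  # assumption: unmatched URLs are downloaded
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            allowed = (sign == "+")
    return allowed


# Exclude every mp3 first, then re-include the 192 kbps ones.
rules = ["-*.mp3", "+*192*.mp3"]
print(is_allowed("www.example.com/music/song_192.mp3", rules))
print(is_allowed("www.example.com/music/song_96.mp3", rules))

# With the order reversed, the exclusion comes last and wins for every mp3.
reversed_rules = ["+*192*.mp3", "-*.mp3"]
print(is_allowed("www.example.com/music/song_192.mp3", reversed_rules))
```

Under these assumptions a lone "+*192*.mp3" changes nothing, because the other mp3 files are already allowed by default, which would match the behaviour described above.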


Best answer: Ugh. Ok, anybody looking for a dumbass? Yeah, that would be me - turns out, the reason WHY it kept downloading EVERYTHING even though I only wanted the *192*.mp3 files is because there was a "192" in the password for the site... so even though it wasn't technically part of the URL, because HTTrack uses the http://user:password@siteaddress.com/blah format to fetch links, it's not smart enough to exclude the user/pass from the scan rule matching.
posted by antifuse at 11:36 AM on February 14, 2008
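
The effect is easy to reproduce with plain glob matching. The sketch below is in Python and is not HTTrack's code; the credentials, file name, and helper name `strip_credentials` are made up, with "192" placed in the password on purpose. It shows the pattern matching the full URL and then failing to match once the user:password part is removed.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url):
    """Return the URL with any user:password@ prefix removed from the host part."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# Hypothetical credentials and file name; the password contains "192".
url = "http://user:pass192word@siteaddress.com/music/song_96.mp3"
pattern = "*192*.mp3"

# The 96 kbps file matches only because of the "192" in the password.
print(fnmatchcase(url, pattern))

# With the credentials stripped, the same file no longer matches.
print(fnmatchcase(strip_credentials(url), pattern))
```

A more specific rule that includes the host name or directory, so the wildcard cannot reach the credentials, would be another way around it.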


Response by poster: I marked all the HTTrack answers as best answer, because they helped me find out my dumbassedness. Thanks folks! :)
posted by antifuse at 12:22 PM on February 14, 2008


Awesome--glad you got it working.
posted by box at 6:07 PM on February 15, 2008

