It's just a page with links on it...why is it so complicated?
September 6, 2006 11:05 AM

What's the best way to save files from a website that requires authentication?

I'm trying to save a list of search results from an online database. Right now it's just a page with 400 links on it, and I want both the index file that I'm seeing as well as the pages that are linked (1 level.)

What is the best way to do this? I've tried using Acrobat, but I can't get past the authentication. The offline browsers I've tried returned permission-denied errors even though I've set up the security sections with my username and password.

Is there a macro that can achieve this?
posted by Sallysings to Computers & Internet (15 answers total)
 
Are you on Windows, OSX, or other?
posted by unixrat at 11:09 AM on September 6, 2006


You may need to change your useragent string in the offline browser.

Alternately, an in-browser solution like DownThemAll might be able to get all the content at the 400+ links.
posted by fake at 11:12 AM on September 6, 2006


You mention a "macro" -- do you need this to be automated, or do you just mean "a simple solution that I don't mind launching on an as-needed basis?"
posted by Doofus Magoo at 11:14 AM on September 6, 2006


Does it require authentication through an online form or through a popup username/password box?
posted by null terminated at 11:14 AM on September 6, 2006


What kind of authentication is it?

If it's HTTP BASIC authentication then you're in luck. Sites which use BASIC auth will pop up a Windows dialog box asking for a username and password when you visit them, as opposed to a login form on the actual page.

If it is HTTP BASIC, you can use wget.

If not, Perl scripts (or similar) are your best bet.
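For the HTTP BASIC case, here is a minimal Python sketch of what a tool like wget sends on your behalf: the username and password are joined with a colon, base64-encoded, and placed in an Authorization header. The function names are made up for illustration, and the request is only built, not sent.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP BASIC auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


def build_basic_auth_request(url: str, username: str, password: str):
    """Build (but do not send) a request that carries BASIC credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request
```

Passing the result to urllib.request.urlopen would perform the actual download. Note that base64 is an encoding, not encryption, so the password is effectively sent in the clear unless the site uses HTTPS.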
posted by matthewr at 11:21 AM on September 6, 2006


I'm on Windows XP, using Mozilla Firefox. DownThemAll didn't work; it LISTED the files, but when it tried to download them, each one came back as the page that asks for the username and password.

(I believe it's asp.)
posted by Sallysings at 11:30 AM on September 6, 2006


Oh, when I said "macro" I meant something that would go through the page, act like a PERSON and click through each link, save each page to a file somewhere on the computer, maybe rename the files so that they carry the description of the files (instead of a number), and then change all the links on the index page so they point to the saved copies.

I might be doing this once every month or so for a year and clicking 400 links and saving them isn't anybody's idea of fun...
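The renaming and link-fixing part of that wish list is easy to script once the pages have been downloaded. Below is a rough Python sketch, assuming you already have a list of (url, description) pairs from the index page and that the URLs are written exactly as they appear in the raw HTML. All function names are hypothetical.

```python
import re


def safe_filename(description: str, extension: str = ".html") -> str:
    """Turn a link description into a filename that is safe on Windows."""
    # Collapse anything other than letters, digits, dash, and underscore.
    name = re.sub(r"[^A-Za-z0-9_-]+", "_", description).strip("_")
    return (name or "untitled")[:80] + extension


def plan_local_names(links):
    """Map each remote URL to a unique local filename.

    `links` is a list of (url, description) pairs from the index page.
    """
    mapping = {}
    used = set()
    for url, description in links:
        if url in mapping:
            continue
        candidate = safe_filename(description)
        stem, ext = candidate.rsplit(".", 1)
        counter = 2
        # Append a counter when two links share the same description.
        while candidate.lower() in used:
            candidate = f"{stem}_{counter}.{ext}"
            counter += 1
        used.add(candidate.lower())
        mapping[url] = candidate
    return mapping


def rewrite_index(index_html: str, mapping) -> str:
    """Point each href in the saved index at its local copy."""
    def swap(match):
        quote, url = match.group(1), match.group(2)
        # Links that were not downloaded are left pointing where they were.
        return f"href={quote}{mapping.get(url, url)}{quote}"

    return re.sub(r"""href=(["'])(.*?)\1""", swap, index_html, flags=re.IGNORECASE)
```

Each downloaded page would be saved under mapping[url], and the rewritten index saved alongside them, so the whole set can be browsed offline.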
posted by Sallysings at 11:32 AM on September 6, 2006


WinHTTrack has a capture utility that can mirror stuff behind form based authorization.
posted by Mitheral at 11:56 AM on September 6, 2006


You could just view the raw HTML via View -> Source and copy and paste it into a Word document. The links are the ones in the whatever tags.
posted by wildster at 11:56 AM on September 6, 2006


Oops, I forgot to escape. The links are in the <a href="http://www.whatever.com">whatever</a> tags.
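Picking the links out of the page source by hand gets tedious at 400 links. As a sketch of the same idea in code, Python's standard html.parser module can collect every href together with its link text. The class and function names here are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html_source):
    """Return a list of (href, text) pairs found in the HTML source."""
    collector = LinkCollector()
    collector.feed(html_source)
    collector.close()
    return collector.links
```

Feeding it the saved source of the results page should give one pair per link, which is enough to drive a download loop.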
posted by wildster at 12:00 PM on September 6, 2006


Expanding on what matthewr said about wget: if it is HTTP BASIC, you can use wget to download the files.

Something like:

wget -np -m --http-user=YOUR_USER_NAME --http-password=YOUR_PASSWORD http://www.example.com/yourplace/

might do the trick.

If the site uses a custom login page, you can use the --load-cookies flag to load the cookies from Mozilla/Firefox, which should have your authorization info.

This would work as follows:

1) Log into the site with Mozilla or Firefox
2) Keep the window open so the cookie doesn't go away

Now do something like this:

wget -np -m --load-cookies "C:\Documents and Settings\Your User Name\Application Data\Mozilla\Firefox\Profiles\nb3gficj.default\cookies.txt" http://www.example.com/yourplace/

This will tell wget to use the cookies file from Firefox, which should allow the authentication to work.

You'll need to change the path to the cookies file depending on your setup.
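If wget isn't an option, the same cookie trick can be sketched in Python. This assumes the browser's cookies are available as a Netscape-format cookies.txt file, like the one in the path above. The function names are hypothetical, and nothing here has been tested against the actual site.

```python
import http.cookiejar
import urllib.request


def opener_from_cookies_txt(cookies_path):
    """Return (opener, jar) that replays cookies from a Netscape-format file."""
    jar = http.cookiejar.MozillaCookieJar(cookies_path)
    # Keep session cookies and expired entries so nothing is dropped silently.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, url):
    """Download one page using the cookie-aware opener."""
    with opener.open(url) as response:
        return response.read()
```

Calling fetch(opener, url) for each link would then send the stored cookies along with every request, which is roughly what --load-cookies does for wget.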

I hope this works; if it doesn't, let us know what the error is.
posted by mge at 12:04 PM on September 6, 2006


Tried WinHTTrack. Got an error "access denied."

(It's not HTTP BASIC.)

(tears out some hair)
posted by Sallysings at 12:45 PM on September 6, 2006


If it's not HTTP BASIC, then you're probably going to need to go the cookies route. Lots of programs can read your IE or Firefox (or whatever) cookie file, so that's a good place to start. You're also probably going to need to send the proper referer, which wget and similar tools will probably do for you.

If worst comes to worst, you can write short programs to do it in Perl, Tcl, etc.
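As a rough idea of what such a short program involves, here is a Python sketch (Python rather than Perl or Tcl, purely as an illustration). It builds the POST a login form would send, plus a page request with a browser-like user agent and an explicit Referer header. The form field names, the user-agent string, and the function names are all assumptions; the real field names have to be read from the login page's HTML.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Example browser-like identity string; some sites reject unknown clients.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def build_login_request(login_url, username, password,
                        user_field="username", pass_field="password"):
    """Build the POST that a login form would send.

    The field names are hypothetical; read the real ones from the form.
    """
    body = urllib.parse.urlencode({user_field: username, pass_field: password})
    return urllib.request.Request(
        login_url,
        data=body.encode("ascii"),
        headers={"User-Agent": USER_AGENT},
    )


def build_page_request(url, referer):
    """Build a GET for one result page, claiming to come from `referer`."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT, "Referer": referer}
    )


def new_session():
    """Return an opener that remembers cookies set by the login response."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
```

The intended flow would be: create one opener with new_session(), open the login request with it so the session cookie is stored, then open each page request with the same opener. Sites that add hidden form fields or tokens to the login form will need those included in the POST as well.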

Also, I've seen people do this as a javascript "bookmarklet" (well, stuff like this) and that seemed to work for them. I'm not that knowledgeable about javascript so I wouldn't know where to start.
posted by RustyBrooks at 1:03 PM on September 6, 2006


I wrote a program for another MeFite to do just this kind of thing in Perl not long ago. It's not super-difficult although it was fiddly as hell because of the vagaries of the website. Want to email me?
posted by AmbroseChapel at 4:50 PM on September 6, 2006


Sallysings writes "Tried winHttrack. Got an error 'access denied.'"

Did you get the access denied when you did the Capture URL or when you ran the project?

If the latter, there are two things to try when setting options:

1) Go to the Browser ID tab and select any of the MSIE selections as the browser identity.
2) Go to the Spider tab and set the spider drop-down to ignore robots.txt.
posted by Mitheral at 7:44 AM on September 7, 2006

