

Need to set up an automatic download because humans are forgetful and lazy.
June 17, 2010 11:50 AM

I'm looking to do an automated download from a website that requires a login and a click on a javascript button, and the usual tricks are failing me. Ideas?

We've got to pull a file overnight every weekday from a website where the data is posted. The site itself requires a login/password, and once you're in, you get a table where the most recent file is always the same element at the top.

This is an unusual one for us, as every other place we download data of this nature from provides an SFTP site to pull it from, but these guys don't. They know we're going to automate the process and they're fine with that, but they're also taking the "it's good enough for everyone else" position and aren't willing to help automate it.

In similar situations I've used wget, but it doesn't seem to be the tool for the job here. The login plus the javascript button are frustrating my attempts to use it. Anyone have any ideas? Something that can be run from a cronjob is ideal.
posted by barc0001 to Computers & Internet (13 answers total) 2 users marked this as a favorite
 
oh, here I was thinking an imacro from firefox. you're all unix-ey though.
posted by TomMelee at 12:08 PM on June 17, 2010


What's the javascript button do?
posted by soma lkzx at 12:29 PM on June 17, 2010


I'd recommend Watir. It's based on Ruby, but even if you've never used Ruby, the installation and examples are straightforward. It interacts with the web through a real browser, so anything you can do through a browser you can do via Watir.

It is admittedly overkill from a functionality perspective, but it's trivial to deal with cookie sessions, javascript redirects and other things that can throw wrinkles into wget, curl and mechanize solutions.
posted by forforf at 12:30 PM on June 17, 2010


Just because the button uses javascript doesn't mean it isn't just a regular HTTP request at the core, one that can be simulated with an appropriate curl/wget command line. Your job is to find the actual HTTP request. There are numerous tools to do this: Firebug, Tamper Data, Wireshark, etc.
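
Once you've found it, the whole thing usually boils down to a couple of commands. Rough sketch only (the URL, form field names, and download path here are placeholders; substitute whatever Firebug actually shows you):

# log in and stash the session cookie
curl -c cookies.txt -d 'username=USER&password=PASS' https://data.example.com/login

# replay the request the javascript button fires, reusing that cookie
curl -b cookies.txt -o latest.csv 'https://data.example.com/reports/export?day=today'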
posted by Rhomboid at 12:34 PM on June 17, 2010 [2 favorites]


Perhaps with a spare XP/7 machine (or a VM with a Windows guest) and AutoIt? Apparently it works with WINE as well. I've seen a forum post dealing with Java apps, and here is someone using it to log in to websites.
posted by dozo at 12:43 PM on June 17, 2010


Sikuli lets you write scripts based on mini-screenshots.
posted by Brent Parker at 12:44 PM on June 17, 2010


I've had success in the past with HtmlUnit, a Java 'browser' that executes JavaScript. If you prefer Ruby to Java, it's available (via JRuby) as Celerity.

As others have said, though, it's generally preferable to avoid this and go HTTP-only if that's in any way simpler.
posted by smcg at 1:10 PM on June 17, 2010


I'll give AutoIt a go, as it seems like the quickest way to get this up and running, and then do a more bulletproof version later if need be. Watir also looks intriguing, not only for this but for some other things we were thinking of doing as well. I tried digging out the actual request URL with Firebug for a while yesterday but have had no luck so far. Might give Wireshark a go in that regard. Thanks for the ideas everyone, this definitely gives me a few other options to try.
posted by barc0001 at 1:52 PM on June 17, 2010


2nding Rhomboid. wget can handle the POST requests and cookies associated with a login. Here's an example:

wget --post-data='hiddenfield=yes&username=cowbellemoo2&password=secret' \
--save-cookies=my-cookies.txt --keep-session-cookies \
http://example.com/login.php

wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt \
--keep-session-cookies http://example.com/file-needed.doc


The hard part is figuring out what to put in the POST data to log in successfully and where your destination file lives. Like Rhomboid suggested, Firebug and Wireshark can help you do that: they see through the AJAX (or whatever) to the underlying network requests and reveal the actual location of the file you want.
posted by cowbellemoo at 2:00 PM on June 17, 2010


Are the filenames at all consistent?

I've done similar things in the past with shell scripts and expect to automate FTP and telnet sessions.

You could basically telnet to port 80 of the web server and feed it the appropriate HTTP commands to get the file of the day.
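
Something like this, if the filenames are predictable (host, path, and cookie value are made up; nc stands in for telnet here since it's easier to script):

printf 'GET /reports/2010-06-17.csv HTTP/1.1\r\nHost: data.example.com\r\nCookie: SESSIONID=abc123\r\nConnection: close\r\n\r\n' | nc data.example.com 80

Bear in mind the response comes back with the HTTP headers still attached, so you'd have to strip those off yourself (wget and curl do that part for you).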
posted by jjb at 2:14 PM on June 17, 2010


Unfortunately Wireshark and Firebug haven't been able to help entirely. I think Wireshark's having trouble because the site is https, so it never sees inside the packet payloads, and Firebug's being a pain for whatever reason and not actually kicking back a URL. I do have all of the POST info, so I might give that a go anyway with wget, but I'll have to see.
posted by barc0001 at 2:32 PM on June 17, 2010


Try https://addons.mozilla.org/en-US/firefox/addon/3829/, a Firefox plugin that shows you what's going on with HTTP.
posted by forforf at 6:45 PM on June 17, 2010


Sorry, borked the link; here it is: LiveHTTP Headers
posted by forforf at 6:46 PM on June 17, 2010

