auto-name saved webpages
October 7, 2009 6:41 AM

Is there a way of saving a web-page from a web-browser, using a name from a source outside of the web-page?

I want to save several web-pages to a folder, and I would like a semi-automatic naming convention. Previously, I have been saving each web-page manually by hitting Ctrl-S and entering sequential numbers. I am analyzing these pages using a loop, so this convention makes it quite easy to loop through all the files in the folder (I am doing this analysis with a few VBA macros in Excel).

One mild complication is that several of these web-pages share the same default name, so saving under the prompted name and re-naming afterwards is not an option: later saves would overwrite earlier files.

I haven't been able to figure out how to go about doing this within Firefox on Windows XP, so any pointers would be greatly appreciated. Suggestions using alternative browsers/OSs are fine, since I also use Mac and Linux.
posted by a womble is an active kind of sloth to Computers & Internet (8 answers total)
 
You want the Unix command curl, which will fetch a page and save it to a local file.

curl -s -o /destination/path/file.html http://www.sourceURL.com/pageyouwant.html

Works great on a cron schedule. Options here.
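For the asker's sequential-numbering scheme, the command above could be wrapped in a small loop. A sketch only; the URLs and /destination/path below are placeholders, not anything from the thread:

```shell
#!/bin/sh
# Sketch: save each page under a sequential name (file001.html,
# file002.html, ...) so a later loop can walk the folder in order.
# The example.com URLs and /destination/path are placeholders.
i=1
for url in "http://www.example.com/page_a.html" \
           "http://www.example.com/page_b.html"; do
    name=$(printf "file%03d.html" "$i")
    curl -s -o "/destination/path/$name" "$url"
    i=$((i + 1))
done
```

The zero-padded names keep the files in the same order whether they are sorted numerically or alphabetically.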
posted by rokusan at 6:44 AM on October 7, 2009


Yeah, the Unix tools curl or wget should do what you need. There are Win32 ports of both.
posted by reptile at 6:45 AM on October 7, 2009


Unfortunately, I need to click some options on some of the web-pages to get them to display what I need. Not that I would know how to do it, but external JavaScript manipulation is apparently not allowed. The URL is some huge alphanumeric string, and moving to the next page keeps the same string in the browser.

I should probably clarify that what I am doing is legit. I am trying to get data from a site that I have a license for, but that makes it extremely difficult to get more than one result at a time without using their painfully slow user interface.
posted by a womble is an active kind of sloth at 6:52 AM on October 7, 2009


You can automate the interactions with any webpage using curl or wget, but obviously you might need to re-implement their JavaScript, which might prove painful. But it's simply not possible to disallow external JavaScript manipulation, since Safari, Firefox, etc. will run any JavaScript or extension you tell them to.

I once wrote this little script to grab the cookies from the front Safari window and save them for use with curl or wget. That way you can log in and establish a session manually (the hard part under curl), but then automate your downloads (the hard part in Safari). I'm sure there are similar Firefox extensions for exporting cookies too.

#!/bin/bash
# Print the front Safari document's cookies as "Set-Cookie:" header lines.
echo -n "Set-Cookie: "
# Ask Safari for document.cookie, then split its "; "-separated
# name=value pairs onto separate "Set-Cookie:" lines.
osascript <<_EOF_ | perl -pe "s/; /\\nSet-Cookie: /g"
tell application "Safari"
do JavaScript "document.cookie" in document 1
end tell
_EOF_
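To reuse those cookies, the script's output would have to be turned back into a single request header for curl. A sketch, assuming the script above is saved as safari-cookies.sh (that filename, and the URL, are my inventions):

```shell
#!/bin/sh
# Strip the "Set-Cookie: " prefixes, join the name=value pairs with
# ";", and send the result as one request "Cookie:" header.
# safari-cookies.sh and the URL are placeholders.
COOKIES=$(./safari-cookies.sh | sed 's/^Set-Cookie: //' | paste -s -d ';' -)
curl -s -H "Cookie: $COOKIES" -o /destination/path/file.html \
     "http://www.example.com/members-only.html"
```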
posted by jeffburdges at 7:29 AM on October 7, 2009


Cookies can make things hard, but not impossible. You might need to write an entire script to do what you need.

(That said, I never have been able to figure out how to automate the cookie-heavy pages from sandbox.com, so that I could scrape-and-save my fantasy football team's results. Even with sniffing and setting the cookies properly, it still fails to reach the login-required pages. Sigh.)
posted by rokusan at 8:34 AM on October 7, 2009


Sadly I think writing an entire script is beyond my current programming knowledge. I do appreciate the suggestions though.

Is there such a thing as a 'mouse emulator'? I was wondering if I could do this through something like Python controlling Firefox, where I save a page and then have the mouse click to advance through the results.
posted by a womble is an active kind of sloth at 8:56 AM on October 7, 2009


Look into the program AutoIt, a free little scripting language that lets you automate interacting with windows, controls, etc., including stuff like "click the button named THIS" or "click the right mouse button at X=125, Y=180 in this window". It's a last resort for me when I can't script something through other means. I've used it for things like automating installs of programs that ask a lot of questions and require a lot of interaction.

Whatever it is the web page you're using wants you to do, you can probably get around it with curl/wget plus some programming, but it could be rather complicated and tricky.
posted by RustyBrooks at 10:05 AM on October 7, 2009 [1 favorite]
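AutoIt is Windows-only; since the asker also uses Linux, the same last-resort idea could be sketched there with xdotool, which fakes keyboard and mouse input. This is my suggestion, not from the thread, and the window title and coordinates are placeholders:

```shell
#!/bin/sh
# Sketch of GUI automation on Linux with xdotool (an assumption --
# not mentioned in the thread). Saves the current Firefox page under
# a sequential name, then clicks wherever the site's "next" control sits.
i=$1                                    # sequence number passed in
win=$(xdotool search --name "Mozilla Firefox" | head -n 1)
xdotool windowactivate "$win"
xdotool key ctrl+s                      # open the Save Page As dialog
sleep 1
xdotool type "$(printf 'page%03d' "$i")"
xdotool key Return
sleep 2
xdotool mousemove 640 480 click 1       # placeholder coordinates of "next"
```

The sleeps are crude; in practice the delays would need tuning to the dialog and page-load times.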


Thanks! That is an amazing program; I think it should do the trick once I figure out how to save a web-page with it.
posted by a womble is an active kind of sloth at 10:14 AM on October 7, 2009

