auto-name saved webpages
October 7, 2009 6:41 AM

Is there a way to save a web-page from a web-browser using a name drawn from a source outside the web-page itself?

I want to save several web-pages to a folder, and I would like a semi-automatic naming convention. Until now, I have been manually saving each web-page by hitting ctrl-s and entering sequential numbers. I am analyzing these pages in a loop, so this convention makes it quite easy to loop through all the files in the folder (I am doing this analysis with a few VBA macros in Excel).

One mild complication is that several of these web-pages have the same default name, so saving them under the prompted name and renaming afterwards is not an option: identically named saves would overwrite one another.

I haven't been able to figure out how to do this within Firefox on Windows XP, so any pointers would be greatly appreciated. Suggestions using alternative browsers/OSes are fine, as I use Mac and Linux as well.
posted by a womble is an active kind of sloth to Computers & Internet (8 answers total)
You want the Unix command curl, which will get a page and save it to a local file.

curl -s -o /destination/path/file.html http://example.com/page

Works great on a cron schedule. The curl man page lists all the options.
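For the sequential numbering you mentioned, a minimal sketch along these lines should work (urls.txt and the pages folder are hypothetical placeholders; substitute your own URL list and destination):

#!/bin/sh
# Read URLs one per line from urls.txt and save each page under a
# sequential number, matching the numbering convention above.
n=1
mkdir -p pages
while read url; do
  curl -s -o "pages/$n.html" "$url"
  n=$((n + 1))
done < urls.txt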
posted by rokusan at 6:44 AM on October 7, 2009

Yeah, the Unix tools curl or wget should do what you need. There are Win32 ports of both.
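For example, the wget equivalent of the curl line above would be something like this (the URL is again a placeholder):

wget -q -O /destination/path/file.html http://example.com/page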
posted by reptile at 6:45 AM on October 7, 2009

Unfortunately I need to click some options on some of the web-pages to get them to display what I need. Not that I would know how to do it, but the site does not allow any external JavaScript manipulation. Also, the URL is some huge alphanumeric string, and it stays the same in the browser when I move to the next page of results.

I should probably clarify that what I am doing is legit. I am trying to get data from a site that I have a license for, but the site makes it extremely difficult to retrieve more than one result at a time without using their painfully slow user interface.
posted by a womble is an active kind of sloth at 6:52 AM on October 7, 2009

You can automate the interactions with any web-page using curl or wget, though you might need to re-implement its JavaScript, which might prove painful. But it's simply not possible to disallow external JavaScript manipulation, since Safari, Firefox, etc. will run any JavaScript or extension you tell them to.

I once wrote this little script to grab the cookies from the front Safari window and save them for use with curl or wget. That way you can log in and establish a session manually (the hard part with curl), and then automate your downloads (the hard part in Safari). I'm sure there are similar Firefox extensions for exporting cookies too.

echo -n "Set-Cookie: "
osascript <<_EOF_ | perl -pe "s/; /\\nSet-Cookie: /g"
tell application "Safari"
do JavaScript "document.cookie" in document 1
end tell
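To reuse the captured cookies, the output can be saved to a file and handed to curl, which accepts Set-Cookie-style header lines as a cookie file via -b (the script name and URL below are placeholders):

./safari-cookies.sh > cookies.txt
curl -s -b cookies.txt -o 1.html "http://example.com/results"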
posted by jeffburdges at 7:29 AM on October 7, 2009

Cookies can make things hard, but not impossible. You might need to write an entire script to do what you need.

(That said, I have never been able to figure out how to automate the cookie-heavy pages, so that I could scrape-and-save my fantasy football team's results. Even with sniffing and setting the cookies properly, it still fails to reach the login-required pages. Sigh.)
posted by rokusan at 8:34 AM on October 7, 2009

Sadly I think writing an entire script is beyond my current programming knowledge. I do appreciate the suggestions though.

Is there such a thing as a 'mouse emulator'? I was just wondering if I could do this through something like Python controlling Firefox, where I save a page and then have the mouse click through to the next set of results.
posted by a womble is an active kind of sloth at 8:56 AM on October 7, 2009

Look into the program AutoIt, which is a free little scripting language that lets you automate interacting with windows, controls, etc., including stuff like "click the button named THIS" or "click the right mouse button at X=125, Y=180 in this window". It's a last resort for me when I can't script something through other means. I've used it for stuff like automating installs of programs that ask a lot of questions and require a lot of interaction to get installed.

Whatever it is the web page you're using wants you to do, you can probably get around it with curl/wget plus some programming, but it could be rather complicated and tricky.
posted by RustyBrooks at 10:05 AM on October 7, 2009 [1 favorite]

Thanks! That is an amazing program. I think it should do the trick, once I figure out how to save a web-page with it.
posted by a womble is an active kind of sloth at 10:14 AM on October 7, 2009
