Automatically PDF a website?
March 29, 2006 11:31 AM

How can I automate printing a complete website to PDF?

I need to be able to show what the site looked like on any given date. The site is a sales tool and therefore requires user input on every page, so simply spidering the site is not adequate.

Because the site requires user input all over the place, I'd somehow need to supply the tool with data so it could be fully automated.

Are there tools out there that could do this?

Note: It doesn't necessarily have to be output to PDF, just something manageable and static.
posted by gfroese to Computers & Internet (8 answers total)
 
I believe this will do what you want: SnagIt - buy it here
posted by AuntLisa at 11:41 AM on March 29, 2006


Response by poster: SnagIt isn't exactly what I'm looking for.

The tool I need has to be able to navigate through many pages of a website (following a pre-determined path) and automatically PDF each page along the way. SnagIt only appears to handle the current page.
posted by gfroese at 12:05 PM on March 29, 2006


Enjoy! It even has a batch mode, so you don't have to use a weird command like find to get it going on multiple files. Dunno if it's available for the 'doze. :-D
posted by shepd at 12:48 PM on March 29, 2006


I do interactive grabbing like this occasionally (I use a Mac, if that makes any difference to you). I use QuicKeys, but others have told me that Keyboard Maestro (one-fifth the cost of QuicKeys) lets them do the same things.

Though I don't have a stake in either (or in Macs, even; Startly Technologies makes a Windows version of QuicKeys, too), I'd recommend QuicKeys over Keyboard Maestro. QuicKeys can deal with much more complex conditional events and instructions: for example, waiting for a web page to load certain text fields and a button named 'Submit', or a button that directs you to a page with certain strings of text in its URL, so that if you lose your internet connection or get redirected from the pages you're working with, you won't get reams of garbage PDF captures. It can also do amazing things with variables, and it interacts flawlessly with graphical interfaces, handling that "I can see it but I can't make a keyboard-macro application see it" sort of thing.
posted by Yeomans at 1:16 PM on March 29, 2006


If you're not opposed to "possibly hard", curl will do what you want. It can be configured to work with HTML forms (HTTP GET and POST) and the like; you may have to look at the source of each page to figure out what to pass to curl. It comes with most Linux/Unix distros and Mac OS X, and you can get it for Windows. It will output HTML files. (I only know it can do this kind of stuff, since a coworker is using it to automate a web-based application, but I dunno how, so I can't give you any examples.)

alternatively, you could use the curl support in PHP to do the same thing and feed it your pre-determined path as an XML document. you could do something similar with just a shell script or something too.
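To make the form-replay idea concrete, here's a minimal sketch in Python's standard library rather than curl or PHP; the URL and form field names are made-up placeholders you'd replace after reading each page's HTML source, as suggested above.

```python
import urllib.parse
import urllib.request

def form_request(url, fields):
    """Build a POST request that mimics submitting an HTML form,
    much like curl's -d/--data option."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url, data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"})

def capture(steps, out_prefix="page"):
    """Walk a pre-determined path of (url, form-fields) pairs and
    save each response as a static HTML file."""
    for i, (url, fields) in enumerate(steps):
        with urllib.request.urlopen(form_request(url, fields)) as resp:
            html = resp.read()
        with open("%s-%02d.html" % (out_prefix, i), "wb") as f:
            f.write(html)

# Hypothetical pre-determined path; the field names would come from
# inspecting each page's form markup.
path = [("http://example.com/search.cgi",
         {"q": "widgets", "region": "west"})]
```

Calling `capture(path)` would then leave a numbered static HTML file per step, which could later be converted to PDF in bulk.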
posted by mrg at 1:44 PM on March 29, 2006


You might also want to look at archive.org and their Wayback Machine. They keep periodic snapshots of many websites.
posted by blue_beetle at 1:54 PM on March 29, 2006


How many pages are involved? If the site involves user input, and the user input is free-form, then surely the possible pages are effectively infinite?

But assuming you're just selecting from menus and hitting buttons, this is the kind of thing which could easily be done with a Perl module called WWW::Mechanize.

The code would simply have to navigate to the site, fill in and submit the form, then save the resulting HTML; lather, rinse, repeat. I'm imagining you'd have a folder for each date and/or time, so if you had
domain.com/frontpage.html
domain.com/contact.html
domain.com/search.html
domain.com/searchresults.cgi
you'd end up with a bunch of files on your HD like
March-30/domain.com/frontpage.html
March-30/domain.com/contact.html
March-30/domain.com/search.html
March-30/domain.com/searchresults.html

April-1/domain.com/frontpage.html
April-1/domain.com/contact.html
April-1/domain.com/search.html
April-1/domain.com/searchresults.html
and so on. You could browse them just like any other HTML files, given a bit of tweaking.
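A small helper for the dated layout above, sketched in Python rather than WWW::Mechanize; the root directory and example URLs are placeholders, and only the folder-naming scheme comes from the example.

```python
import datetime
import os
import urllib.parse

def snapshot_path(root, url, when=None):
    """Map a URL to <root>/<Month-day>/<host>/<page>.html,
    matching the folder scheme described above."""
    when = when or datetime.date.today()
    parsed = urllib.parse.urlparse(url)
    name = os.path.basename(parsed.path) or "frontpage.html"
    # Dynamic pages (e.g. searchresults.cgi) are saved as the static
    # HTML they returned, hence the forced .html extension.
    base, _ = os.path.splitext(name)
    folder = "%s-%d" % (when.strftime("%B"), when.day)
    return os.path.join(root, folder, parsed.netloc, base + ".html")
```

So `snapshot_path("snaps", "http://domain.com/searchresults.cgi")` run on March 30 would give `snaps/March-30/domain.com/searchresults.html`, mirroring the listing above; the capture loop would just write each fetched page to that path.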
posted by AmbroseChapel at 2:07 PM on March 29, 2006


Response by poster: Badboy looks like it might do just what I need. Still playing with it.
posted by gfroese at 6:50 AM on March 31, 2006


This thread is closed to new comments.