Archiving a blog to PDF or ePUB
July 4, 2013 7:26 AM

I'd like to archive a recently-deceased friend's blog to PDF or ePUB in chronological order. It's a Blogger site, and has a couple of years of articles.

I'd like to be able to start here, and have each new article start on a new page, with the date clearly shown. It's not my blog, so I can't log in and export entries. I've gone through the previous AskMes on this, but most of them are old, and the link-rot has got to them.

I'm on Linux, but can use Mac. Free greatly preferred. I'm not afraid of the command line.
posted by scruss to Computers & Internet (7 answers total) 5 users marked this as a favorite
 
Using brute force, you could visit each page in Chrome and use the print dialog's save-to-PDF function.

Or, if you are brave, you could use wget from the command line with its recursive option to archive the entire site as HTML, then later convert each page to a format of your liking.
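An untested sketch of that wget invocation (the blog address is just a placeholder):

 wget --recursive --level=inf --page-requisites --html-extension \
      --convert-links --wait=1 http://theblog.blogspot.com/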

Calibre also has very good ePUB conversion facilities. I don't remember whether it can handle a whole site, but you could convert it page by page. It also has various command-line utilities that may help, possibly in conjunction with wget.
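For the page-by-page route, Calibre's command-line converter is ebook-convert; a minimal, untested sketch (filenames are placeholders):

 ebook-convert saved-post.html saved-post.epub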

Best of luck and my sympathies.
posted by beowulf573 at 8:23 AM on July 4, 2013


I guess what I would do is use wget or, apparently, xidel to get the list of URLs, then feed them to PhantomJS in a tiny shell script to render them as PDFs, and finally merge the results with a PDF merge tool like pdftk. I haven't stepped through it to make sure those exact recipes work for me, but each of those steps should be easy to google with similar terms for a quick-and-dirty solution that mainly involves the command line.

It is probably not hard to script the whole thing in PhantomJS, too, if you're comfortable with JavaScript. Actually, even with the first approach, you might want to use a little JavaScript/jQuery to inject or rewrite content. Here are some relevant code examples.
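Something along these lines might be a starting point (untested; it assumes pdftk and a copy of PhantomJS's bundled examples/rasterize.js script are on hand, and the blog address and URL pattern are placeholders):

 # grab candidate post URLs (Blogger posts usually look like /YYYY/MM/title.html)
 xidel http://theblog.blogspot.com/ -e '//a/@href' \
   | grep -E '/20[0-9]{2}/[0-9]{2}/.*\.html$' | sort -u > urls.txt

 # render each one to PDF with PhantomJS's example script
 n=0
 while read -r url; do
   n=$((n+1))
   phantomjs rasterize.js "$url" "post-$(printf '%03d' "$n").pdf"
 done < urls.txt

 # stitch them together (zero-padded names keep the glob in order)
 pdftk post-*.pdf cat output blog.pdf

One caveat: sort -u puts the URLs in year/month order, but within a month they sort alphabetically by title, so you may need to reorder urls.txt by hand for strict chronology.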
posted by Monsieur Caution at 8:28 AM on July 4, 2013 [2 favorites]


The plan I would follow under Mac OS X: SiteSucker to vacuum up all of the wanted pages, then import them into VoodooPad. They can be exported to PDF from there if needed.
posted by megatherium at 9:12 AM on July 4, 2013


It depends on how enthusiastic you are about doing manual copy-and-paste, but Readlists gives you a very nicely formatted epub, in my experience.
posted by rtha at 9:25 AM on July 4, 2013


Response by poster: I've already grabbed it with wget:
 wget --mirror -p --html-extension --convert-links -P .  --wait 1 URL
Looks like Blogger HTML is utterly vile: all <span>s and no paragraphs. I've a feeling that even a permissive parser like Beautiful Soup will choke.
posted by scruss at 6:51 PM on July 4, 2013


Here's one approach, using a Mac and the free application Paparazzi!. Not ideal, but maybe worth a shot if nothing else comes along:

1. Create a list of the page URLs. Two potential ways of doing that: (a) look at the page source in a text editor, find the list of URLs under the "Blog archive" heading on the left-hand side, copy the relevant HTML, do find-and-replace operations to strip everything but the URLs, and save them to a file; or (b) since the "Blog archive" list on the blog you linked only numbers 75 entries, just visit each page manually, copy the URL, and append it to a file. (The sketch after this list automates option a.)

2. Bring up each page in Paparazzi! to create and save the PDF. Some potential ways of doing that: (a) if you know shell scripting, write a script that loops over each line of the URL file and runs open -a '/Applications/Paparazzi!.app' on each URL in turn (use single quotes rather than double quotes so the shell doesn't treat the exclamation point as a history expansion); a minimal sketch follows this list. (b) If you know Automator, write a workflow that loops over the URL file and invokes Paparazzi!. (c) If neither of those is suitable, open each URL in Paparazzi! manually.
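A minimal sketch of 1(a) plus 2(a), assuming the post links follow Blogger's usual /YYYY/MM/title.html pattern and that Paparazzi! will accept a URL handed to it via open (both are assumptions I haven't tested):

 # pull the post URLs out of a saved copy of the front page
 grep -oE 'http://[^"]+/20[0-9]{2}/[0-9]{2}/[^"]+\.html' index.html \
   | sort -u > urls.txt

 # hand each one to Paparazzi! in turn; you may still need to trigger
 # the actual capture/save inside the app
 while read -r url; do
   open -a '/Applications/Paparazzi!.app' "$url"
   sleep 15   # give it time to load before the next page
 done < urls.txt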

If you end up doing 1(b) and 2(c), you can short-circuit things: skip saving the URLs to a file and instead use Paparazzi!'s "Capture URL From ..." menu option to grab the page currently visible in your browser, rather than copy-pasting. You'll have to switch back and forth between the browser and the app, or write an Automator sequence, or (what I do) define a shortcut for the Paparazzi! action with something like QuicKeys or Keyboard Maestro so you can invoke it directly while viewing a page in your browser.

One thing to note, which you may or may not like: Paparazzi! produces a single-page PDF of the whole page. That's nice for archival purposes because it captures exactly what the page looks like, without page breaks; it's almost unique to Paparazzi! and the reason I suggested it. The downside is that single-page PDFs are harder to print. Presumably there's software that will slice the PDF into printable pages, but I haven't found it yet (though I also haven't looked hard). On the other hand, if all you ever plan to do is view the pages in a PDF viewer, zooming and scrolling is only mildly annoying, so if printing isn't an issue, it's not a problem.
posted by StrawberryPie at 9:39 AM on July 5, 2013


By strange coincidence, I just found wkhtmltopdf. Haven't tried it, but it sounds like a potential command-line alternative to Paparazzi!.
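If it pans out, usage looks about as simple as it gets; an untested sketch (URL and filename are placeholders), which could be wrapped in the same sort of loop as above to batch a file of URLs:

 wkhtmltopdf http://theblog.blogspot.com/2013/01/some-post.html some-post.pdf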
posted by StrawberryPie at 5:57 PM on July 5, 2013

