Archiving a blog to PDF or ePUB
July 4, 2013 7:26 AM
I'd like to archive a recently-deceased friend's blog to PDF or ePUB in chronological order. It's a Blogger site, and has a couple of years of articles.
I'd like to be able to start here, and have each new article start on a new page, with the date clearly shown. It's not my blog, so I can't log in and export entries. I've gone through the previous AskMes on this, but most of them are old, and the link-rot has got to them.
I'm on Linux, but can use Mac. Free greatly preferred. I'm not afraid of the command line.
I guess what I would do is use wget or, apparently, xidel to get the list of URLs. Then feed them to PhantomJS in a tiny shell script to render them as PDFs. Then use some PDF merge tool like pdftk. I haven't stepped through it to ensure those exact recipes work for me, but each of those steps should be easy to google up using similar terms to get a quick and dirty solution that mainly involves the command line.
It is probably not hard to script the whole thing in PhantomJS too, if you're comfortable with JavaScript. Actually, even using the first solution, you might want to use a little JavaScript/jQuery to inject or re-write content. Here are some relevant code examples.
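For concreteness, here's a rough sketch of that pipeline in shell. It assumes you've already collected the post URLs, one per line, in a file called urls.txt, and that PhantomJS's bundled examples/rasterize.js script is sitting in the current directory; the output file names are made up for illustration.

    #!/bin/sh
    # Render each post URL to its own PDF with PhantomJS's bundled rasterize.js,
    # numbering the files so they merge in the same order as urls.txt.
    n=0
    while read -r url; do
        n=$((n + 1))
        phantomjs rasterize.js "$url" "$(printf 'post-%04d.pdf' "$n")" A4
    done < urls.txt

    # Stitch the per-post PDFs together, in order, into one document.
    pdftk post-*.pdf cat output blog-archive.pdf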
posted by Monsieur Caution at 8:28 AM on July 4, 2013 [2 favorites]
The plan I would follow under Mac OS X: SiteSucker to vacuum up all of the wanted pages, then import them into VoodooPad. They can be exported to PDF from there if needed.
posted by megatherium at 9:12 AM on July 4, 2013
It depends on how enthusiastic you are about doing manual copy-and-paste, but Readlists gives you a very nicely formatted epub, in my experience.
posted by rtha at 9:25 AM on July 4, 2013
Response by poster: I've already grabbed it with wget:
    wget --mirror -p --html-extension --convert-links -P . --wait 1 URL
Looks like Blogger HTML is utterly vile: all <span>s and no paragraphs. I've a feeling that even a permissive parser like Beautiful Soup will choke.
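If it helps, the mirror itself can give you a chronological list of posts without having to parse that markup at all: Blogger permalinks normally follow a /YYYY/MM/title.html layout, so a plain path sort is already in date order. A sketch, with the mirror directory name made up:

    # Pull the per-post pages out of the wget mirror and sort them chronologically
    # (posts within the same month will still sort by title, so nudge those by hand).
    find blog.example.com -path '*/2???/??/*.html' | sort > posts.txt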
posted by scruss at 6:51 PM on July 4, 2013
Here's one approach, using a Mac and the free application Paparazzi!. Not ideal, but maybe if nothing else comes along:
1. Create a list of the page URLs. Two potential ways of doing that:
(a) Look at the page source in a text editor, find the list of URLs near the "Blog archive" heading on the left-hand side, copy the relevant HTML, do find-and-replace operations to strip everything but the URLs, and save them to a file.
(b) Looking at the blog page you linked, the list of pages under "Blog archive" only numbers 75, so you could simply visit each page manually, copy its URL, and append it to a file.
2. Bring up each page in Paparazzi! to create the PDF and save it. Some potential ways of doing that:
(a) If you know shell scripting, write a script that loops over each line of the URL file and runs open -a '/Applications/Paparazzi!.app' on each URL in turn (use single quotes rather than double quotes so the exclamation point isn't treated as history expansion if you paste it into an interactive shell); a sketch follows after this list.
(b) If you know how to use Automator, write a workflow that loops over the contents of the URL file and invokes Paparazzi!.
(c) If neither of those options is suitable, you could manually open each URL in Paparazzi!.
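For what it's worth, the loop in 2(a) might look roughly like this; urls.txt is the file from step 1, and the application path and the pause between captures are guesses.

    #!/bin/sh
    # Hand each saved URL to Paparazzi! via open(1); the single quotes keep the
    # exclamation point from being treated as history expansion in an interactive shell.
    while read -r url; do
        open -a '/Applications/Paparazzi!.app' "$url"
        sleep 5   # give Paparazzi! a moment to load and capture before the next page
    done < urls.txt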
If you end up doing 1(b) and 2(c), you could short-circuit things: skip saving the URLs to a file and instead use Paparazzi!'s "Capture URL From ..." menu option to grab the page currently visible in the browser, rather than copy-pasting. You'll have to switch back and forth between the browser and the app, or write an Automator sequence, or (what I do) define a shortcut for the Paparazzi! action using something like QuicKeys or Keyboard Maestro so you can invoke it directly while viewing a page in your browser.
One thing to note which you may or may not like: Paparazzi! produces a single-page PDF of the whole page. That's nice for historical archiving purposes because it saves exactly what the page looks like, without page breaks. This is almost unique to Paparazzi!, and the reason I suggested this program. But there's a downside to single-page PDFs: printing is more challenging. Presumably there's some software that will allow you to "slice up" the PDF in some way, but I haven't found one yet (though I also haven't looked hard). OTOH, if all you ever plan to do is view the pages in a PDF viewer, it's not too hard to zoom and scroll, though sometimes a bit annoying, so if printing is not an issue, then it's not a problem.
posted by StrawberryPie at 9:39 AM on July 5, 2013
By strange coincidence, I just found wkhtmltopdf. Haven't tried it, but it sounds like a potential command-line alternative to Paparazzi!.
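Its basic invocation is just a URL and an output file, so, assuming the same urls.txt list of post URLs as above, a loop along these lines should work (file names again made up):

    # Render each post with wkhtmltopdf, then merge the results as before.
    n=0
    while read -r url; do
        n=$((n + 1))
        wkhtmltopdf "$url" "$(printf 'post-%04d.pdf' "$n")"
    done < urls.txt
    pdftk post-*.pdf cat output blog-archive.pdf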
posted by StrawberryPie at 5:57 PM on July 5, 2013
Or, if you're brave, you could use wget from the command line with the recursive option to archive the entire site as HTML, then later try to convert each page to a format of your liking.
Calibre also has very good ePUB conversion facilities; I don't remember whether it can handle a whole site, but you could convert it page by page. It also comes with various command-line utilities that may help, possibly in conjunction with wget.
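For reference, a minimal sketch of that route, using wget for the mirror and Calibre's ebook-convert command-line tool for the conversion; the blog address and file names are placeholders.

    # Mirror the site, then convert one saved page to ePUB with Calibre's CLI.
    wget --recursive --level=inf --page-requisites --convert-links --wait 1 http://blog.example.com/
    ebook-convert blog.example.com/2013/07/some-post.html some-post.epub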
Best of luck and my sympathies.
posted by beowulf573 at 8:23 AM on July 4, 2013