Turning HTML into a book?
December 20, 2006 7:26 PM

I'm trying to turn someone's blog into annual books for them as a gift (not Christmas). What's the best way to do this?

The book printer wants a PDF. I've already captured the HTML for the blog into files, and have written a perl script to parse the HTML into a basic structured text file that identifies the title, date, and other blog metadata in a standardized way.

My specific question: how can I import this text (with some HTML content in the bodies) into a word processor or other page layout program so that (a) the (simple) HTML formatting inside the blog content is preserved, and (b) styles are automatically assigned to the title, date and such so that I don't have to manually do it?

  • I'm flexible about the layout software, but if it's not MS Word 2003 or Publisher, it needs to be free (and able to handle 200-300 page manuscripts in a single file).
  • The structure of the file to import is flexible, since I'm creating it... for example, creating XML would not be difficult.
  • I'm aware there are numerous HTML -> PDF options, but I really want to use a WYSIWYG style layout program that supports TOC creation, page numbering, and so on.
  • The work to automate can't be too elaborate, since this is a one-off (there are about 600 blog entries spread out over three books).
Any suggestions?
posted by reborndata to Computers & Internet (15 answers total) 1 user marked this as a favorite
Word will open HTML files, why not just do that, then save as a .doc and work on it in Word?

You'll probably have to be a bit wary of its interpretation of HTML (you might have to tweak your perl script and re-export a couple of times), but it should work.
posted by AmbroseChapel at 8:12 PM on December 20, 2006

latex!! it's perfect - assuming it's just text, and not heavy with images or other stuff. it's not wysiwyg, in that you don't have precise control over the style. it works by 'compiling' tagged text into beautifully typeset documents. table of contents and sections and stuff are trivial. it is built for automation and large documents. I really think it's exactly what you need. many free/open source tools exist. you could probably do the whole thing with one script.

see here, here, and other stuff here
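To give a feel for the "one script" idea, here is a minimal sketch in Python (not the poster's Perl). It assumes each blog entry has already been parsed into a dict with hypothetical `title`, `date`, and `paragraphs` fields, and that the bodies are plain text; any HTML inside the posts would need separate handling.

```python
# Characters that have special meaning in LaTeX and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    # Escape character by character so replacements are never escaped twice.
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def entries_to_latex(entries):
    # 'entries' is a list of dicts with hypothetical keys:
    # 'title', 'date', 'paragraphs' (a list of plain-text strings).
    lines = [
        r"\documentclass{book}",
        r"\begin{document}",
        r"\tableofcontents",
    ]
    for entry in entries:
        # One chapter per blog entry, so each shows up in the table of contents.
        lines.append(r"\chapter{" + escape_latex(entry["title"]) + "}")
        lines.append(r"\textit{" + escape_latex(entry["date"]) + "}")
        lines.append("")
        for paragraph in entry["paragraphs"]:
            lines.append(escape_latex(paragraph))
            lines.append("")  # a blank line ends a paragraph in LaTeX
    lines.append(r"\end{document}")
    return "\n".join(lines)
```

Writing the result to a `.tex` file and running it through a LaTeX engine twice should fill in the table of contents and page numbers.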
posted by PercussivePaul at 8:43 PM on December 20, 2006

Best answer: You could use one of the free DocBook convertors to convert your HTML to DocBook, and then one of the free DocBook publishers to convert your DocBook to PDF.
posted by scottreynen at 8:52 PM on December 20, 2006

You seem like a programmer type - if so, you may find FPDF to be useful for this purpose. It's a minimalist PHP library for writing PDFs that contains very intuitive handling of margins, line breaks, page breaks, and standard headers & footers (the things I originally assumed would be very complicated when I first experimented with PDF generation). CSS-like text styles would be handled by writing methods that set font characteristics and so on.
posted by migurski at 10:08 PM on December 20, 2006

Best answer: Not exactly what you're looking for, but Blurb has a tool that claims to do all this automagically (but is presumably locked into their book-making service).
posted by stavrosthewonderchicken at 10:12 PM on December 20, 2006

Best answer: how about openoffice? i know it can export straight to PDF and it certainly fits your free criteria.
posted by moochoo at 1:02 AM on December 21, 2006

posted by IronLizard at 1:05 AM on December 21, 2006

You might like this option... get the new Internet Explorer 7, then click on the pull-down menu "Page" and select "Edit with Microsoft Word". It's magic! It loads the whole thing into Word, preserving most of the formatting and images. For the blog I tried, it was as good as I could hope for.
posted by cabb-chase at 6:46 AM on December 21, 2006

I can't tell if the advice above is meant to be a joke or not.

If so, stop. If not, I'm sorry, but please lurk more, cabb-chase. That's fucking cretinous.
posted by stavrosthewonderchicken at 6:56 AM on December 21, 2006

Response by poster: These are some great suggestions!

The Blurb blog book software looks interesting. The blog currently isn't in any of the formats they support, but I could generate WordPress import format easily, so I could set up a temporary WordPress version of the blog and then import into Blurb. I was going to use Lulu to publish, but if the Blurb template is attractive, this is a good option.

The other really helpful suggestion was DocBook... I wasn't aware of it before, but I'm pretty sure I could easily generate valid DocBook markup in my perl script, and then use one of the several DocBook -> OpenOffice import tools and retain the style information. The question is whether the HTML in the blog content would come through properly...
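A rough sketch of what generating that markup could look like, written in Python rather than Perl. The entry fields (`title`, `date`, `paragraphs`) are hypothetical, the date is carried in a `para` with `role="date"` as one possible convention, and the bodies are treated as plain text, so this does not answer the open question about embedded HTML.

```python
import xml.etree.ElementTree as ET


def entries_to_docbook(book_title, entries):
    # Build <book><title/><chapter>...</chapter>...</book>.
    # 'entries' is a list of dicts with hypothetical keys:
    # 'title', 'date', 'paragraphs' (a list of plain-text strings).
    book = ET.Element("book")
    ET.SubElement(book, "title").text = book_title
    for entry in entries:
        chapter = ET.SubElement(book, "chapter")
        ET.SubElement(chapter, "title").text = entry["title"]
        # Assumption: mark the date with a role attribute so a later
        # import step has something to hang a distinct style on.
        date = ET.SubElement(chapter, "para", {"role": "date"})
        date.text = entry["date"]
        for text in entry["paragraphs"]:
            ET.SubElement(chapter, "para").text = text
    # ElementTree escapes &, < and > in text content on output.
    return ET.tostring(book, encoding="unicode")
```

Whether a given DocBook -> OpenOffice importer honors the `role` attribute is something to check against the tool actually used.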

posted by reborndata at 6:58 AM on December 21, 2006

You already have answers, but wanted to throw CutePDF out there. It gives you a PDF printer driver (just like Acrobat's / OS X's), so if you can print it, you can PDF it. I used it for a year or two, and my parents use it all the time. Very easy to use, and it's free.
posted by niles at 1:37 PM on December 21, 2006

cabb-chase, don't let stavros put you off. Welcome to MetaFilter.

(If you end up being a Microsoft shill, I take that back.)

Seriously, I thought cabb-chase's suggestion wasn't bad.
posted by Alt F4 at 2:27 PM on December 21, 2006

"cabb-chase, don't let stavros put you off. Welcome to MetaFilter."

Shit, that was his first post? OK, I apologize. It was still dumb, but I needn't have been so blunt about it.
posted by stavrosthewonderchicken at 6:36 PM on December 21, 2006

Response by poster: I want to address some of the other suggestions for the sake of posterity. Although they aren't relevant to my specific project, they may be for others.

HTML -> Word
This doesn't work for me because I want to automatically apply unique styles specifically to certain text like the blog titles, dates, comments, and so forth. If the blog text didn't contain HTML I could do some mapping like H1 = Title, H2 = date, and so forth. But since that's not the case (the articles do contain a variety of header tags), using HTML tags to represent Word styles won't work. This might work fine if someone is less concerned about fine control over format in the final output, or willing to put in the time to assign styles manually.
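For anyone who does want to try the H1 = Title, H2 = date mapping, one possible workaround for the header-tag collision is to push the headers inside the post bodies down a couple of levels before export, so the top two levels stay reserved for metadata. A minimal sketch in Python, assuming hypothetical `title`, `date`, and `body_html` fields and simple, well-formed header tags:

```python
import html
import re

# Matches the start of an opening or closing h1..h6 tag, e.g. "<h2" or "</H2".
HEADING_TAG = re.compile(r"<(/?)h([1-6])\b", re.IGNORECASE)


def demote_headings(body_html, shift=2):
    # Shift every header level down by 'shift', capping at h6.
    def repl(match):
        level = min(int(match.group(2)) + shift, 6)
        return "<{}h{}".format(match.group(1), level)

    return HEADING_TAG.sub(repl, body_html)


def entry_to_html(entry):
    # h1 and h2 are reserved for the title and date; body headers start at h3.
    return "<h1>{}</h1>\n<h2>{}</h2>\n{}".format(
        html.escape(entry["title"]),
        html.escape(entry["date"]),
        demote_headings(entry["body_html"]),
    )
```

This flattens the deepest header levels together and still leaves only six styles to work with, so it does not help with comments or other fields that need styles of their own.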

In college I actually used LaTeX (running on my NeXTstation!) for papers, and I loved it. The problem is that the recipient knows what LaTeX output looks like, and despises it (or at least did 12 years ago). Most LaTeX templates scream "peer reviewed research paper" and the blog is of a more artistic bent. I don't have the time to learn TeX well enough to customize the appearance sufficiently. Also, there's the general issue that the blog content is in HTML (sometimes used to rather creative purpose), so getting a faithful rendition of the content in LaTeX may be challenging.

Well, I could maybe have been called a programmer at one time in the distant past, but I don't know PHP and learning it is beyond the scope of this project. Also, I'm not confident I could attain the level of layout quality I'm shooting for in a programmatic approach... I'm more a point and click guy when it comes to visual design.

Looks interesting... but my layout won't be very sophisticated, and handling of DocBook seems to be slim to nonexistent. I'll definitely file it away if I'm ever trying to design something with more variation from page to page.

IE7 export
Answering as if this weren't a troll: the whole point of this exercise is to avoid manually handling each posting. There are over 600, commonly with 15+ comments. I'm also not trying to re-create the look of the blog in the book... to the contrary, I want it to look more like a book than a blog.

Getting to PDF is not the hard part. But CutePDF certainly deserves plugging, as well as PDFCreator.

Thanks again for all the ideas!
posted by reborndata at 5:34 AM on December 22, 2006

In my defense I did download a couple of blogs. One came into Word as the entire 3 years of blogging with each field associated with its particular style. 300+ pages with images, but sadly no comments. Maybe the step I missed to tell you about was that you must save the imported blog as a Word document (not as HTML).
posted by cabb-chase at 6:14 AM on December 22, 2006 [1 favorite]
