Not Planning Ahead Sucks
April 1, 2005 7:48 PM

I have 12 issues of an image-intensive 'zine in PDF. I've slowly been converting them to XHTML by hand. I'm halfway through the 5th issue and I'm sick of it. Is there an easy, or at least less difficult, way to convert these?
posted by Captaintripps to Computers & Internet (5 answers total)
 
You could try pdftohtml (http://pdftohtml.sourceforge.net/). I don't know how good it is, and the output might require some editing, but it should help a lot. There are several versions, so searching for pdf2html or pdftohtml might turn up other, maybe better, ones.
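
With twelve issues to get through, it may be worth scripting the conversion rather than running it by hand each time. Here is a rough Python sketch, assuming the tool is installed as `pdftohtml` on your PATH and that your build accepts the `-c` (keep the page layout) and `-noframes` (one HTML file, no frameset) options; check `pdftohtml -h` on your machine first. The function names and the folder names are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf_path, out_dir):
    """Build the pdftohtml command line for one issue (does not run it)."""
    pdf_path = Path(pdf_path)
    # Output base name: same file name as the PDF, minus the extension.
    out_base = Path(out_dir) / pdf_path.stem
    # Assumed flags: -c tries to preserve layout, -noframes avoids a frameset.
    return ["pdftohtml", "-c", "-noframes", str(pdf_path), str(out_base)]


def convert_all(pdf_dir, out_dir):
    """Run pdftohtml over every PDF found in pdf_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True stops the batch if the tool reports a failure.
        subprocess.run(pdftohtml_command(pdf, out_dir), check=True)
```

Calling `convert_all("issues", "html")` would then convert everything in a hypothetical `issues` folder into `html`. The output will still need hand editing, as noted above.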
posted by Boobus Tuber at 3:34 AM on April 2, 2005


For example, this (Intrapdf trial version), or this (Adobe online conversion).
posted by Boobus Tuber at 3:38 AM on April 2, 2005


Acrobat can export to HTML, though what it produces is not valid code. (HTML Tidy can assist with that.)
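
A minimal sketch of the Tidy step in Python, assuming the `tidy` command-line program is installed and on your PATH. The `-asxhtml`, `-q` and `-o` options are standard Tidy switches; the function names and file names are only illustrative.

```python
import subprocess


def tidy_command(src, dest):
    """Build the Tidy command line that writes an XHTML copy of src to dest."""
    # -asxhtml: convert to XHTML; -q: quiet; -o: write output to this file.
    return ["tidy", "-asxhtml", "-q", "-o", str(dest), str(src)]


def tidy_file(src, dest):
    """Run Tidy on one file and return (exit status, messages)."""
    result = subprocess.run(tidy_command(src, dest),
                            capture_output=True, text=True)
    # Tidy exits with 0 when clean, 1 if it issued warnings, 2 on errors,
    # so a nonzero status is reported to the caller rather than raised.
    return result.returncode, result.stderr
```

For example, `tidy_file("issue05_export.html", "issue05.html")` would leave the Acrobat export untouched and write the cleaned copy alongside it. Tidy's output should still be run through a validator.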

Platform? Number of pages? Multicolumn layouts? More info, please. Example if possible.
posted by joeclark at 6:19 AM on April 2, 2005


Sorry about not including my system specifications. I'm running Mac OS X, though I do have access to an XP box.

Adobe's online conversion simply doesn't cut it. It rearranges the formatting and leaves the areas for the images out. I get 23 pages of text, essentially.

Acrobat's export to HTML has never really worked properly, either. It's not just that it needs tweaking; it mucks up the formatting to the point where tweaking it doesn't save any time.

Again, I have no problem going in and fixing up some things, but the output has to be usable.

The 'zines are about 25 pages.

Examples:

PDF version

HTML version
posted by Captaintripps at 7:38 AM on April 2, 2005


All right, looking at your sample PDF file I see that it was created in PageMaker 7. (How retro!) PM7 can produce a tagged PDF. Do another export with tagging turned on, and no, I don't know which little button to tick, but it won't be hard to find. (On checking more closely in Acrobat's HTML export settings, it converts to tagged PDF itself. Nonetheless, give 'er a whirl.)

In Acrobat, an export to HTML or even plain text (the choice that isn't marked "Accessible" tends to be cleaner) will now quite probably work better.

I tried HTML+CSS export and the results are not really that bad. Your use of pictures of text for headlines is a complication. You have to write alt texts for your images. Some search-and-replace in BBEdit (also use of Tidy) will improve things. Don't stop till you've got actual valid code.
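
To find which images still need alt text after an export, something like the following Python sketch could help. It uses only the standard library's `html.parser`; the class and function names are my own, and it treats an empty alt as missing, which suits pictures of headline text but not purely decorative images.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            attr_map = dict(attrs)
            alt = attr_map.get("alt")
            if alt is None or not alt.strip():
                self.missing.append(attr_map.get("src", "(no src)"))


def images_missing_alt(html_text):
    """Return the src values of images lacking alt text, in document order."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.missing
```

Feed it the text of each exported page and work through the list it returns, writing the alt text by hand.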

Twelve issues are not that many issues. Don't try to do them all at once, but the job will gradually get done.
posted by joeclark at 3:00 PM on April 2, 2005

