Extract Text and Images From a PDF
August 28, 2008 8:11 AM

I'd like to extract the text and images from a multi-page pdf to use on the web.

I've got a number of large (100 to 200 pages) PDFs that I need to extract the text and images from to use with a CMS for a website. I've looked at a number of PDF to HTML software options, but they all seem to rely on CSS for positioning text, whereas I'd prefer to just have the text in simple paragraph format. If there's a PDF to HTML solution that doesn't format like this, I'd be interested in that, of course. Proper handling of tables would be quite beneficial.

The way Acrobat extracts text is decent, but I'd prefer it if there were no line breaks within individual paragraphs (although I could probably set up a PHP script to run through and fix that if it's my only option). As for the images, it'd be great if I had control over how they were saved: for example, all the page 1 images in a folder called PDF_NAME/PAGE_1. At the very least I need to know which page they were extracted from, either by the folder or filename.
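For the line-break cleanup, here is a minimal sketch in Python rather than PHP. It assumes the extracted text separates paragraphs with at least one blank line and wraps lines inside a paragraph with single newlines; that layout is an assumption, not something Acrobat guarantees. The function name `unwrap_paragraphs` is made up for illustration, and words hyphenated across line ends are not rejoined.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines so each paragraph becomes a single line.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split on blank lines (possibly containing stray whitespace).
    blocks = re.split(r"\n\s*\n", text)

    paragraphs = []
    for block in blocks:
        # Drop empty lines and trim the whitespace around each line.
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))

    # Keep one blank line between paragraphs in the output.
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line of a\nwrapped paragraph.\n\nSecond\nparagraph."
    print(unwrap_paragraphs(sample))
```

Each resulting paragraph could then be wrapped in whatever markup the CMS expects.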

I also know I can extract images from a PDF using Photoshop, but unless there's a way to bulk extract images and store them in the format mentioned above, I don't think the solution will really work.

I'm running Windows XP, but a solution that works with Linux (Ubuntu) would be fine. If there's an absolutely amazing piece of software for OS X, I'd be inclined to find a system I could use to run it. I'm also open to any web-based solutions (maybe I could use PHP for such a task?).
posted by backwards guitar to Computers & Internet (2 answers total)
I've been through tons of PDF converters, and the one that I liked the most is ABBYY PDF Transformer. Despite the dodgy name, it's really useful: it lets you select which regions of the PDF you'd like to convert, it can dump to simple paragraphs, and it converts to a variety of formats. I'm not sure about your advanced options, as my trial has finished now. It's worth downloading the trial, however, if you haven't already.
posted by ukdanae at 8:37 AM on August 28, 2008

The images part is easy. Acrobat 7 Pro has an "Advanced>Export All Images..." command that names them by page and then by a serial number.
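If the exported files need to end up in PDF_NAME/PAGE_N folders, a small script could sort them after the export. This is only a sketch: the filename pattern below (`<name>_Page_<page>_Image_<serial>.<ext>`) is a guess for illustration, so check what Acrobat actually writes and adjust the regular expression. The function names `page_folder` and `sort_images` are also made up here.

```python
import re
import shutil
from pathlib import Path

# Hypothetical naming scheme, e.g. "report_Page_003_Image_0001.jpg".
# Adjust this to match the filenames your export really produces.
PAGE_PATTERN = re.compile(r"_Page_(\d+)_Image_\d+\.\w+$", re.IGNORECASE)


def page_folder(filename, pdf_name):
    """Return the relative folder PDF_NAME/PAGE_N, or None if no page number is found."""
    match = PAGE_PATTERN.search(filename)
    if match is None:
        return None
    # int() drops leading zeros, so "003" becomes PAGE_3.
    return Path(pdf_name) / f"PAGE_{int(match.group(1))}"


def sort_images(src_dir, out_root, pdf_name):
    """Move every matching image from src_dir into out_root/PDF_NAME/PAGE_N/."""
    moved = []
    # sorted() lists the directory up front, before any file is moved.
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        folder = page_folder(path.name, pdf_name)
        if folder is None:
            continue  # leave files that don't match the pattern alone
        target_dir = Path(out_root) / folder
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Calling `sort_images("exported", "sorted", "report")` would move a file like `exported/report_Page_003_Image_0001.jpg` to `sorted/report/PAGE_3/`, under the naming assumption above.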

Getting nicely formatted text from PDFs is the tricky part. Are the PDFs tagged for XML? That would be an easy workflow to set up. Otherwise, you might need to go at the text manually with a word processor and macros (what I use) or with some tailor-made code.
posted by cowbellemoo at 8:42 AM on August 28, 2008
