Extract Text and Images From a PDF
August 28, 2008 8:11 AM
I'd like to extract the text and images from a multi-page PDF to use on the web.
I've got a number of large (100 to 200 pages) PDFs that I need to extract the text and images from to use with a CMS for a website. I've looked at a number of PDF-to-HTML software options, but they all seem to rely on CSS for positioning text, whereas I'd prefer to just have the text in simple paragraph format. If there's a PDF-to-HTML solution that doesn't format its output like this, I'd be interested in that, of course. Proper handling of tables would be quite beneficial.
The way Acrobat extracts text is decent, but I'd prefer it if there were no line breaks within individual paragraphs (although I could probably set up a PHP script to run through and fix that if it's my only option). As for the images, it'd be great if I had control over how they were saved: for example, all the page 1 images in a folder called PDF_NAME/PAGE_1. At the very least, I need to know which page each image was extracted from, either by the folder or the filename.
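To make the two requests above concrete, here is a rough Python sketch of what I have in mind. It is not tied to any particular extraction tool. It assumes the extracted text separates paragraphs with blank lines, and the function names, the image_N filename pattern, and the default png extension are all made up for illustration.

```python
import re
from pathlib import Path


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into one line per paragraph.

    Assumption: a blank line marks a paragraph boundary in the extracted text.
    """
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        words = block.split()  # splits on any whitespace, including newlines
        if words:
            paragraphs.append(" ".join(words))
    return "\n\n".join(paragraphs)


def image_output_path(pdf_path, page_number, image_index, extension="png"):
    """Build a path like PDF_NAME/PAGE_1/image_2.png (hypothetical naming scheme)."""
    pdf_name = Path(pdf_path).stem  # file name without directory or ".pdf"
    return Path(pdf_name) / f"PAGE_{page_number}" / f"image_{image_index}.{extension}"


if __name__ == "__main__":
    sample = "First line\nsecond line.\n\nNext para\ncontinues."
    print(unwrap_paragraphs(sample))
    print(image_output_path("reports/manual.pdf", 1, 2))
```

The unwrapping step would run on whatever text the extractor produces, and the path helper would only decide where each extracted image gets written. The extraction itself is the part I still need a tool for.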
I also know I can extract images from a PDF using Photoshop, but unless there's a way to bulk extract images and store them in the format mentioned above, I don't think the solution will really work.
I'm running Windows XP, but a solution that works with Linux (Ubuntu) would be fine. If there's an absolutely amazing piece of software for OS X, I'd be inclined to find a system I could use to run it. I'm also open to any web-based solutions (maybe I could use PHP for such a task?).
posted by backwards guitar to computers & internet (2 comments total)
posted by ukdanae at 8:37 AM on August 28, 2008