How to automatically extract graphical content from PDFs?
May 9, 2007 1:33 PM

Are there any software packages or toolkits (preferably open source) available that allow me to automatically extract graphical content (such as pictures, diagrams, graphs, etc.) from batches of PDFs?

I'm working on a grad school project where I would like to automatically extract any graphical content from batches of PDFs.
By graphical content, I mean pictures, graphs, diagrams... anything that's visual and not part of the full text.

I would also like to be able to automatically extract any captions that a picture would have, and perhaps the surrounding text... say half a page before and after the occurrence of the picture.

I'm trying to build a set of pictures from a large batch of PDFs, and classify/tag them based on the content of the captions or nearby text.

Thanks for any help!
posted by elbaso to Computers & Internet (4 answers total) 3 users marked this as a favorite
 
pdfimages, part of the Xpdf project from foolabs, is open source and should do what you want.
posted by roue at 1:46 PM on May 9, 2007
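
For batches, the suggestion above can be scripted. What follows is only a minimal sketch, assuming pdfimages is installed and on the PATH, and that all the PDFs sit in one folder. The function names build_command and extract_batch are made up for illustration. Also note that pdfimages pulls out embedded raster images only; diagrams drawn with PDF vector operators will not show up in its output.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Return the pdfimages argument list for one PDF (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    # pdfimages names its output <image-root>-NNN.<ext>; reuse the PDF's stem
    # as the root so each image can be traced back to its source file.
    image_root = Path(out_dir) / pdf_path.stem
    # -j writes DCT (JPEG) images as .jpg; other images are written as PPM/PBM.
    return ["pdfimages", "-j", str(pdf_path), str(image_root)]


def extract_batch(pdf_dir, out_dir, run=subprocess.run):
    """Run pdfimages on every *.pdf in pdf_dir; return the PDFs that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        result = run(build_command(pdf_path, out_dir))
        # A non-zero exit code usually means a damaged or encrypted file.
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

Calling extract_batch("papers", "images") would process every PDF in a folder named papers (both folder names are placeholders) and return the list of files that pdfimages could not handle, so a few bad PDFs do not stop a run over thousands.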


Printscreen.
Take a screenshot of the window that contains the data using Ctrl-Alt-PrntScrn, then paste (Ctrl-V) into a document. If you're using MS Office, you should be able to crop out everything you don't need. Then just hold down Shift to constrain proportions and enlarge the image to fill the space you need it to fill.
posted by ijoyner at 10:33 AM on May 10, 2007


@roue: Thanks, pdfimages looks like what I need.
I'll download it and see how it works.
posted by elbaso at 12:08 PM on May 10, 2007


@ijoyner: Thanks for the suggestion, but that solution doesn't really work for me.
I need a way to extract images from large batches of files automatically, since I'll be dealing with hundreds, perhaps thousands, of PDFs.
posted by elbaso at 12:10 PM on May 10, 2007
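
For the captions and nearby text mentioned in the question, one possible approach is to dump each page's text with pdftotext (also part of Xpdf) and look for caption-like lines. This is a rough sketch under loud assumptions: that captions begin a line with "Figure N" or "Fig. N", and that a fixed window of lines is a good enough stand-in for "half a page before and after". The names page_text and captions_with_context are hypothetical.

```python
import re
import subprocess

# Assumption: a caption starts a line with "Figure 3", "Fig. 3:", and so on.
CAPTION_RE = re.compile(r"^\s*(Fig\.|Figure)\s*\d+", re.IGNORECASE)


def page_text(pdf_path, page):
    """Return one page's text via pdftotext ("-" sends the text to stdout)."""
    cmd = ["pdftotext", "-f", str(page), "-l", str(page),
           "-layout", str(pdf_path), "-"]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout


def captions_with_context(text, window=10):
    """Return (caption, context_lines) pairs for caption-like lines in text."""
    lines = text.splitlines()
    found = []
    for i, line in enumerate(lines):
        if CAPTION_RE.match(line):
            # Keep `window` lines on each side of the caption as context.
            start = max(0, i - window)
            found.append((line.strip(), lines[start:i + window + 1]))
    return found
```

Passing the result of page_text(...) into captions_with_context(...) gives caption strings that could serve as tags. Matching a caption to a specific extracted image file is not handled here and would need extra work, such as comparing page numbers.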

