Program to find and extract images in book scans?
October 3, 2008 7:35 AM

Is there a program that will automatically take scans of pages that contain text and images and then extract the images and place them in separate files? Open source would be best.

I have hundreds of pages of a scanned book that contain text and images. I'd like to have all the images grabbed out of those scans and placed in their own files, with unique names, and with some kind of indication in the image file names which original scan they came from. This is a monstrous chore to do manually, but it strikes me as something that should be fairly easy to do programmatically.
posted by Mo Nickels to Computers & Internet (8 answers total)
 
Not very likely. It is very complex to do programmatically unless all the scanned images are in the same location/size on each page.

You might be able to create an action in Photoshop that could do it, but it would be very complicated and may not work well, if at all.

If you want open source the best place to start searching is Sourceforge. Good luck.
posted by JJ86 at 8:08 AM on October 3, 2008


Actually, in the grand scheme of image processing/analysis this sounds relatively easy (emphasis on relatively). You'd basically be looking for rectangular areas that were mostly not white. You'd have to compensate for the rectangles not being exactly rectangular due to how the pages were scanned and deal with the vagueness of what it means to be not-white.
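To make the "rectangular areas that were mostly not white" idea concrete, here is a minimal sketch in plain Python. It assumes the page has already been loaded as rows of grayscale values (0 is black, 255 is white). The function name and the threshold, minimum-size, and fill parameters are illustrative guesses, not a tested recipe; real scans would need deskewing first and some tuning of what counts as not-white.

```python
from collections import deque


def find_image_boxes(pixels, white_threshold=200, min_width=3,
                     min_height=3, min_fill=0.5):
    """Return (left, top, right, bottom) boxes of dense non-white regions.

    pixels is a list of rows of grayscale values, 0 (black) to 255 (white).
    Box coordinates are inclusive. All defaults are assumptions to tune.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or pixels[y][x] >= white_threshold:
                continue
            # Flood-fill one connected blob of non-white pixels,
            # tracking its bounding box and pixel count as we go.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom, count = x, y, x, y, 0
            while queue:
                cx, cy = queue.popleft()
                count += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and pixels[ny][nx] < white_threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            box_w = right - left + 1
            box_h = bottom - top + 1
            # Keep blobs that are large enough and mostly filled in;
            # small or sparse blobs are more likely text or specks.
            if (box_w >= min_width and box_h >= min_height
                    and count / (box_w * box_h) >= min_fill):
                boxes.append((left, top, right, bottom))
    return boxes
```

On a real page the minimum width and height would be set far larger than any letter, and the fill ratio lets a slightly skewed rectangle still pass.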

The problem is that this sounds like a really special-purpose app, so it's possible nobody has written and released a tool that does what you want.
posted by mmascolino at 8:21 AM on October 3, 2008


It's going to be hard to detect images programmatically. I've done this manually for several books: working from copies of the original files, I open them 32 pages at a time in GIMP, roughly crop out the images, and save them. If there are multiple images on a page, duplicate the image in GIMP before you crop. You can then, if you have a clean white background, splat the rough crops through pnmcrop to clean off the excess.
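For anyone curious what the cleanup step amounts to, here is a rough sketch in plain Python of trimming the excess white from around a rough crop. It is not pnmcrop's actual algorithm, just the same idea under the assumption of a clean white background and a grayscale image held as rows of values (0 is black, 255 is white). The function name and threshold are made up for illustration.

```python
def trim_white_border(pixels, white_threshold=250):
    """Crop away all-white rows and columns around a grayscale image.

    pixels is a list of equal-length rows of values from 0 to 255.
    Returns a new list of rows, or [] if the image is entirely white.
    """
    def is_ink(value):
        return value < white_threshold

    # Rows and columns that contain at least one non-white pixel.
    rows = [y for y, row in enumerate(pixels)
            if any(is_ink(v) for v in row)]
    if not rows:
        return []  # nothing but background in this crop
    cols = [x for x in range(len(pixels[0]))
            if any(is_ink(row[x]) for row in pixels)]

    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    # Keep everything between the outermost inked rows and columns.
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]
```

Scanner noise in the margins will defeat this, which is why the white background needs to be clean first.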

You might want to use something like unpaper to batch clean and deskew the scans. I wish it had existed when I was doing this sort of thing.
posted by scruss at 8:22 AM on October 3, 2008


Also useful for digitizing all those girly magazines in the closet...
posted by mrbarrett.com at 8:23 AM on October 3, 2008


Response by poster: By the way, Adobe Acrobat Pro will do it as part of its OCR function, but its crops tend to be poor, leaving most images clipped one way or another. It's better than nothing but it's not good enough for daily use.
posted by Mo Nickels at 9:18 AM on October 3, 2008


Not sure, but would something like Evernote help?
posted by bashos_frog at 10:27 AM on October 3, 2008


Best answer: I wrote an image extractor for the Internet Archive that grabbed image coordinates from the XML output of a proprietary OCR engine. If the books you have are public domain, upload the images to archive.org and we can run them through the same OCR system. The image coordinates are very accurate.

If I were going to redo this using open source tools, I would use the layout analysis engine from OCRopus to get image coordinates.
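As a rough illustration of the coordinates-to-files step, here is a sketch in Python. The XML layout is invented for the example: the page and region elements and their scan, type, left, top, right, and bottom attributes are assumptions, not the proprietary engine's schema and not OCRopus's output format. The point is only how each image region can get a unique file name that records which scan it came from, as the question asks.

```python
import os
import xml.etree.ElementTree as ET


def image_regions(layout_xml):
    """Yield (output_name, (left, top, right, bottom)) per image region.

    Expects a HYPOTHETICAL layout format such as:
      <page scan="scan_0042.png">
        <region type="image" left="1" top="2" right="3" bottom="4"/>
      </page>
    """
    page = ET.fromstring(layout_xml)
    # Base the output names on the scan's file name, minus its extension.
    stem = os.path.splitext(os.path.basename(page.get("scan")))[0]
    index = 0
    for region in page.iter("region"):
        if region.get("type") != "image":
            continue  # skip text blocks and anything else
        index += 1
        box = tuple(int(region.get(key))
                    for key in ("left", "top", "right", "bottom"))
        yield f"{stem}_img{index:02d}.png", box
```

Each box could then be handed to whatever image tool does the actual cropping, writing the result under the generated name.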
posted by rajbot at 6:20 PM on October 3, 2008 [1 favorite]


Maybe I'm mistaken, but I think OmniPage Pro does this reasonably well. Not open source, not cheap, and no trial version, although I seem to remember them having a money-back guarantee.

Have you tried OCRopus? I don't see a feature list, so I don't know whether it purports to extract images into files, but it might be worth a try.
posted by kristi at 9:35 PM on October 3, 2008

