Program to find and extract images in book scans?
October 3, 2008 7:35 AM
Is there a program that will automatically take scans of pages that contain text and images and then extract the images and place them in separate files? Open source would be best.
I have hundreds of pages of a scanned book that contain text and images. I'd like to have all the images grabbed out of those scans and placed in their own files, with unique names, and with some kind of indication in the image file names which original scan they came from. This is a monstrous chore to do manually, but it strikes me as something that should be fairly easy to do programmatically.
Actually, in the grand scheme of image processing/analysis this sounds relatively easy (emphasis on relatively). You'd basically be looking for rectangular areas that were mostly not white. You'd have to compensate for the rectangles not being exactly rectangular due to how the pages were scanned and deal with the vagueness of what it means to be not-white.
The problem is that this sounds like a really special-purpose app, so someone might not have written and released a tool that does what you want.
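To make the "mostly not white rectangles" idea concrete, here is a minimal sketch in plain Python. It assumes the page has already been loaded as rows of 0-255 grayscale values and deskewed. The function name and every threshold (`white`, `tile`, `min_dark`, `min_tiles`) are made up for illustration and would need tuning on real scans. It splits the page into small tiles, keeps the tiles that are mostly dark, and returns the bounding box of each connected group of them.

```python
from collections import deque


def find_image_boxes(page, white=200, tile=4, min_dark=0.5, min_tiles=4):
    """Return (left, top, right, bottom) boxes of mostly-non-white regions.

    page is a list of rows of 0-255 grayscale values. All thresholds are
    illustrative guesses, not tuned values.
    """
    height, width = len(page), len(page[0])
    rows, cols = height // tile, width // tile  # partial edge tiles are ignored

    # Mark each tile whose share of non-white pixels is at least min_dark.
    dark = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            count = sum(
                1
                for y in range(r * tile, (r + 1) * tile)
                for x in range(c * tile, (c + 1) * tile)
                if page[y][x] < white
            )
            dark[r][c] = count >= min_dark * tile * tile

    # Group touching dark tiles (4-connectivity) and record each group's box.
    boxes = []
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not dark[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0 = r1 = r
            c0 = c1 = c
            size = 0
            while queue:
                cr, cc = queue.popleft()
                size += 1
                r0, r1 = min(r0, cr), max(r1, cr)
                c0, c1 = min(c0, cc), max(c1, cc)
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and dark[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Small groups are more likely specks or heavy text than pictures.
            if size >= min_tiles:
                boxes.append((c0 * tile, r0 * tile, (c1 + 1) * tile, (r1 + 1) * tile))
    return boxes
```

Thin lines of text mostly fall below the `min_dark` share per tile, which is the crude way this sketch tells text from pictures. Dense type, light line drawings, and skewed pages would all need more care than this.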
posted by mmascolino at 8:21 AM on October 3, 2008
It's going to be hard detecting images programmatically. I've done this manually for several books: working from copies of the original files, open them 32 pages at a time in GIMP, roughly crop out the images, and save them. If there are multiple images on a page, duplicate the image in GIMP before you crop. You can then, if you have a clean white background, splat the rough crops through pnmcrop to clean off the excess.
You might want to use something like unpaper to batch clean and deskew the scans. Wish it had existed when I was doing this sort of thing.
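For anyone curious what that last cleanup step amounts to, here is a rough sketch of trimming an all-white border from a rough crop. This is not pnmcrop itself, just the same idea in plain Python. It assumes the crop is already in memory as rows of 0-255 grayscale values, and the function name and the `white` cutoff are invented for the example.

```python
def trim_white_border(pixels, white=250):
    """Crop away outer rows and columns that contain only white pixels.

    pixels is a list of rows of 0-255 grayscale values; anything >= white
    counts as background. Returns a new list of rows, or [] if the whole
    crop is white.
    """
    # Rows that contain at least one non-white pixel.
    keep_rows = [y for y, row in enumerate(pixels) if any(v < white for v in row)]
    if not keep_rows:
        return []
    top, bottom = keep_rows[0], keep_rows[-1]
    band = pixels[top:bottom + 1]

    # Columns that contain at least one non-white pixel within those rows.
    keep_cols = [x for x in range(len(pixels[0])) if any(row[x] < white for row in band)]
    left, right = keep_cols[0], keep_cols[-1]

    # Only the outer margin is removed; white pixels inside the picture stay.
    return [row[left:right + 1] for row in band]
```

A scan with specks or a grey cast in the margins will defeat a fixed cutoff like this, which is one reason to clean the pages first.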
posted by scruss at 8:22 AM on October 3, 2008
Also useful for digitizing all those girly magazines in the closet...
posted by mrbarrett.com at 8:23 AM on October 3, 2008
Response by poster: By the way, Adobe Acrobat Pro will do it as part of its OCR function, but its crops tend to be poor, leaving most images clipped one way or another. It's better than nothing but it's not good enough for daily use.
posted by Mo Nickels at 9:18 AM on October 3, 2008
Not sure, but would something like Evernote help?
posted by bashos_frog at 10:27 AM on October 3, 2008
Best answer: I wrote an image extractor for the Internet Archive that grabbed image coordinates from the XML output of a proprietary OCR engine. If the books you have are public domain, upload the images to archive.org and we can run them through the same OCR system. The image coordinates are very accurate.
If I was going to redo this using open source tools, I would use the layout analysis engine from Ocropus to get image coordinates.
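As a rough illustration of the coordinate-driven approach, the sketch below reads image regions from a layout XML file and pairs each one with an output file name that records the source scan. The XML schema here (`<region type="image" l t r b>`) is invented for the example. It is neither the proprietary engine's format nor Ocropus output, and the function name and naming pattern are likewise assumptions.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def image_crops(ocr_xml, scan_name):
    """Return [(output_name, (left, top, right, bottom)), ...] for one scan.

    Assumes a hypothetical schema in which each detected block is a
    <region> element with a type attribute and l/t/r/b pixel coordinates.
    """
    root = ET.fromstring(ocr_xml)
    stem = Path(scan_name).stem  # e.g. "page_0042" from "scans/page_0042.tif"

    # Keep only the regions the layout analysis labelled as images.
    images = [el for el in root.iter("region") if el.get("type") == "image"]

    crops = []
    for index, el in enumerate(images, start=1):
        box = tuple(int(el.get(key)) for key in ("l", "t", "r", "b"))
        # The scan's name plus a counter keeps names unique and traceable.
        crops.append((f"{stem}_img{index:02d}.png", box))
    return crops
```

Each box could then be handed to whatever image library does the actual cropping and saving.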
posted by rajbot at 6:20 PM on October 3, 2008 [1 favorite]
Maybe I'm mistaken, but I think OmniPage Pro does this reasonably well. Not open source, not cheap, and no trial version, although I seem to remember them having a money-back guarantee.
Have you tried OCROpus? I don't see a feature list, so I don't know whether it purports to extract images into files, but it might be worth a try.
posted by kristi at 9:35 PM on October 3, 2008
This thread is closed to new comments.
You might be able to create an action in Photoshop that could do it, but it would be very complicated and may not work too well, if at all.
If you want open source, the best place to start searching is Sourceforge. Good luck.
posted by JJ86 at 8:08 AM on October 3, 2008