How do I seek and destroy image-only PDFs?
December 16, 2008 10:47 PM

Is there Mac-based software that will search my entire hard drive or designated folders for image-only PDF files (ones that have not been OCR'ed), automatically run OCR on them (using Acrobat Pro or whatever), and overwrite each original file with a searchable version?

I am in the process of scanning all of my personal and business files. I have been very pleased with the results so far and love the ease of using Spotlight or Google Desktop to locate searchable PDF files, etc.

However, I have hundreds of older image-only PDFs scattered randomly across different folders. Currently, I open each questionable PDF and manually check whether it is image-only or searchable. If the PDF is image-only, I manually run OCR using Acrobat Pro and then save the searchable version over the original file.

I am looking for a way to automate this tedious process. So far, I have only been able to find scripts and the like that batch process groups of PDFs I have already picked out. I am looking for something that will "search and destroy" on its own.

I have a MacBook running OS X 10.5.6, Adobe Acrobat Pro 8 and a Fujitsu ScanSnap scanner.
posted by randex8 to Computers & Internet (4 answers total)
Best answer: DEVONthink will create a database. You don't have to import the files -- you can just index them, and once you do that it'll OCR them and save them. In addition, it has a very nifty algorithm that finds documents similar to another document. It seems to support the ScanSnap pretty well.
posted by suedehead at 11:48 PM on December 16, 2008

I believe Evernote will index and scan PDFs (and image files like JPGs) for text and make the lot searchable.
posted by Happy Dave at 11:52 PM on December 16, 2008

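The beginnings of an approach, in totally untested Ruby.

require "shellwords"

# Recursively find every PDF under the current directory.
files = Dir["**/*.pdf"]
files.each do |file|
  # The trailing "-" makes pdftotext write the extracted text to stdout.
  content = `pdftotext #{Shellwords.escape(file)} -`
  # Skip PDFs that already contain text; only image-only ones need OCR.
  next unless content.strip.empty?

  ## Now fire up the OCR job, and shuffle files around
end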

It'd take a fair amount of tweaking and playing to make it all work, but that's the approach I'd use. The pdftotext program doesn't come with OS X, but it can be installed through MacPorts as part of the xpdf package.
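
A slightly fuller sketch of the same idea, still only a starting point. It assumes pdftotext is on the PATH, and it treats a PDF as image-only when pdftotext extracts nothing but whitespace. The names image_only?, find_image_only_pdfs, replace_with_backup and PDFTOTEXT are made up for illustration. The OCR step itself is left out, since how you drive it depends on your OCR tool.

```ruby
require "fileutils"
require "open3"

# Treat a PDF as image-only when text extraction yields nothing but
# whitespace. pdftotext emits form feeds between pages; String#strip
# removes those along with spaces and newlines.
def image_only?(extracted_text)
  extracted_text.strip.empty?
end

# Default extractor (assumes pdftotext is installed). The "-" argument
# sends the text to stdout. Passing arguments separately, rather than as
# one shell string, avoids quoting problems with odd filenames.
PDFTOTEXT = lambda do |path|
  out, _status = Open3.capture2("pdftotext", path, "-")
  out
end

# Hypothetical helper: list the PDFs under root that appear to have no
# text layer. The extractor can be swapped out, e.g. for testing.
def find_image_only_pdfs(root, extractor: PDFTOTEXT)
  Dir.glob(File.join(root, "**", "*.pdf")).sort.select do |path|
    image_only?(extractor.call(path))
  end
end

# Hypothetical helper: put the OCR'd file in place of the original,
# keeping the untouched original next to it with a .bak suffix.
def replace_with_backup(original, ocr_output)
  FileUtils.cp(original, "#{original}.bak")
  FileUtils.mv(ocr_output, original)
end
```

Calling find_image_only_pdfs(File.expand_path("~/Documents")) would return the candidates for OCR. One caveat: a PDF that pdftotext cannot read at all (encrypted or damaged, say) also produces no text, so it would land in the list too. That is one reason to keep the .bak copies until you've checked the results.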
posted by cschneid at 10:18 AM on December 18, 2008

Response by poster: Thanks for the quick replies!
posted by randex8 at 7:58 PM on December 18, 2008
