How do I seek and destroy image-only PDFs?
December 16, 2008 10:47 PM
Is there Mac-based software available that will search my entire hard drive or designated folders for image-only PDF files (ones that have not been OCR'ed), automatically run OCR on them (using Acrobat Pro or whatever), and overwrite each original file with a searchable version?
I am in the process of scanning all of my personal and business files. I have been very pleased with the results so far and love the ease of using Spotlight or Google Desktop to locate searchable PDF files, etc.
However, I have hundreds of older image-only PDFs scattered randomly across different folders. Currently, I open each questionable PDF and manually check whether it is image-only or searchable. If the PDF is image-only, I manually run OCR using Acrobat Pro and then save the searchable version over the original file.
I am looking for a way to automate this tedious process. So far, I have only been able to find scripts and the like that batch process groups of PDFs you have already picked out. I am looking for something that will "search and destroy" on its own.
I have a MacBook running OS X 10.5.6, Adobe Acrobat Pro 8 and a Fujitsu ScanSnap scanner.
I believe Evernote will index and scan PDFs (and image files like JPGs) for text and make the lot searchable.
posted by Happy Dave at 11:52 PM on December 16, 2008
The beginnings of an approach, in totally untested Ruby:
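require "shellwords"

files = Dir["**/*.pdf"]  # "**/" recurses into subfolders
files.each do |file|
  # The trailing "-" makes pdftotext print to stdout instead of writing a .txt file
  content = `pdftotext #{file.shellescape} -`
  next unless content.strip.empty?  # already has a text layer, so skip it
  ## Now fire up the OCR job, and shuffle files around
end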
It'd take a fair amount of tweaking and playing to make it all work, but that's the approach I'd use. The pdftotext program doesn't come with OS X, but it can be installed through MacPorts as part of the xpdf package.
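A slightly fuller sketch of the detection half, in case it helps. It assumes pdftotext is on the PATH and treats a PDF as image-only when the extracted text has fewer than MIN_TEXT_CHARS visible characters. That threshold, and the names extract_text, image_only? and find_image_only_pdfs, are made up for illustration. It only lists candidates; the OCR step and the overwrite are left out on purpose, since driving Acrobat is a separate problem.

```ruby
require "open3"

# Hypothetical threshold: fewer visible characters than this is treated
# as "no text layer". Tune it against a few known scans.
MIN_TEXT_CHARS = 20

# Runs pdftotext on one file; the "-" argument sends the text to stdout.
# Returns nil if pdftotext exits with an error (damaged or encrypted file).
def extract_text(path)
  out, status = Open3.capture2("pdftotext", path, "-")
  status.success? ? out : nil
end

# True when the extracted text is essentially empty. Scanned PDFs usually
# yield only whitespace and form feeds, so all whitespace is dropped first.
def image_only?(text, min_chars = MIN_TEXT_CHARS)
  return false if text.nil? # extraction failed: leave it for manual review
  text.scrub("").gsub(/\s/, "").length < min_chars
end

# Walks root recursively and returns the PDFs that look image-only.
# The extractor argument exists so the logic can be tried without pdftotext.
def find_image_only_pdfs(root, extractor = method(:extract_text))
  Dir.glob("**/*", base: root).sort
     .map { |rel| File.join(root, rel) }
     .select { |path| File.file?(path) && File.extname(path).downcase == ".pdf" }
     .select { |path| image_only?(extractor.call(path)) }
end

# Usage: ruby find_scans.rb ~/Documents ~/Scans
if __FILE__ == $PROGRAM_NAME
  ARGV.each { |root| puts find_image_only_pdfs(root) }
end
```

Printing the list first, rather than overwriting anything, gives you a chance to spot-check a few files before pointing an OCR batch job at them.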
posted by cschneid at 10:18 AM on December 18, 2008
posted by suedehead at 11:48 PM on December 16, 2008