Image PDF with OCR - how, without Acrobat?
May 25, 2011 4:50 PM

The full version of Adobe Acrobat has a way to OCR scanned images, so that the image is still viewed in the PDF, but you can search for text in the document. How do you do that without Acrobat?

I have a very specific project which doesn't need all of Acrobat's fun features -- just the ability to make a PDF of scanned images text-searchable. I already have the ability to take images and make a *separate* text file from OCR. What I want is Adobe's fancy ability to have the two together in a single file.

If there is an existing utility: it doesn't need to be free, I just don't think I can justify to my boss $700 for the full Acrobat for this project. Of course, free would be awesome. Command line automation would be very awesome, too.

If there's no existing utility and I have to start with a jpeg and a text file: I am also a skilled C++ and VB programmer, so if you know where there's documentation on the format, I can make a utility myself. If there's a way to make Ghostscript do this, it would be the awesomest thing ever.
posted by AzraelBrown to Computers & Internet (7 answers total) 4 users marked this as a favorite
If you Google PDF OCR you will find this as the first result. No idea how well it works, if at all.
posted by kindall at 4:53 PM on May 25, 2011

OmniPage is a decent OCR program at a very nice price.

I believe you can scan images and create searchable PDFs that you can then search with the free version of Acrobat...

I have only used it at my job to create separate .txt files from .jpeg files for ContentDM.

At home I have the full-blown version of Acrobat and it definitely rocks (bought it at a hefty discount from my university while a graduate student).
posted by cinemafiend at 5:19 PM on May 25, 2011

Google docs? You can upload and download in batches iirc.
posted by idb at 6:09 PM on May 25, 2011

Best answer: Funny you should mention this as it is exactly a product I work on. My company, Atalasoft, makes components that our customers glue together into products. Included in that is a variety of OCR engines (Tesseract, GlyphReader, Iris, RecoStar) that can be used to turn scanned documents into searchable PDFs (among other things).

There is at least one precompiled demo app that will do this for you, so you don't need to write code.

I wrote the OCR engine wrappers for all of these (except Tesseract) and wrote the general OCR class hierarchy as well as all the PDF export tools.

In addition, there are hooks for setting bookmarks, so it's not hard to extend the app so that, for every page, if you find text that matches a chapter/section header format, you build an outline on the fly and write it into the document.

This is all Windows based, running in .NET - so it plays nice with all the .NET languages (C#, VB.NET, F#, etc).

FWIW - embedded OCR text in a PDF is a cute hack that gets done one of two ways: either you lay down the text and place the image on top, or you lay down the image and place invisible text on top of it (or under it; it doesn't matter). The invisible-text hack was put into Acrobat 1.0 (which I worked on at Adobe) for completeness more than anything else. Text could be rendered as any combination of stroked, filled, or used as a clipping area. One valid combination was none of them, which makes the text invisible. It was left in and turned out to work nicely for OCR.
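
For anyone who wants to see what invisible text looks like at the file level, here is a minimal sketch in plain shell. It writes a one-page PDF whose only text is drawn with "3 Tr", the content-stream operator that selects text rendering mode 3 (neither fill nor stroke). The output name invisible-text.pdf, the Helvetica font, the Letter page size, and the sample string are all assumptions chosen for illustration; a real OCR layer would also position each word over its place in the scanned image.

```shell
#!/bin/sh
# Sketch only: build a minimal one-page PDF whose text is invisible but searchable.
set -eu
out=invisible-text.pdf   # hypothetical output name

# Content stream: "3 Tr" sets text rendering mode 3 (no fill, no stroke).
stream='BT /F1 10 Tf 3 Tr 72 720 Td (hello searchable text) Tj ET'
len=$(printf '%s' "$stream" | wc -c | tr -d ' ')

printf '%%PDF-1.4\n' > "$out"
offsets=''
add_obj() {
  # Record this object's byte offset for the xref table, then append the object.
  offsets="$offsets $(wc -c < "$out" | tr -d ' ')"
  printf '%s\n' "$1" >> "$out"
}

add_obj '1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj'
add_obj '2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj'
add_obj '3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>
endobj'
add_obj "4 0 obj
<< /Length $len >>
stream
$stream
endstream
endobj"
add_obj '5 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj'

# Cross-reference table: one 20-byte entry per object, then the trailer.
xref_pos=$(wc -c < "$out" | tr -d ' ')
{
  printf 'xref\n0 6\n'
  printf '0000000000 65535 f \n'
  for off in $offsets; do printf '%010d 00000 n \n' "$off"; done
  printf 'trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n%s\n%%%%EOF\n' "$xref_pos"
} >> "$out"
```

Opening the result in a viewer should show a blank page, while a text search for "searchable" should still find a hit. Changing "3 Tr" to "0 Tr" makes the same text visible, which is handy for checking placement.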
posted by plinth at 8:11 PM on May 25, 2011 [4 favorites]

I've been looking at PDF options now that I have a scanner with an ADF and I really like Filecenter.
posted by wongcorgi at 8:43 PM on May 25, 2011

I believe Evernote will do this.
posted by Diag at 8:50 PM on May 25, 2011

Response by poster: Thanks, plinth -- that pointed me in the right direction for how to look at the problem! Here's my solution, using open source command-line tools. It isn't very refined yet, but it does the job, in case anybody else is looking for an option:

I start with a TIFF image, test.tif, and the text file from the OCR, test.txt, and it makes the PDF test.pdf:

tiff2pdf test.tif -o image-top-layer.pdf
(Turns the TIFF into a PDF)

a2ps -M Letter --borders=no -B -R -f 10 -o test.ps test.txt
(That makes a letter-sized PostScript file, test.ps, from the txt file: no borders or headers or anything, 10pt text)

gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=text-hidden-layer.pdf test.ps
(Turns the PostScript file into a PDF)

pdftk image-top-layer.pdf background text-hidden-layer.pdf output test.pdf
(Takes the two PDFs, inserting the text layer as a "background" of the image layer, which isn't visible through the top image layer as long as the page sizes and transparency are set right)

It only nicely handles single-page images and the text isn't laid out quite right compared to the image, but it's a proof-of-concept to my boss that I actually can create a searchable-text PDF from an image in an automated, unattended way for the quarter-of-a-million pages we're going to be scanning (you can see why I want to avoid doing it manually in Acrobat!). Also, all of the above utilities come in Windows/GnuWin32 flavors, which is how I'm doing it.
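
For the unattended batch run, the four steps could be wrapped in a small shell script along these lines. This is a rough sketch, not a tested production script: the function names, the DRY_RUN switch, and the intermediate file names are made up for illustration, and it assumes each page is a single-page name.tif with a matching name.txt in the current directory. The tool flags are the ones from the steps above, with a2ps writing to an explicit .ps file that gs then reads.

```shell
#!/bin/sh
# Hypothetical batch wrapper around the tiff2pdf / a2ps / gs / pdftk steps.

run() {
  # With DRY_RUN set, print the command instead of executing it.
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

make_searchable_pdf() {
  base=$1   # e.g. "test" for test.tif + test.txt
  run tiff2pdf "$base.tif" -o "$base-image.pdf"
  run a2ps -M Letter --borders=no -B -R -f 10 -o "$base.ps" "$base.txt"
  run gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$base-text.pdf" "$base.ps"
  run pdftk "$base-image.pdf" background "$base-text.pdf" output "$base.pdf"
}

process_all() {
  # Handle every TIFF that has a matching OCR text file; skip the rest.
  for tif in *.tif; do
    [ -e "$tif" ] || continue
    base=${tif%.tif}
    if [ -f "$base.txt" ]; then
      make_searchable_pdf "$base"
    else
      echo "skipping $tif: no $base.txt" >&2
    fi
  done
}
```

Setting DRY_RUN=1 and calling process_all prints the commands without running them, which is a cheap way to check the file naming before turning it loose on a large batch.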
posted by AzraelBrown at 11:53 AM on May 26, 2011 [3 favorites]
