Looking for an OCR program that will handle batch processing and columns automatically.
November 28, 2011 1:47 PM

We're looking for an OCR program that will handle batch processing and columns automatically.

We need to OCR about half a million scanned newspaper pages of English-language material. We already have the images in TIFF format, so we would like the OCR program to handle batches and recognize columns automatically. Ideally, we want this process to be as automated as possible, with minimal hands-on work on the material. I'd appreciate input from anyone who has worked with OCR software on a similarly scoped project.

We're planning to look into the free OCR software Tesseract, but we are not averse to spending some money if it means quality and less manual tweaking. This project will be grant-funded so we should be able to work reasonable software costs into the grant.
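For anyone weighing the Tesseract route, here is a rough sketch of what an unattended batch run over a folder of TIFFs could look like. It assumes the `tesseract` command-line program is installed and on the PATH, and it uses the basic `tesseract <image> <output base> -l <lang>` form of the call. The function names and the folder layout are made up for illustration, and this says nothing about column handling or recognition quality.

```python
import subprocess
from pathlib import Path


def find_tiffs(root):
    """Return every .tif/.tiff file under root, sorted for a stable run order."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in {".tif", ".tiff"}
    )


def build_command(image, out_dir, lang="eng"):
    """Build the call: tesseract <image> <output base> -l <lang>.

    Tesseract appends .txt to the output base itself. Files with the same
    name in different subfolders would overwrite each other in this sketch.
    """
    out_base = Path(out_dir) / Path(image).stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def run_batch(in_dir, out_dir):
    """OCR every TIFF under in_dir; collect failed pages instead of stopping."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for image in find_tiffs(in_dir):
        result = subprocess.run(build_command(image, out_dir),
                                capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(image)
    return failed
```

Collecting failures rather than raising on the first one matters at this scale: with half a million pages, a few unreadable files should not halt the whole run.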
posted by pahool to Computers & Internet (5 answers total) 2 users marked this as a favorite
We used OmniPage for a similar thing - mostly books, but some newspapers too. It's not perfect, but its accuracy is great, it's trainable, it handles columns pretty well without needing them to be manually drawn in and it'll happily chug through several hundred images on its own. The only batch managing experience I have with it is basic 'process all images from this folder', although its batch manager will do some more sophisticated things too.
posted by Catseye at 2:13 PM on November 28, 2011

We're using PDFCompressor (http://www.cvisiontech.com/products/general/pdfcompressor.html) at work to convert PDF documents whose text was previously stored as images into selectable text. Tesseract's output is garbage compared to the quality of this software's output.
posted by DetriusXii at 2:32 PM on November 28, 2011 [1 favorite]

Tesseract won't give you columns. You will have to intuit them from the content returned from the engine. The quality of recognition is not particularly good, and the engine does unattended learning, which often makes its recognition get worse (!!).
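To give an idea of what intuiting columns from the engine's output can involve, here is a minimal sketch. It assumes you can get a bounding box `(x0, y0, x1, y1)` in pixels for each recognized word, which is an assumption about your engine and not something every engine or version provides. The `min_gap` threshold is a hypothetical tuning value. The idea is to find vertical strips with no words in them and treat those as gutters.

```python
def group_into_columns(boxes, min_gap=20):
    """Split word boxes (x0, y0, x1, y1) into columns at wide horizontal gaps.

    min_gap is a hypothetical tuning value in pixels: a word-free vertical
    strip at least this wide is treated as a gutter between columns.
    """
    if not boxes:
        return []
    # Sweep left to right, merging x-ranges that overlap or nearly touch.
    spans = []
    for x0, _, x1, _ in sorted(boxes):
        if spans and x0 - spans[-1][1] < min_gap:
            spans[-1][1] = max(spans[-1][1], x1)
        else:
            spans.append([x0, x1])
    # Put each word in the span containing its left edge.
    columns = [[] for _ in spans]
    for box in boxes:
        for i, (left, right) in enumerate(spans):
            if left <= box[0] <= right:
                columns[i].append(box)
                break
    # Within a column, read top to bottom, then left to right.
    return [sorted(col, key=lambda b: (b[1], b[0])) for col in columns]
```

This breaks down on real newspaper pages wherever a headline or photo caption spans several columns, since that one wide line bridges the gutters. You would need to detect and set aside such lines first, which is part of why engines with built-in column handling are attractive.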

My company, Atalasoft, makes (among other things) C# interfaces to OCR engines that behave consistently across engines, which makes it possible to compare them. You might get a trial of the GlyphReader engine and see how it works for you. It has a flag for setting up columnar recognition and is one of the few engines I've seen that will do that specifically, although I seem to recall that the Abbyy engine (which we no longer support) does pretty well with columnar layout. The downside is that GlyphReader has an internal limit on the total page dimensions. It really wants letter/legal/A4, and a typical newspaper scanned at a decent resolution may exceed that.
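A quick way to see whether your scans would run into a page-size limit like that is to convert pixel dimensions to inches. This is only a sketch: the legal-size limit below is an assumed stand-in, not the engine's documented figure, so check the real number with the vendor.

```python
LEGAL_INCHES = (8.5, 14.0)  # assumed stand-in for the engine's real limit


def physical_size(width_px, height_px, dpi):
    """Convert pixel dimensions to inches at the given scan resolution."""
    return width_px / dpi, height_px / dpi


def exceeds_limit(width_px, height_px, dpi, limit=LEGAL_INCHES):
    """True if the page is larger than the limit in either orientation."""
    short, long_ = sorted(physical_size(width_px, height_px, dpi))
    return short > limit[0] or long_ > limit[1]
```

For example, a hypothetical 4500 x 6600 pixel scan at 300 dpi works out to 15 x 22 inches, well past legal size, while a 2550 x 3300 pixel scan at the same resolution is exactly letter size.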
posted by plinth at 5:11 PM on November 28, 2011

recall that the Abbyy engine (which we no longer support) does pretty well with columnar layout.

I use the Abbyy commercial product. While it sounds like this project is pretty far above what I do, one thing that I have noticed about Abbyy is its ability to recognize columns and rows better than other OCR engines. In particular, it can recognize blocks of text that belong to the same column or row and isolate them from other text that belongs to a different section altogether. It also does great with spreadsheets, which are just hyper-column-row arrangements.

On the downside, Abbyy is sometimes not sophisticated enough to detect a column (or row) if there is no clear delineation of that column (or row), like a vertical line or a consistent end margin of text. It really depends on the quality of the source or even how the source was typeset in the first place.
posted by lampshade at 6:02 PM on November 28, 2011

I'm only 90% sure it will do column layouts, but Acrobat Pro has AWESOME OCR. I did a bit of research recently, and it seems to do about as much as the super-expensive applications that are tailored to OCR.
posted by nosila at 8:45 PM on November 28, 2011
