Good OCR options?
July 16, 2006 6:01 PM

Is there a good OCR program that can convert my old government science papers into nice electronic replicas?

When I did small projects like this 8 years ago I used an OCR program, clipped out the graphics one by one, and just kind of pasted it all back in PageMaker. Surely since those early years something has come out that has automated all this tedious work.

I'm hoping to thin out my paper library (and make things easily searchable) as well as share some of the stuff on my website with other people in the field. Ideally the end result will be PDFs.
posted by shannymara to Computers & Internet (4 answers total)
 
Acrobat Professional does a decent job at this. I have the best luck with making a "searchable" image, which puts a scan of the document on top and the OCRed text behind it, making it searchable (hence the name) and selectable. You can save the documents as text, or (often with very odd formatting) as Word, RTF, or HTML. Making a 10-page PDF for a 10-page document is a breeze. PDF supports a good set of metadata, which I make extensive use of on my Mac system.

Setting the scanner up correctly is crucial for avoiding errors. Scan in black and white (not grayscale or color) if at all possible, tweak the threshold, use 150-300 dpi (or more, if the text is really small), etc.
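To make the threshold and dpi advice concrete, here is a minimal Python sketch of what a black-and-white scan does to each pixel and how resolution translates into image size. The function names and the default threshold of 128 are assumptions for illustration only, not settings taken from Acrobat or any scanner driver.

```python
def binarize(gray_rows, threshold=128):
    """Turn 0-255 grayscale pixels into 1-bit values.

    Pixels darker than the threshold become 0 (ink); the rest become 1 (paper).
    The default of 128 is an assumed starting point, not a recommended value.
    """
    return [[0 if px < threshold else 1 for px in row] for row in gray_rows]


def pixels_for_page(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (in inches) at a given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# Raising the threshold makes faint gray text count as ink.
faint_text_row = [[10, 150, 210]]
low = binarize(faint_text_row, threshold=128)
high = binarize(faint_text_row, threshold=200)

# A US Letter page at the top of the suggested dpi range.
letter_at_300 = pixels_for_page(8.5, 11, 300)
```

With the lower threshold the middle pixel (150) is treated as paper, and with the higher one it is treated as ink, which is the kind of trade-off you are making when you tweak the threshold on faded originals.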

It's not cheap, though. It's something like $400 without an upgrade.
posted by teece at 6:51 PM on July 16, 2006


First off, most OCR is still pretty bad. It's really hard to do segmentation well.

Currently, one of the industry leaders is Abbyy. The app, FineReader, is fairly well put together and is fairly unsurprising in terms of UI. Expect to pay through the nose for PDF output. As far as I can tell, they license from Adobe for PDF generation.

ScanSoft is one of their leading competitors and their engine is OK, but they've clearly been asleep at the wheel in terms of features. They do have PDF in their feature set.

ExperVision does decent segmentation, but at the expense of some extra errors in the output. Their PDF output is an add-on and pricey. I'm also not impressed with it.

It's been a while since I looked at the Iris engine. It was supposedly pretty good, but I don't recall if it has PDF output.

Most of these packages also include RTF output at a minimum, sometimes full DOC output.

SimpleOcr is awful. Don't waste your time on it.

My experience comes from writing interface code to run these engines in an engine-neutral way.
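For readers wondering what "engine-neutral" might look like, here is a rough Python sketch under stated assumptions. Every name in it (OcrEngine, CannedEngine, recognize_all) is hypothetical, and none of it reflects the real SDKs of Abbyy, ScanSoft, ExperVision, or Iris; the stand-in engine just returns preset text.

```python
from abc import ABC, abstractmethod


class OcrEngine(ABC):
    """Hypothetical engine-neutral interface; not any vendor's real API."""

    @abstractmethod
    def recognize(self, image_path):
        """Return the recognized text for one scanned page image."""


class CannedEngine(OcrEngine):
    """Stand-in adapter that returns preset text in place of a real engine."""

    def __init__(self, pages):
        # pages maps an image path to the text a real engine might produce
        self.pages = pages

    def recognize(self, image_path):
        return self.pages.get(image_path, "")


def recognize_all(engine, image_paths):
    """Calling code depends only on OcrEngine, so engines can be swapped."""
    return {path: engine.recognize(path) for path in image_paths}


engine = CannedEngine({"page1.tif": "Abstract", "page2.tif": "Methods"})
results = recognize_all(engine, ["page1.tif", "page2.tif", "page3.tif"])
```

The idea is that each real engine would get its own adapter class behind the same interface, so comparing engines means swapping one object rather than rewriting the calling code.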
posted by plinth at 6:59 PM on July 16, 2006


Our company is getting ready to launch OnBase, whose OCR component is called Verity. Anyone know anything about that?
posted by I_Love_Bananas at 3:22 AM on July 17, 2006


OnBase is ScanSoft.
posted by plinth at 3:44 AM on July 17, 2006

