What is the best Optical Character Recognition software?
September 14, 2008 12:19 AM

I am looking for the best OCR (Optical Character Recognition) software on the market that can convert scanned images of handwritten, typewritten, or printed text into editable text.
posted by omaralarifi to Computers & Internet (9 answers total) 5 users marked this as a favorite
 
I hate to say this: my experience is that there aren't many.

I imagine you've already seen this chart on the OCR Wikipedia page. Yes, the chart is pretty spare. That's because the market is pretty spare. Nobody's developed this as far as it can go yet, unfortunately. I think it's a great field for further work, which is why it's gotten some of my interest as a maybe-I'll-try-that project. So far as I can tell (I'm no expert, so maybe somebody can correct me on this) the companies that are using it extensively now aren't using any commercially-available package but rather something they've brewed themselves.

The most direct answer to your question is this, I think:

The best combination of user-friendliness and accuracy that I've seen is the Microsoft Office Document Imaging (MODI) module that comes with Microsoft Office. It's a GUI: you can click a few things, OCR a document, and then copy the text and paste it into another document, if that's all you really want. Some people have even interfaced with it through COM and done some C# coding against it; in plain English, if you have a good programmer, you might be able to cook up your own homebrew solution there.

I'm fairly certain that Microsoft acquired the software it uses for this from OmniPage. So far as I can see, it's intended solely as a feature, and I don't think Microsoft is putting any development at all into it. It is odd and vexing that no one appears willing to bring better, more accessible OCR to market, although maybe someone has and I'm just not aware of it.

A more accurate, but less immediately user-friendly, solution is the Tesseract engine. I believe that Tesseract is probably the best hope for OCR at the moment; at the least, it represents the most continuous research into the problems involved and the best presentation of that research. It was developed at HP labs (here in Colorado!) between 1985 and 1995 and got a lot of prolonged effort from some very good minds. Then, in 2005, HP was good enough to open up the source, and Google took up development under the Apache license.

The downside is that Tesseract is command-line only at the moment. That means opening up "Run" and typing in a command whenever you want something OCR'd. It also doesn't do layout analysis - that is, it doesn't split the page up into separate areas - although hardly any OCR packages do, and I can attest that, though MODI seems to try, it usually fails.
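
For what it's worth, the command-line step can be scripted so you don't have to type it each time. Here is a minimal sketch in Python, assuming the tesseract binary is installed and on your PATH and using its basic "tesseract <image> <outputbase>" form. The function names and the file names (scan.tif, scan_out) are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang=None):
    """Return the argument list for: tesseract <image> <output_base> [-l <lang>]."""
    cmd = ["tesseract", str(image_path), str(output_base)]
    if lang:
        cmd += ["-l", lang]
    return cmd


def ocr_image(image_path, output_base, lang=None):
    """Run Tesseract and return the recognized text, or None if it isn't installed."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name on its own.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file names; prints the command rather than running it.
    print(" ".join(build_tesseract_command("scan.tif", "scan_out", lang="eng")))
```

Calling ocr_image("scan.tif", "scan_out") would then hand back the text as a string, which you could paste or save wherever you like. Keeping the command-building separate from the running makes it easy to check what will be executed first.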

The upside is that it's more accurate than MODI, sometimes by a lot. (I think the tests showed a percentage point difference, which is a heck of a lot when it comes to OCR.) Also, it's much more open to development, and, since it's on an Apache license, if you can find a good programmer, you can set something up for a company that needs it. Also, there's good ongoing research into it; there were a few guys teaching it to do Russian last I checked, which I think is pretty cool.
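
To see why a single percentage point matters, here is a rough back-of-the-envelope sketch in Python. The page length (2,000 characters) and the two accuracy figures (98% and 99%) are assumptions picked for illustration, not numbers from the tests mentioned above.

```python
def expected_errors(char_count, accuracy):
    """Expected number of misrecognized characters at a given character accuracy."""
    return round(char_count * (1.0 - accuracy))


PAGE_CHARS = 2000  # assumed page length, for illustration only

for accuracy in (0.98, 0.99):
    errors = expected_errors(PAGE_CHARS, accuracy)
    print(f"{accuracy:.0%} accurate: about {errors} errors per page")
```

Under those assumptions, going from 98% to 99% cuts the corrections you'd make by hand from about 40 per page to about 20, so one point of accuracy halves the proofreading work.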

I have heard that OmniPage is good. I'd thought they had some kind of deal with Microsoft, like I said above, on their software. I don't quite know, but I do know I encountered their name a bit whilst digging through the Microsoft OCR mechanism.

Sorry I can't offer you a prepackaged solution, but I don't know of any. If anyone does, they're welcome to speak up: who's heard of good OCR programs? I'd like to know myself.
posted by koeselitz at 2:00 AM on September 14, 2008 [1 favorite]


Sorry; missed a link. I referred to this chart on Wikipedia.
posted by koeselitz at 2:01 AM on September 14, 2008


Acrobat Professional - Mac & Windows
DevonThink Pro - Mac

Acrobat is a bit screwy for me on my Mac, but it works, kinda. I have both versions 7 and 8 Professional, and while 7 is slower, it seems more reliable. I haven't tried DevonThink Pro, but AFAIK it does OCR well enough to be usable.
http://www.devon-technologies.com/products/devonthink/index.html

FYI, I received the relevant copies of Acrobat when purchasing a Fujitsu ScanSnap for my Mac. Now this is an awesome document scanner - insert documents, press the button, and enjoy double-sided OCR scans in, well, a snap.
posted by newformula at 2:34 AM on September 14, 2008


A while back, in my search for the best OCR software, all signs seemed to point towards ABBYY FineReader (Windows). It seemed a polished product to me, and gave good results.
posted by simplesharps at 3:38 AM on September 14, 2008


Another vote for FineReader.

Acrobat's built-in OCR is now far better than it was circa versions 5 and 6.
posted by yclipse at 4:59 AM on September 14, 2008


We use ABBYY FineReader at work (library digitization program), and it's supposed to be the best.
posted by MsMolly at 12:45 PM on September 14, 2008


If you just need a PDF document where you can highlight and copy text (in my case, academic articles that had been scanned), Acrobat works well most of the time (although it occasionally gets the page structure totally, totally wrong). On the other hand, if you have something in mind other than just document markup and copying text to the clipboard, you probably want to look elsewhere.
posted by LMGM at 5:13 PM on September 14, 2008


I think Tesseract works very well, even though it's been purchased by Google.
posted by mannyosu at 8:26 AM on September 16, 2008


Minor quibble: Tesseract is open-source, and therefore wasn't "purchased" by Google - it's under the Apache license. Any company is free to use it if they like, and it doesn't appear to me that Google is putting much development manpower on the project, if they're putting any at all. (Apologies if I'm wrong about that.) Rather, they've been kind enough to offer hosting and to encourage others to work on the project rather than allow it to languish. That's what works best for open-source software - anybody who wants to can help with it.
posted by koeselitz at 1:34 PM on September 17, 2008

