OCR, emphasizing the R
May 6, 2014 2:02 PM

Does anyone have any recommendations for OCR software that focuses on the "recognition" part?

Basically I want something that can tell me "there is probably text in this image", which would allow me to then transcribe it by hand.
posted by dilaudid to Technology (6 answers total)
 
I'd start by running normal OCR over the image and paying attention only to a "true/false" result: did the software find anything or not? You just throw away what it thinks the text says.
posted by rhizome at 2:20 PM on May 6, 2014
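
A minimal sketch of the "true/false" idea above, assuming Python with the pytesseract wrapper and Pillow (neither is named in the thread). The function names and the `min_chars` cutoff are hypothetical and would need tuning on real images.

```python
def looks_like_text(ocr_output: str, min_chars: int = 4) -> bool:
    """Reduce OCR output to a yes/no answer, ignoring what it actually says."""
    # Count only letters and digits; noise often comes back as stray punctuation.
    alnum = sum(ch.isalnum() for ch in ocr_output)
    return alnum >= min_chars


def image_probably_has_text(path: str) -> bool:
    """Run OCR on one image file and keep only the boolean result."""
    # Imported lazily so looks_like_text() works without an OCR engine installed.
    from PIL import Image
    import pytesseract

    return looks_like_text(pytesseract.image_to_string(Image.open(path)))
```

On very noisy images a small `min_chars` will likely produce false positives, so the cutoff is a starting point rather than a recommendation.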


Response by poster: I should mention that the source images are very noisy, with lots of (non textual) background stuff, and potentially low resolution.
posted by dilaudid at 2:33 PM on May 6, 2014


OmniPage (I think they're up to version 18) does a pretty good job of dividing a page into blocks of images and text.
posted by Melismata at 2:44 PM on May 6, 2014


I haven't used it, but the Tesseract OCR engine supposedly returns confidence levels on recognized characters, so you could look for scores above a certain threshold.

Alternatively, you could perhaps train OpenCV to recognize the presence of characters.
posted by zippy at 2:49 PM on May 6, 2014
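
The confidence-threshold idea above might look roughly like this, assuming the pytesseract wrapper, whose `image_to_data` call reports a confidence per recognized word (not per character). The `min_conf` and `min_words` values are hypothetical placeholders.

```python
def confident_words(data: dict, min_conf: float = 60.0) -> list:
    """Return the non-empty words whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # Confidences may arrive as strings or numbers; -1 marks non-word rows.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def probably_has_text(data: dict, min_conf: float = 60.0, min_words: int = 2) -> bool:
    """Flag an image when enough words clear the confidence threshold."""
    return len(confident_words(data, min_conf)) >= min_words


def ocr_data(path: str) -> dict:
    """Fetch word-level OCR results as a dict of parallel lists."""
    # Imported lazily so the functions above can be tried without Tesseract.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
```

Usage would be something like `probably_has_text(ocr_data("scan.png"))`, where the file name is only an example.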


You might use Tesseract to try to pull text out of your input image, then run the OCRed text against a dictionary. The fraction of words that positively match dictionary entries gives you a score for each image. Given a set of known positive and known negative inputs, you can use that score to train a binary classifier. Once trained, you could run it against arbitrary inputs to pick out the ones you want to look at more closely (i.e., transcribe by hand).
posted by Blazecock Pileon at 3:04 PM on May 6, 2014
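
The dictionary check described above can be sketched as follows. This covers only the scoring step, with a fixed cutoff standing in for the trained classifier; the function names, the word set, and the `min_rate` value are all hypothetical.

```python
import re


def dictionary_hit_rate(ocr_text: str, dictionary: set) -> float:
    """Fraction of alphabetic tokens in the OCR output found in the dictionary."""
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in dictionary for token in tokens)
    return hits / len(tokens)


def worth_transcribing(ocr_text: str, dictionary: set, min_rate: float = 0.5) -> bool:
    """Crude stand-in for a classifier: accept when enough tokens are real words."""
    return dictionary_hit_rate(ocr_text, dictionary) >= min_rate
```

With labelled positive and negative images, the hit rate could instead be fed to a classifier as a feature, and the cutoff learned rather than guessed.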


I use ABBYY FineReader all the time.

However, given that you are just looking for the likelihood of text within an image, you have a variety of options. Here is a small list of options and an online option.

One trick that I have found to be useful when running OCR on "noisy" images is to convert that image to black/white or greyscale. It gets rid of a lot of speckling and the OCR engine does not have to evaluate differences in color.
posted by lampshade at 2:09 AM on May 7, 2014
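
To illustrate the greyscale and black/white trick above, here is a dependency-free sketch that works on rows of (R, G, B) tuples. The fixed threshold of 128 is an assumption; noisy scans may need a different or adaptive value. In practice an imaging library would do this step, for example Pillow's `Image.convert("L")`, which uses the same luma weights.

```python
def to_grey(pixel):
    """Convert one (R, G, B) pixel to a 0-255 grey level (ITU-R 601 luma weights)."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def to_black_and_white(rows, threshold=128):
    """Map each pixel to 0 (black) or 255 (white) around a fixed threshold."""
    return [[255 if to_grey(p) >= threshold else 0 for p in row] for row in rows]
```

The black/white output would then be handed to the OCR engine in place of the original color image.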

