OCR, emphasizing the R
May 6, 2014 2:02 PM
Does anyone have any recommendations for OCR software that focuses on the "recognition" part?
Basically I want something that can tell me "there is probably text in this image", which would allow me to then transcribe it by hand.
Response by poster: I should mention that the source images are very noisy, with lots of (non textual) background stuff, and potentially low resolution.
posted by dilaudid at 2:33 PM on May 6, 2014
OmniPage (I think they're up to version 18) does a pretty good job of dividing a page into blocks of images and text.
posted by Melismata at 2:44 PM on May 6, 2014
I haven't used it, but the Tesseract OCR engine supposedly returns confidence levels on recognized characters, so you could look for scores above a certain threshold.
Alternately, you could train OpenCV to recognize the presence of characters, perhaps.
posted by zippy at 2:49 PM on May 6, 2014
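A rough sketch of the confidence-threshold idea from the comment above, in Python. It assumes the `pytesseract` wrapper and Pillow are installed; `pytesseract.image_to_data` reports confidences per word rather than per character, so the filter works on words. The function names, the cutoff of 60, and the minimum word count are illustrative guesses, not recommended values.

```python
def confident_words(data, min_conf=60.0):
    """Return the non-empty words whose confidence is at least min_conf.

    `data` is a dict with parallel "text" and "conf" lists, the shape that
    pytesseract.image_to_data(..., output_type=Output.DICT) returns.
    Layout-only rows carry a confidence of -1 and empty text.
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may arrive as int, float, or string depending on the version
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def probably_has_text(data, min_conf=60.0, min_words=2):
    """Flag an image for hand transcription if enough confident words appear."""
    return len(confident_words(data, min_conf)) >= min_words


def ocr_data(image_path):
    """Run Tesseract on one image (assumes pytesseract and Pillow are installed)."""
    # Imported here so the filtering helpers above work without Tesseract.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
```

Calling `probably_has_text(ocr_data("scan.png"))` (the file name is a placeholder) would then give a yes/no answer per image. On very noisy sources the thresholds would likely need tuning against a handful of images whose contents are already known.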
You might use Tesseract to try to pull text out of your input image. Then run the OCRed text against a dictionary. If you get above a certain fraction of words that positively match dictionary entries, then you can use that to train a binary classifier, given a set of known positive and known negative inputs. Once trained, you could run it against arbitrary inputs, to filter into a set of inputs that you want to look at more closely (i.e., transcribe by hand).
posted by Blazecock Pileon at 3:04 PM on May 6, 2014
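The dictionary check described above could look something like this in Python. This is only the scoring step: it takes text that some OCR engine has already produced and a word set you supply, and it uses a fixed cutoff instead of the trained binary classifier the comment goes on to suggest. The function names, the three-letter minimum, and the 0.5 cutoff are assumptions for illustration.

```python
import re


def dictionary_hit_rate(ocr_text, dictionary):
    """Fraction of OCRed tokens that appear in `dictionary` (a set of lowercase words).

    Tokens shorter than three letters are ignored, since OCR noise tends to
    produce many stray one- and two-letter fragments.
    """
    tokens = [t for t in re.findall(r"[a-z]+", ocr_text.lower()) if len(t) >= 3]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in dictionary)
    return hits / len(tokens)


def looks_like_text(ocr_text, dictionary, min_rate=0.5):
    """Simple fixed-threshold stand-in for a trained classifier."""
    return dictionary_hit_rate(ocr_text, dictionary) >= min_rate
```

The hit rate could also serve as one input feature if you do go on to train a classifier on known positive and negative images.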
I use Abbyy Fine Reader all the time.
However, given that you are just looking for the likelihood of text within an image, you have a variety of options. Here is a small list of options and an online option.
One trick that I have found to be useful when running OCR on "noisy" images is to convert that image to black/white or greyscale. It gets rid of a lot of speckling and the OCR engine does not have to evaluate differences in color.
posted by lampshade at 2:09 AM on May 7, 2014
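To make the greyscale and black/white trick above concrete, here is a small Python sketch that works on plain nested lists of RGB tuples so it runs without any imaging library. The luminance weights are the common ITU-R 601 ones; the threshold of 128 and the function names are arbitrary choices for illustration. In practice an imaging library would do this for you (Pillow's `Image.convert("L")`, for instance, performs a similar greyscale conversion).

```python
def to_greyscale(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 grey values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in pixels
    ]


def to_black_white(grey, threshold=128):
    """Map each grey value to pure white (255) or pure black (0)."""
    return [[255 if v >= threshold else 0 for v in row] for row in grey]
```

A single global threshold like this may wipe out text on unevenly lit backgrounds, so it is worth trying a few threshold values before feeding the result to the OCR engine.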
posted by rhizome at 2:20 PM on May 6, 2014