OCRing a non-standard font
August 26, 2014 3:31 PM

I have a dead-tree book that I want to use OCR software on, but it has a strange font. My attempts so far have not been very successful.

I've used FreeOCR but the result is gibberish. In case it's relevant, I'm using a Win7 system and a Canon MP495 scanner. I can do the formatting side of things myself. I just need the raw text.

Is there some way to "train" the OCR software to recognise the font? Or maybe some better software? Or any other ideas?
posted by Solomon to Computers & Internet (4 answers total) 2 users marked this as a favorite
 
Given that it's a cursive font, maybe try handwriting OCR.
posted by ambrosen at 3:38 PM on August 26, 2014 [2 favorites]


I can't find any OCR-specific information, but that there is the old Selectric Script typeface. I have to think it's been dealt with before.
posted by rhizome at 3:49 PM on August 26, 2014 [3 favorites]


Best answer: I've got ABBYY FineReader ($169), which can be trained to recognize an unknown font. I tested it on your little sample, and by the end it had learned to recognize tricky repeated characters such as "l" and "t." If you want to shell out for ABBYY and replicate my process, go to Tools>Options and set it to "read with training" (which is turned off by default). I also restricted the fonts used for analysis to Lucida Handwriting, which is the closest script font on my computer. I don't know if that made a difference, but after it encounters the same character a few times (with you showing it, through the training interface, where the character boundaries are), it seems to be able to pick things up. Readiris ($129) is a similar piece of software with similar capabilities (not quite as many bells and whistles as ABBYY, but those may not be ones you need). Unfortunately, my copy of Readiris is kind of old and crashed when I tried to test it on your sample (though it seemed to be doing a pretty good job until it crashed). If you go this route and have basic questions about using the software, feel free to drop me a line.
posted by drlith at 5:46 PM on August 26, 2014 [4 favorites]
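
For anyone wondering what "read with training" amounts to, here is a toy sketch of the general idea: the user labels a few glyphs, and later glyphs are matched to the closest stored pattern. This is not how FineReader or Readiris work internally. The class name, the 3x3 bitmaps, and the pixel-difference matching are all made up for illustration.

```python
# Toy illustration only: NOT the algorithm used by any real OCR product.
# Glyphs are tiny 3x3 bitmaps written as strings of '#' (ink) and '.' (blank).

def hamming(a, b):
    """Count the pixels at which two equal-sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


class TrainableRecognizer:
    """Hypothetical recognizer that learns glyphs from user-supplied labels."""

    def __init__(self):
        self.patterns = []  # (bitmap, label) pairs confirmed by the user

    def train(self, bitmap, label):
        """Store a labelled glyph, like confirming a character in a training dialog."""
        self.patterns.append((bitmap, label))

    def recognize(self, bitmap):
        """Return the label of the closest stored pattern, or '?' if nothing is stored."""
        if not self.patterns:
            return "?"
        best = min(self.patterns, key=lambda p: hamming(p[0], bitmap))
        return best[1]


# Made-up 3x3 glyphs for "l" and "t", flattened row by row.
GLYPH_L = "#.." "#.." "##."
GLYPH_T = "###" ".#." ".#."

ocr = TrainableRecognizer()
ocr.train(GLYPH_L, "l")
ocr.train(GLYPH_T, "t")

# A "t" with one stray pixel is still closer to the trained "t" than to the "l".
NOISY_T = "###" ".#." ".##"
print(ocr.recognize(NOISY_T))
```

Real packages work on full-resolution scans and use far more robust matching, but the loop is the same in spirit: each character you confirm becomes another pattern the software can match against.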


It's intended for recognizing historical typeset text, but the Berkeley NLP Group recently released Ocular.
Ocular can recognize collections of documents that use historical fonts. The system is unsupervised: you don't need document images that are labeled with human transcriptions in order to learn a particular historical font. Instead, Ocular learns the font directly, straight from the set of input document images you want transcribed.
posted by zamboni at 7:40 PM on August 27, 2014

