Strategy to proofread document from OCR
August 12, 2014 12:33 PM   Subscribe

I have a document in English of about 1500 pages that was originally derived from OCR scans of varying quality. I've proofread manually and by spell-check. What is the best strategy to eliminate the remaining errors?

Originally I found sometimes 2 errors per 50 pages, sometimes 100. Since I have already proofread the document both by careful spell-checking (quite a few arcane words made this a slow process) and by reading it page by page and weeding out errors, I'm guessing only a few dozen remain. My goal is to greatly minimize the incidence of errors, hopefully down to single digits. Another read-through would eliminate most of those, but it is a daunting task and, frankly, I would have to fight eye-glaze and might gloss over the remaining errors.

Is there a resource that documents common OCR errors that are not picked up by spell-check? (Letter misrecognition that converts one word into a different, but still valid, word. A classic example would be modern to modem.) The following link is somewhat helpful but could use more word examples rather than letter combinations.

Is there another strategy to optimize reviewing this document?
posted by dances_with_sneetches to Computers & Internet (11 answers total) 9 users marked this as a favorite
Setting it in a font adapted for proofreading OCRed text wouldn't eliminate the need for another round of proofreading but might help prevent any of the remaining errors from slipping through.

Another strategy might include searching for, or highlighting, letter pairs or digits that are commonly mis-rendered (e.g. the "Very Unsafe List" in your link). You are probably aware of some of these already, and looking at the design of the font might suggest more.
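A minimal Python sketch of that kind of search (the pair list below is a small guess of my own, not the "Very Unsafe List" from the link):

```python
import re

# Sequences OCR commonly confuses -- an illustrative sample only:
# rn~m, cl~d, li~h, vv~w, digit 1 ~ letter l.
RISKY = ["rn", "cl", "li", "vv", "1"]

def risky_words(text):
    """Return the distinct words containing any risky letter pair."""
    pattern = "|".join(map(re.escape, RISKY))
    return sorted({w for w in re.findall(r"\w+", text)
                   if re.search(pattern, w)})

print(risky_words("The modern clerk said hello to 1ittle Tim"))
# -> ['1ittle', 'clerk', 'modern']
```

Most hits will be legitimate words, of course; the point is just to shrink the set of words you have to eyeball.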

Other than that, can you break it up into manageable portions and recruit others to help you out?
posted by pullayup at 1:08 PM on August 12, 2014

You could try reading it backwards. This is a trick used by people proof-reading their own writing. If I write a paragraph and make an error, when I proof-read it from start to finish, I will often miss the error because I know what I intended to say and my mind will fill in or gloss over what was actually written. Reading it in reverse can make an error jump out at you and not get glossed over.
posted by Michele in California at 1:53 PM on August 12, 2014

Some of the ways that we used to proof-read dictionaries: spell-check by word frequency against a corpus (tricky, unless you have corpora lying around); simple word-frequency lists - the least-frequent 10-15 words in your document are likely typos; various regex checks (like numbers in the middle of words, or repeated words, or unexpected capitalization, or ...). You should be able to convert the file to text, then develop your tests in awk, or similar.
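For what it's worth, the frequency-list and regex checks above are easy to sketch in Python instead of awk (the patterns here are examples, not an exhaustive set):

```python
import re
from collections import Counter

def rarest_words(text, n=15):
    """The n least-frequent words -- disproportionately likely typos."""
    freq = Counter(w.lower() for w in re.findall(r"[A-Za-z']+", text))
    return [w for w, _ in freq.most_common()][-n:]

def regex_suspects(text):
    """Flag things spell-check misses: digits inside words,
    doubled words, a capital letter in the middle of a word."""
    checks = {
        "digit inside word": r"\b[A-Za-z]+[0-9]+[A-Za-z]+\b",
        "repeated word": r"\b(\w+)[ \t]+\1\b",
        "mid-word capital": r"\b[a-z]+[A-Z]\w*\b",
    }
    hits = []
    for name, pattern in checks.items():
        for m in re.finditer(pattern, text):
            hits.append((name, m.group(0)))
    return hits

sample = "He dialed the the modem. The c1erk smiled. A strange tHing."
print(regex_suspects(sample))
```

You'd run the whole book's text through these and review the (short) list of hits rather than the whole document.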

If you're reading printed prose, read through a 1-3 line cardboard aperture to isolate the words from context. Reading the thing upside down also helps to break up context (and mad reading upside down skills are essential in any business context these days). I would find that proof-reading font very slow, as fixed-width characters are slow to read, and the general fugliness would make me seethe ...
posted by scruss at 2:37 PM on August 12, 2014 [2 favorites]

If at all possible, find another awesome proofreader to do another pass. There's no substitute for another set of eyes, no matter how good you are.
posted by fiercecupcake at 2:51 PM on August 12, 2014 [1 favorite]

The tool that the Distributed Proofreaders community uses for this exact task is called gutcheck. It is also compiled into a more fully featured tool called guiguts, which may be more than you need since it includes a lot of formatting tools, but it also offers conveniences like easy word-frequency-list generation and the like.

The term for a word-level "typo" caused by OCR error is "scanno" - that may help you in searching for a common scanno list.
posted by muddgirl at 3:32 PM on August 12, 2014 [6 favorites]

I forgot to mention that guiguts also has a bunch of the regex searches that scruss mentions built-in.
posted by muddgirl at 3:35 PM on August 12, 2014

PPQT is a program built for Distributed Proofreaders users who prepare books. It will highlight a list of "scannos" (legitimate but possibly wrong words read by OCR) throughout the document, and allow the list to be modified. (Or use your own list, one word per line, case sensitive. Load via "File/Scannos". "View/Scannos" turns the highlighting on and off.)

This feature, as you mention, is one of the best remaining heuristic methods for finding problematic words.

It's very "locally used" software, so you'll have to trust me that the builds for different operating systems here are legitimate software (here is the github master). This YouTube introduction may or may not assist your use case much.

Other useful features are the word and character counts. There are tabs for each within the right panel.

Another way to clean up an OCR document is with regular expressions, especially for punctuation. The expressions themselves aren't built into the software, but I can provide a few -- e.g. should a period really be before that lowercase letter, or is the period OCR garbage? -- that you can run through the software's Find tab.
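To give a flavor of what I mean, here's a quick Python sketch -- these patterns are my illustrations, not the ones shipped with PPQT, and in practice you'd paste the raw regexes into the Find tab rather than run a script:

```python
import re

# Illustrative punctuation checks for OCR text.
PUNCT_CHECKS = {
    # a period immediately followed by a lowercase letter ("b.ad")
    "period before lowercase": r"\.[a-z]",
    # a space before punctuation, a common OCR artifact ("over ,I")
    "space before punctuation": r"\s[,.;:!?]",
    # doubled punctuation (",," or ";;" or a stray "..")
    "doubled punctuation": r",,|;;|\.\.(?!\.)",
}

def punctuation_suspects(text):
    """Return (check name, offset, matched text) for each hit."""
    hits = []
    for name, pattern in PUNCT_CHECKS.items():
        for m in re.finditer(pattern, text):
            hits.append((name, m.start(), m.group(0)))
    return hits

print(punctuation_suspects("It was over ,I think. b.ad OCR,, indeed."))
```

Each hit still needs a human eye -- abbreviations and ellipses produce legitimate matches -- but the list is far shorter than the book.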

I won't go into more detail for now unless you pursue this method. Either of these two approaches (scannos & regexes for punctuation checking) will probably improve the document on its own if you're willing to eyeball each page again, and doing both probably will even more. It depends on how funky the OCR was in the first place.
posted by sylvanshine at 3:42 PM on August 12, 2014

[muddgirl mentions another toolset used at DP. :-) The problem with Gutcheck is that its output is 99.8% false-positive junk. In my opinion! You'll also be staring at a wall of command-line output, tracking down line/column numbers; unless you use it inside GuiGuts, which can be hard to install. It's built in Perl and is no longer maintained.]
posted by sylvanshine at 3:53 PM on August 12, 2014

Those DP tools look lovely, but seem to be hard-coded to the requirements of Project Gutenberg: they assume certain file formats, and that single-page scans are available in a particular folder. It doesn't sound like the OP has this setup.
posted by scruss at 4:43 PM on August 12, 2014

scruss, I'd say that the DP tools have a superset of the features that dances_with_sneetches actually needs, but they definitely have features that are relevant to dws's problem. Sure, the tools can display the original page image, or muck around with text wrapping and HTML, but you don't need to have or do those things with these tools to sort of "heuristically proofread" a document with them; especially the PPQT one, IMO. The OP's question is one part of the DP workflow, but an important one, and the tools deal with it. I've used that subset of PPQT myself on OCR projects.

In short, if the proofread text still has "He said be bad a lovely tine arid would visit again", PPQT is going to highlight four words for review in that sentence.
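To make that concrete, a toy version of the scanno check in Python (the scanno list here is a tiny sample of my own; PPQT loads a full list from a one-word-per-line file):

```python
import re

# A tiny illustrative scanno list: real words that are often the
# product of OCR misreads (he->be, had->bad, time->tine, and->arid).
SCANNOS = {"be", "bad", "tine", "arid", "modem", "carnage"}

def flag_scannos(sentence):
    """Return the words in sentence that appear on the scanno list."""
    return [w for w in re.findall(r"[A-Za-z]+", sentence)
            if w.lower() in SCANNOS]

print(flag_scannos("He said be bad a lovely tine arid would visit again"))
# -> ['be', 'bad', 'tine', 'arid']
```

Note that every flagged word is a legitimate English word, so spell-check sails right past them -- which is the whole point of the scanno list.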
posted by sylvanshine at 7:34 PM on August 12, 2014 [1 favorite]

What I ended up doing was searching on the term "scannos," as suggested in the answer above, which turned up more discussions of the problem. Then I put together my own strategy. Since I've already eliminated 99% of the scannos, I compared my original file with my recently edited file. That way I found the most common scannos and checked again for those. I'm still doing that, but I've caught about 20 overlooked scannos in 400 pages. Beyond that, I am carefully reviewing 100 pages by hand, which lets me gauge about how many errors remain. If it is still a lot (and I don't think it will be) I will need to add another layer of strategy, possibly reviewing each chapter by hand. If I am at 2 or fewer scannos per 100 pages, I'll be satisfied.
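For anyone trying the same thing, the comparison step can be sketched in Python with difflib (file reading is omitted; the sample words are hypothetical):

```python
import difflib
from collections import Counter

def common_scannos(raw_words, fixed_words):
    """Diff the raw OCR word list against the corrected one and tally
    which one-for-one word substitutions came up most often -- those
    are the most common scannos to search for again."""
    sm = difflib.SequenceMatcher(a=raw_words, b=fixed_words)
    subs = Counter()
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "replace" and (a2 - a1) == (b2 - b1):
            for bad, good in zip(raw_words[a1:a2], fixed_words[b1:b2]):
                subs[(bad, good)] += 1
    return subs

raw = "the modem world is full of carnage".split()
fixed = "the modern world is full of cabbage".split()
print(common_scannos(raw, fixed).most_common())
```

The top of that tally is then a ready-made search list for another pass over the whole document.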
posted by dances_with_sneetches at 6:55 AM on August 14, 2014 [1 favorite]
