Strategy to proofread document from OCR
August 12, 2014 12:33 PM
I have a document in English of about 1500 pages that was originally derived from OCR scans of varying quality. I've proofread manually and by spell-check. What is the best strategy to eliminate the remaining errors?
Originally I found anywhere from 2 to 100 errors per 50 pages. Since I have already proofread the document both by careful spell-checking (quite a few arcane words made this a slow process) and by reading it page by page and weeding out errors, I'm guessing only a few dozen remain. My goal is to reduce the number of errors as far as possible, ideally to single digits. Another read-through would eliminate most of them, but it is a daunting task and, frankly, I would have to fight eye-glaze and might gloss over the remaining errors.
Is there a resource that documents common OCR errors that spell-check does not catch? (That is, misrecognized letters that turn one word into a different, correctly spelled but wrong word. A classic example would be "modern" becoming "modem".) The following link is somewhat helpful but could use more word examples rather than letter combinations.
Is there another strategy to optimize reviewing this document?
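To make the "modern to modem" kind of error concrete, here is a minimal Python sketch of one possible check: for every word in the text, undo one letter-level confusion and see whether the result is also a real word. Everything in it is an assumption for illustration: the `CONFUSIONS` pairs are a hypothetical, far-from-complete list, and `variants`, `suspects`, and the tiny `vocabulary` set are made-up names, not part of any existing tool.

```python
import re

# Hypothetical confusion pairs: (letters on the page, what OCR may have read).
# Illustrative only; a real list would be longer and tuned to the scans.
CONFUSIONS = [("rn", "m"), ("cl", "d"), ("vv", "w"), ("li", "h")]


def variants(word):
    """Yield every word formed by undoing one confusion at one position."""
    for intended, misread in CONFUSIONS:
        start = word.find(misread)
        while start != -1:
            yield word[:start] + intended + word[start + len(misread):]
            start = word.find(misread, start + 1)


def suspects(text, vocabulary):
    """Return (line_number, word, alternatives) for words that become
    another vocabulary word when one assumed OCR confusion is undone."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            alternatives = sorted(
                {v for v in variants(word.lower()) if v in vocabulary}
            )
            if alternatives:
                hits.append((line_no, word, alternatives))
    return hits


# Tiny made-up vocabulary and sample text, just to show the shape of the output.
vocabulary = {"modern", "modem", "clear", "dear", "the", "art", "is"}
sample = "The modem art\nis dear"
report = suspects(sample, vocabulary)
for line_no, word, alternatives in report:
    print(f"line {line_no}: '{word}' could be {alternatives}")
```

A check like this only narrows down where to look. It will flag many words that are perfectly correct, so each hit still has to be read in context, and the vocabulary would need to come from a full word list plus the document's own arcane terms.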