Image processing of scanned text
April 9, 2009 2:26 PM

I have 18 copies of a 47-page document, scanned with handwriting on them. I want to extract the handwritten bits (i.e. compare, page-by-page, and eliminate the "constant" part), despite skewing, offset, and some noise in some copies. I want to use Perl or Python with e.g. ImageMagick or gd or something. Any pointers? I'm not talking about OCR -- just comparison, with one output being the graphical bits that don't match.

In case you're wondering, this is documentation for a clinical trial. The clinical history forms are printed, distributed to the physicians participating in the trial, filled out, stepped on, fed to the dog, and mailed back to the study center. Years later, lawyers give reams of this paper to translators for discovery during a lawsuit in a different country. Yes, you are paying for that when you fill a prescription, why do you ask? But I digress.
posted by Michael Roberts to Computers & Internet (12 answers total)
 
You'll need a reference image to subtract, so you should scan some originals if you can. I don't know specifically how I would do this, but I would probably ask the hugin mailing list, as this is a similar problem. You want to auto-align & distort, then subtract images. Hugin auto-aligns and then blends images. Hugin is based on command line tools (all mysterious and opaque to me) and lends itself well to batch processing if you know what you're doing.

Hopefully someone smarter than me will answer your question.
posted by chairface at 3:21 PM on April 9, 2009


Getting good enough automated alignment to do the subtraction will be the problem. Any chance the handwriting is in a different color from the printed text? That wouldn't be so hard to extract.
posted by DarkForest at 4:51 PM on April 9, 2009


Response by poster: Nope, DarkForest, no such luck; these are regular black-and-white scans, probably Xeroxed at some point to stick into the files, then scanned as PDFs now.

But right, alignment is the key; if I had a clue on how best to approach that, the problem would be 70% solved.

chairface -- that's a good suggestion; I'll see what command-line tools hugin is using; I played around with hugin itself a couple of years ago to see what all that panorama stitching stuff was all about.
posted by Michael Roberts at 5:49 PM on April 9, 2009


Yeah, alignment is the hard part.

Here's a stupid brute force way that might work; the advantage is that you don't need any clever feature-finding AI...

Given: two scans (A and B) of the same page

Algorithm for finding alignment:
1. Reduce both scans to 1-bit color
2. Find the values of offset_x, offset_y, and rotation_angle that minimize this function:

Translate A by (offset_x, offset_y) and rotate by rotation_angle
Overlay the translated, rotated A on B (e.g., binary OR, treating black pixels as 1s)
Return the number of black pixels in the resulting image

3. These values are a likely alignment

To do this, you'd have to explore a wide enough swath of offset_x, offset_y, rotation_angle space to account for differences in scans. It might take a long time, but processors are fast these days, and your data set isn't that big. To speed this up, you could work with only a subset of each image (e.g., the upper left corner) and/or scale the images down.
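
For what it's worth, here is a rough sketch of that search in Python. To keep it dependency-free, each 1-bit scan is represented as a set of (x, y) black-pixel coordinates, which is a toy stand-in; a real run would hand the rotate/overlay/count steps to an imaging library. The names `transform`, `overlay_score`, and `find_alignment`, the rotate-then-translate order, and the rotation center are all assumptions made for illustration, not anything prescribed above.

```python
import math
from itertools import product


def transform(pixels, dx, dy, angle_deg, center):
    """Rotate black-pixel coordinates about `center`, then translate by (dx, dy)."""
    cx, cy = center
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    moved = set()
    for x, y in pixels:
        rx = cx + (x - cx) * cos_a - (y - cy) * sin_a
        ry = cy + (x - cx) * sin_a + (y - cy) * cos_a
        # Snap back to the integer pixel grid.
        moved.add((round(rx + dx), round(ry + dy)))
    return moved


def overlay_score(a, b, dx, dy, angle_deg, center):
    """Number of black pixels in the OR of transformed A and B (lower is better)."""
    return len(transform(a, dx, dy, angle_deg, center) | b)


def find_alignment(a, b, offsets, angles, center):
    """Brute-force search; returns the (dx, dy, angle_deg) with the lowest score."""
    candidates = product(offsets, offsets, angles)
    return min(candidates, key=lambda c: overlay_score(a, b, *c, center))
```

Usage would look something like `find_alignment(a, b, range(-5, 6), (-5.0, 0.0, 5.0), (20, 20))`. The cost grows with the product of the three search ranges, which is why cropping or downscaling first matters.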

P.S. Don't do individual pixel operations in a scripting language; the slowness will kill you.
posted by qxntpqbbbqxl at 5:58 PM on April 9, 2009


Response by poster: Good God, qxntpqbbbqxl, I'm not going to do the graphics in the scripting language; we use libraries for that. I loves me my CPAN.

I think your basic approach is my best bet, yeah. Once I have a reasonable candidate, I suppose I could test another chunk of the page, rinse, and repeat. That should work.

What's 2009's best image manipulation library? I'm still just seeing GD and ImageMagick and, I guess, PIC in Python. This might be a PIC thing, actually. Although I'll probably use ImageMagick from the command line to pull the individual pages out first.
posted by Michael Roberts at 7:26 PM on April 9, 2009


There is commercial software that does template subtraction quite well, but it's not cheap (since it is typically bought by businesses that are processing many thousands of images). Depending on the complexity of the forms, it can also be a bit tricky to set up sometimes. Anyway, you could check out the Accusoft/Pegasus FormFix product. It has a form dropout feature that might work, although you would likely find it too expensive for just a few hundred pages. It appears there is a trial SDK that is "free", although I'm not sure what the license terms/restrictions are for the trial version.

There are also data entry service bureaus that often own software that will do this. I have no idea what their price would be, but I mention this just in case you haven't yet considered outsourcing as an option. Try googling various combinations of the terms data entry/image processing/forms processing/service bureau/outsourcing, and you might find someone who could take care of this for you.
posted by blue mustard at 7:27 PM on April 9, 2009


If you want to work this hard, you could check out some open source OCR software and possibly modify it so that it blanked out the pixels of characters it recognized (the printed text) leaving just the handwriting. A long shot, but...
posted by DarkForest at 7:29 PM on April 9, 2009


I've used ITK (Insight Toolkit) to align images in the past. It includes a number of algorithms for this problem, which is officially called "registration."

ITK is free but devilishly tricky to use. You'll likely need a C++ programmer to configure it for your needs. A python binding for the library is also available.
posted by miyabo at 7:56 PM on April 9, 2009


Heh. The reason I cautioned against the pixels-in-scripting language thing is because I wound up doing that once because some feature I wanted wasn't in Python Imaging Library, and it totally (predictably) killed the performance. So whatever library you use, make sure you've got the right operations before you commit to it :P
posted by qxntpqbbbqxl at 8:31 PM on April 9, 2009


Response by poster: Oy, s/PIC/PIL/g in my last comment. That's what I get for not going to bed at a reasonable hour.

ITK looks really cool -- and I am a C++ programmer (or was, once, before discovering that I really prefer working in Python or Perl... of course, I had to wait until they were invented). That might be an option. Except for the "devilishly tricky" part. Its Python binding seems pretty well-developed. So I think I'll try that.

DarkForest, I don't think that would be optimum -- I would be ignoring the advantage of having 18 copies of each page. It would be something interesting to try if I were only interested in finding handwritten markup on a typeset document, though. Kind of a neat concept.
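
To make the 18-copies advantage concrete, here is a minimal sketch, assuming the copies of a page are already aligned and each is held as a set of (x, y) black-pixel coordinates. A pixel that is black in most copies is treated as the printed "constant" part, and whatever is left in each copy is a candidate for handwriting. The voting rule, the function name `split_constant_and_handwriting`, and the `min_fraction` threshold are my own assumptions for illustration; nobody in the thread specified them, and real scans would need some tolerance for noise and small misalignments.

```python
from collections import Counter


def split_constant_and_handwriting(copies, min_fraction=0.5):
    """Split aligned copies into the shared printed part and per-copy leftovers.

    copies: list of sets of (x, y) black-pixel coordinates, already aligned.
    min_fraction: hypothetical threshold; a pixel counts as "constant" if it
    is black in at least this fraction of the copies.
    """
    # Count, for every pixel, how many copies have it black.
    counts = Counter(pixel for copy in copies for pixel in copy)
    needed = min_fraction * len(copies)
    constant = {pixel for pixel, n in counts.items() if n >= needed}
    # Whatever a copy has beyond the constant part is candidate handwriting.
    handwriting = [copy - constant for copy in copies]
    return constant, handwriting
```

With 18 copies, a stray mark has to show up in at least half of them before it is mistaken for printed text, which is the kind of robustness a single reference scan can't give you.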
posted by Michael Roberts at 6:41 AM on April 10, 2009


Response by poster: blue mustard, the Accusoft Pegasus is exactly what I'm talking about! Except for the, you know, $1499 development license and $399 user license. I'm still kind of gasping for air about that.

I want an open-source way to do this (and not a service -- the idea is to bring more open source into the translation industry). Accusoft is at least giving me a feature set and terminology to Google on, though, so thanks!
posted by Michael Roberts at 6:47 AM on April 10, 2009


If you have the paper copies, you could try scanning them with a scanner like a Fuji SnapScan, which claims to do the alignment and skew adjustment automatically, making subsequent processing far easier.
posted by James Scott-Brown at 10:28 AM on September 2, 2009

