Soundex for images of text
July 28, 2005 2:52 PM   Subscribe

Does anyone know of a fairly fast algorithm for generating a "hash" of images which behaves like soundex, in that similar images have similar values? I know that's not possible in general, but I'm particularly interested in text - so what I really want is something that says "boo" is more like "bon" than "box" (in the character set I am using to type this, at least). You can assume no background noise, good orientation etc.

The emphasis on images is important because the character set could be anything in unicode. In other words, I want something that says whether "mathowie" is "mathowie", no matter how much embedded unicode (including non-latin characters that look similar to latin ones).

(That's what made me think of asking this - but I'm asking out of curiousity, not because I want to push Matt towards implementing anything. I suspect it's a hard problem, but it also seems like something that people might have attempted before - signature recognition, for example)
posted by andrew cooke to Computers & Internet (17 answers total)
 
You'd have to isolate the letters, which wouldn't be too hard. Then I'd probably convert the letters into paths, sample the paths at intervals, overlay them, and see how closely they match up. If you do more than 4 samples, n should be closer to o than x. Now, if you take different fonts/etc into consideration, you're all but fucked, and your best bet is to try to use text recognition in some unicode way and then just use soundex on it.
posted by devilsbrigade at 3:07 PM on July 28, 2005


Response by poster: Different fonts don't matter, but for efficiency you'd need a single value that represents a particular word (user), since each new user you'd need something efficient to compare against.

Put another way - what you describe is good at checking one signature against a default signature on file, but not so good at characterising a signature so that it can be recognised from a large set (or at least the number of candidates from the set reduced).

See what I mean? It's almost there, but it's not a complete solution. Cheers.
posted by andrew cooke at 3:25 PM on July 28, 2005


Given good orientation, you might get decent results by looking at a 2D wavelet transform. The hash would be a quantized version of some subset of the wavelet coefficients, with most attention paid to the lower frequencies. A Haar wavelet basis would probably be a good choice for text.

You'd have to have pretty ideal data sets to pull it off though. In general, it seems like an OCR engine coupled with something like soundex would be more robust...
posted by Galvatron at 3:26 PM on July 28, 2005


I agree with devilsbrigade - break the problem down into OCR and soundex.

I've used a simple root-mean-square to gauge the distance between images before, but I wouldn't trust it on anything more than a toy problem.

Searching google scholar for "automatic signature verification" throws up thousands of results, many of which rely on "watching" the signature being written, which suggests that simply comparing images doesn't work. It looks like this is an area of active research, rather than one where you can just pull a good solution off the shelf.
posted by Leon at 3:30 PM on July 28, 2005


Just to second the above, yes, a wavelet transform will get you a lot of the way there. It's not easy math though. If Galvatron's explanation doesn't make much sense, good luck! :)
posted by wackybrit at 3:31 PM on July 28, 2005


Response by poster: ocr and soundex wouldn't work - the idea is to allow a variety of languages to co-exist without having similar *looking* words. so you'd need one heck of an ocr to recognise every "different looking" unicode character and then be left with something that often had no suitable english sound (if mathowie is greek it should still work).
posted by andrew cooke at 3:33 PM on July 28, 2005


Response by poster: yeah, i think wavelets might be good.
hmmm. i've been meaning to learn more about them... (even have a half read book on my shelf here).

anyway, thanks for the replies. will read more tomorrow morning, but now off to catch the overnight bus.
posted by andrew cooke at 3:35 PM on July 28, 2005


Response by poster: wavelets + pca?
posted by andrew cooke at 3:37 PM on July 28, 2005


If you actually get to the point of writing some code, QccPack has a bunch of research-oriented wavelet and quantization algorithms that you might find useful.

What's pca?
posted by Galvatron at 3:44 PM on July 28, 2005


Oh, right, principal component analysis. That could be effective, but you might also find that you can come up with relatively simple heuristics that perform well.
posted by Galvatron at 3:49 PM on July 28, 2005


I found this paper on using wavelet transforms to create an image search engine to be of significant use when I was learning about this stuff. It's reasonably usable even with no knowledge on the subject.
posted by wackybrit at 4:04 PM on July 28, 2005


The "shape context" approach used by Mori and Malik in Breaking a Visual CAPTCHA would appear to hold promise, particularly since the images you have in mind aren't specifically designed to be adversarial.
posted by RichardP at 4:10 PM on July 28, 2005


I'm not sure if I understand the question...

It sounds like you are looking for something analogous to a 1337 speak decoder, except using all of unicode.

So you are looking to set up a metric for the apparent similarity fo different codes. So 'e' has a similarity to 'e' of 1.0, but '1' has a similarity to 'l' of 0.9, and so on...

I guess you could use some transform method to form the metric. You could just take the RMS over the 2D correlation of each symbol with each other symbol and normalize. It is worth a try, but humans probably see similarity in different ways from the mathematical calculation.

One obvious problem is the one soundex appears to solve, similar sounding characters can look very different, but humans might 'see' them as similar anyway. Like greek-pi and latin-p for example, and the c-k or s-c problems, etc...

Of course you only ever have to do the correlations once, then you could probably sell the table for a lot of money, and spend the next ten years tweaking it based on subjective human testing.

I dunno, maybe I am missing something...
posted by Chuckles at 6:29 PM on July 28, 2005


If you know python (and want to look at the wavelet idea more) take a look at this.
posted by AmaAyeRrsOonN at 8:00 PM on July 28, 2005


Response by poster: thanks everyone. i will follow up some of these in my copious spare time :o)

chuckles - you've got the problem right, but it's a bit harder than you think because i want a single number for the whole word. what you describe would mean checking letter by letter against each existing user (in the example of finding out whether someone is registering a name that looks like someone already registered), and would be thrown by a narrow, insignificant letter used as "padding" (a leading space, say).
posted by andrew cooke at 3:55 AM on July 29, 2005


Response by poster: ooo imgseek is cool all by itself, relevant or not. thanks.
posted by andrew cooke at 3:56 AM on July 29, 2005


So I found a site describing the soundex algorithm:
  1. Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost blanks as needed during each procedure step.
  2. Retain the first letter of the word.
  3. Change all occurrence of the following letters to '0' (zero): 'A', E', 'I', 'O', 'U', 'H', 'W', 'Y'.
  4. Change letters from the following sets into the digit given:
    1 = 'B', 'F', 'P', 'V' 2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z' 3 = 'D','T' 4 = 'L' 5 = 'M','N' 6 = 'R'
  5. Remove all pairs of digits which occur beside each other from the string that resulted after step (4).
  6. Remove all zeros from the string that results from step 5.0 (placed there in step 3)
  7. Pad the string that resulted from step (6) with trailing zeros and return only the first four positions, which will be of the form .
What they have done is define a new alphabet that is minimal in some sense, you could say they have defined the basis set. Then they express every new word in terms of the new minimal alphabet (well, my analogy isn't perfect...). You should be able to do something similar, but appropriate for visual representations too...

You have to define your base set of characters, the ones you consider visually unique enough to be part of your minimal set. A certain subset of characters will project onto zero, in soundex they do that with all spaces and punctuation. Visually you might only do that with white space, or you might do it with '_' too... Then you have to express incoming letters in terms of your new basis set, this is where the correlation table I described above comes in.
posted by Chuckles at 10:27 AM on July 29, 2005


« Older Should I keep a box spring mattress?   |   Affordable and nice hotel in Manhattan Newer »
This thread is closed to new comments.