Fuzzy logic search to find international characters.
April 13, 2006 2:57 AM   Subscribe

I've got a database of names, many of which have international characters (e-acute, c-cedilla, o-umlaut, etc). I want the search routine to be clever enough that if I search for "Celik" it'll find c-cedilla-elik, even though "c" and "c-cedilla" are entirely different.

Does a look-up table exist that matches whole range of such non-English letters with their nearest-looking English equivalents? Or can anyone here help me construct one?

I'm thinking o and u umluat, c and s cedilla, o circumflex, Turkish g and undotted-i, Scandinavian o with a line through it, Spanish n, e with a grave and acute, accented a, the dipthongs.

Any more for any more?
posted by Pericles to Computers & Internet (16 answers total) 1 user marked this as a favorite
 
Perhaps someone will have a better or pre-made solution. But one thing you could do is use the Unicode Character Database, which will allow you to decompose, e.g. U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA into U+0043 LATIN CAPITAL LETTER C, U+0327 COMBINING CEDILLA.
posted by grouse at 3:07 AM on April 13, 2006


Don't forget a-umluat.
posted by Goofyy at 4:11 AM on April 13, 2006


At least for MySQL, you can simply use the LIKE operator,
e.g. SELECT 'áçñ' LIKE 'acn' returns 1, but SELECT 'å' LIKE 'a' returns 0, so it doesn't work in all cases, but it might be good enough for you.
posted by Sharcho at 4:17 AM on April 13, 2006


Are you using MySQL?

Switch to a sensible DB, e.g. PostgreSQL.

or

Convert your (VAR)CHAR/TEXT columns to BLOB. I expect it won't try and be clever about things then.
posted by public at 4:39 AM on April 13, 2006


Convert your (VAR)CHAR/TEXT columns to BLOB. I expect it won't try and be clever about things then.

I don't understand this response. The OP wants more cleverness, not less.
posted by grouse at 5:01 AM on April 13, 2006


It's an Oracle database.
posted by Pericles at 5:07 AM on April 13, 2006


Try DevonThink Pro. I don’t know how extensive the “character equivocation” (my term) can get, but it has a stunningly-intelligent fuzzy logic search and can, from my experience, ignore/‘equivocate’ umlauts.
posted by Yeomans at 5:19 AM on April 13, 2006


You might want to check out the levenshtein distance algorithm. It calculates the "distance" between two strings (the number of edits it would take to transform 1 string into another). In your case, although "c" and "c-cedilla" are different characters, the levenshtein distance between them would be 1 (so you would find it if you searched for all matches within a certain range). Here is an implementation in Oracle's PL/SQL.
posted by tperrigo at 6:10 AM on April 13, 2006


('Course, the levenshtein distance between "C" and "Q" or "j" or "!" is also 1. Pericles wants "Celik" to match "Çelik" but not "Qelik" or "jelik" or "!elik".)
posted by nebulawindphone at 6:15 AM on April 13, 2006


I'm pretty sure a dumb look-up table would do it. User types in Celik, search routine goes to table and sees ç next to "c" and so also searches for that.
posted by Pericles at 6:34 AM on April 13, 2006


To tag along, I have this problem in my iTunes library. Any ideas for solving that problem (other than chaning the names of tons of bands and songs, of course)?
posted by zpousman at 6:42 AM on April 13, 2006


Depending on how you're setting things up, one way to do this is to store a hidden field that has the ASCII-fied version of your accented strings, and whenever someone does a search, run their query string through the same ASCII-fier before doing the search.

Obviously this would get clunky if you're dealing with long blocks of text, but I did use something like this technique when there were only a few searchable fields. The overhead was low and it worked well.

I did a little googling and ran across the following links that may be of use to you:
1. Linguistic sorting & string searching. This is specifically about Oracle.
2. Multilingual databasing techniques. This discusses my ASCII-fied technique, along with a clever variation on it.
posted by adamrice at 7:15 AM on April 13, 2006


Thanks Adam Rice. I did do a little googling (honest!) but wasn't clever enough on my search terms, evidently ...
posted by Pericles at 8:12 AM on April 13, 2006


This may be overkill for what you're looking for, but IBM's ICU library (in Java and C++), has transliteration abilities of the type you're looking for. I haven't used it myself (yet anyway), but from the examples they have online for my own target language (Greek), the features are pretty impressive.
posted by costas at 8:52 AM on April 13, 2006


If you do run a search against an ASCII-only version, be sure you aren't making it more difficult for users who already know how to type international characters to find exactly what they're looking for.

That is, if you search for Jacob, and find Jacöb, great, but if you search for Jacöb and get no results, that's bad.
posted by odinsdream at 10:15 AM on April 13, 2006


have you thought about using soundex or metaphone algorithms for this? they generate a key for the string based on how the string sounds. (both those algorithms were originally intended for use with English but I imagine you could or someone has adapted them for other languages and such.)
posted by mrg at 11:15 AM on April 13, 2006


« Older How to distribute access to a shared television?   |   How to make it mobile? Newer »
This thread is closed to new comments.