Fuzzy text completion algorithm?
May 10, 2008 9:44 AM
I'm looking for a fuzzy text-matching algorithm for an autocomplete widget.
I have a data set of ~4000 short strings (< 60 chars) that needs to be attached to an AJAXy autocomplete widget. I'm looking for a text completion algorithm that supports some amount of slop, to catch basic typos.
The environment is Prototype/LowPro on top of Ruby, backed by a MySQL database. I can pre-compute and store metaphones/soundex/&c. either in the database or in memory, as this list will remain mostly static throughout application lifetime.
The bad news is this feature is not mission-critical, so anything like compiling a user-defined SQL function (e.g. Levenshtein) is out of scope. The good news is this feature isn't mission-critical, so I am only solving for the quickest route to a reasonable set of suggestions for a given input string -- this doesn't need to be bulletproof.
What would you do, given these constraints?
I have a data set of ~4000 short strings (< 60 chars) that needs to be attached to an AJAXy autocomplete widget. I'm looking for a text completion algorithm that supports some amount of slop, to catch basic typos.
The environment is Prototype/LowPro on top of Ruby, backed by a MySQL database. I can pre-compute and store metaphones/soundex/&c. either in the database or in memory, as this list will remain mostly static throughout application lifetime.
The bad news is this feature is not mission-critical, so anything like compiling a user-defined SQL function (e.g. Levenshtein) is out of scope. The good news is this feature isn't mission-critical, so I am only solving for the quickest route to a reasonable set of suggestions for a given input string -- this doesn't need to be bulletproof.
What would you do, given these constraints?
Depends on what you're looking for. If it's names, it's a pretty well solved problem, with several brands of algorithms. Metaphone, Soundex, and further variations on the same idea.
I haven't seen algorithms for generic words, but a similar idea would probably work. I'd try soundex per word (ie. sentences of soundex codes of the words) and then just use the SQL LIKE command to search em out after encoding the user's input.
I know there is a ruby gem that has the soundex, metaphone, and variations algorithms in it, give it a try.
posted by cschneid at 11:34 AM on May 10, 2008
I haven't seen algorithms for generic words, but a similar idea would probably work. I'd try soundex per word (ie. sentences of soundex codes of the words) and then just use the SQL LIKE command to search em out after encoding the user's input.
I know there is a ruby gem that has the soundex, metaphone, and variations algorithms in it, give it a try.
posted by cschneid at 11:34 AM on May 10, 2008
Hmm, I was lazy and barely read your question, what's wrong with just using the approach you have already come up with? Can you add a column to your table and just store it? If not, then I think you just do it all in ram, it's not that big.
posted by cschneid at 11:36 AM on May 10, 2008
posted by cschneid at 11:36 AM on May 10, 2008
Heh... no snark taken, adipocere. This was an actual business scenario, and as such I couldn't devote an entire workday to pondering it -- so this is me revisiting a problem I found really interesting. Given the same parameters but unlimited think-time, what's the better/faster/smarter solution I had to forgo?
Let's say response time is at least as important as accuracy, if it helps define the problem -- Ruby is slow, MySQL has few built-in text processing functions. Basically I'm just curious to see how other people have approached this.
posted by sonofslim at 12:02 PM on May 10, 2008
Let's say response time is at least as important as accuracy, if it helps define the problem -- Ruby is slow, MySQL has few built-in text processing functions. Basically I'm just curious to see how other people have approached this.
posted by sonofslim at 12:02 PM on May 10, 2008
I'd try to use the programatic mode of ispell/aspell and a custom dictionary and see if that is good enough. Baring that I'm sure 100 people have written fuzzy search libraries for ruby, the better ones probably even do the heavy lifting in C/some other fast language. No need to reinvent the wheel.
posted by aspo at 4:43 PM on May 10, 2008
posted by aspo at 4:43 PM on May 10, 2008
I'm sure 100 people have written fuzzy search libraries for ruby
You'd think so, but if they're out there I haven't found them.
At this point I suspect a smartly-indexed table is the quickest path to relevant results -- something involving an n-gram solution tailored to the live data. It's a list of schools in the U.S., so a full 75% of these strings include 'College' or 'University' and there must be a way to account for that.
posted by sonofslim at 8:44 PM on May 10, 2008
You'd think so, but if they're out there I haven't found them.
At this point I suspect a smartly-indexed table is the quickest path to relevant results -- something involving an n-gram solution tailored to the live data. It's a list of schools in the U.S., so a full 75% of these strings include 'College' or 'University' and there must be a way to account for that.
posted by sonofslim at 8:44 PM on May 10, 2008
I have implemented this before using pre-computed Double Metaphone (not Metaphone) strings stored in a database and indexed. It worked very well. With Double Metaphone, you can determine whether words are a strong, medium, or weak match, which is useful for ranking the results (e.g., you want to display something to the user even when there is only a weak match.
I did it using a user-defined function in SQL Server, but you could do this in code just as easily.
There are some things you can do to make the results better - return the top N matches that are the exact same text first (case insensitive), followed by the double metaphone matches ranked by strong, medium, then weak, taking care to make sure you don't give any duplicate results.
posted by SNACKeR at 5:08 AM on May 11, 2008
I did it using a user-defined function in SQL Server, but you could do this in code just as easily.
There are some things you can do to make the results better - return the top N matches that are the exact same text first (case insensitive), followed by the double metaphone matches ranked by strong, medium, then weak, taking care to make sure you don't give any duplicate results.
posted by SNACKeR at 5:08 AM on May 11, 2008
After playing around, I've come up with 2 approaches. The first is a simple metaphone match, ordered by results that contain the actual search term (weighted by proximity to the start of the string) followed by near matches ordered by Levenshtein distance.
The second is a table containing unique 4-character n-grams and a join table associating those with individual records. I think 3 characters would give more accurate results, but this was just an experiment -- plus it took 3 hours to generate this table and it has over 99,000 rows as is.
Still, with proper indexing both solutions are plenty fast. As I expected, the first solution is better at solving for poor spelling, whereas the second solution is better for poor typing -- transpositions, omissions, and the like. I'll have to keep an eye on the actual usage and see which is more useful from an end-user perspective.
Thanks for everyone's help!
posted by sonofslim at 8:05 PM on May 11, 2008
The second is a table containing unique 4-character n-grams and a join table associating those with individual records. I think 3 characters would give more accurate results, but this was just an experiment -- plus it took 3 hours to generate this table and it has over 99,000 rows as is.
Still, with proper indexing both solutions are plenty fast. As I expected, the first solution is better at solving for poor spelling, whereas the second solution is better for poor typing -- transpositions, omissions, and the like. I'll have to keep an eye on the actual usage and see which is more useful from an end-user perspective.
Thanks for everyone's help!
posted by sonofslim at 8:05 PM on May 11, 2008
« Older Help me become smarter and have a more dynamic... | How do I buy the right laptop battery when all the... Newer »
This thread is closed to new comments.
I'd do to the user input the same thing you do to your dataset - lowercase both, take out spaces and punctuation ... Soundex on both ends, etc.
posted by adipocere at 11:26 AM on May 10, 2008