Building a search engine
July 26, 2004 4:58 AM

Building a search engine - how to cope with mis-spellings?

I maintain quite a few full-text index searches, and I'd love to make them cope better with mis-spellings. For example, someone looking for 'refrigerator' should still get results if they search for 'fridge' or 'refridgerator'. I'd also like this to work with place names, so that someone searching for 'Aberystwyth' still gets results if they spell it incorrectly. What's the best way to go about this? I've thought about using some sort of phonetic approach, but this seems to be overkill for my needs. The lists of common mis-spellings that Google has found for me also seem a bit inadequate for what I want. Any suggestions?
posted by BigCalm to Computers & Internet (9 answers total)
 
Take advantage of open source and use Aspell. While it won't fix your refrigerator/fridge problem, it'll do well for honest misspellings. If you're using PHP, there's a good built-in Aspell API called Pspell.

I also recommend doing what Google does: show a "Did you mean refrigerator?" prompt when someone searches for "refridgerator", instead of silently correcting it.
posted by zsazsa at 5:46 AM on July 26, 2004
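
A minimal sketch of the "Did you mean" idea, for illustration only. It does not use Aspell or Pspell; it swaps in Python's standard difflib module, and the VOCABULARY list, the did_you_mean name and the 0.75 cutoff are all assumptions made up for this example. In a real search the vocabulary would presumably come from the terms already in the index.

```python
import difflib

# Hypothetical vocabulary; in practice this might be the set of terms
# already present in the full-text index.
VOCABULARY = ["refrigerator", "freezer", "dishwasher", "Aberystwyth", "Aberdeen"]


def did_you_mean(query, vocabulary=VOCABULARY, cutoff=0.75):
    """Return a suggested term, or None if the query is known or nothing is close."""
    # Compare case-insensitively but return the term as it is stored.
    lowered = {term.lower(): term for term in vocabulary}
    if query.lower() in lowered:
        return None  # already a known term, nothing to suggest
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


suggestion = did_you_mean("refridgerator")
if suggestion is not None:
    print(f"Did you mean {suggestion}?")
```

As with Aspell, a similarity-based check like this handles honest misspellings but not abbreviations: did_you_mean("fridge") finds nothing close enough, so 'fridge' would still need an explicit synonym entry.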


i could have sworn i read a paper not that long ago that had a detailed explanation of efficient searching for mis-spelt words, but i can't find the reference on my mailing list. however, looking back through the archives i did find tim bray's notes, which might be useful.
posted by andrew cooke at 5:52 AM on July 26, 2004


ah, found the paper - nrgrep. however, it's more for searching than working with pre-built indices.
posted by andrew cooke at 5:57 AM on July 26, 2004


Look into phonetic algorithms like Soundex or Metaphone. They compute a hash for how a word "sounds" so that you can search for other words that have the same hash.

For example, you might have an SQL query like "SELECT * FROM table WHERE SOUNDEX(title) = SOUNDEX($search_term)". You may want to precompute the Soundex value for your indexed data, since I imagine a full-text search that computes Soundex on the fly would be slow on a larger site.

MySQL and PHP support soundex and if you're using Perl you can grab Text::Soundex from CPAN.
posted by revgeorge at 6:55 AM on July 26, 2004
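
To make the precomputing suggestion concrete, here is a rough sketch in Python rather than SQL. It implements the standard American Soundex rules; a database's built-in SOUNDEX() may differ in details such as code length. The build_index and lookup helpers and the sample terms are assumptions for illustration, not part of any library.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_CODES = {}
for _digit, _letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)


def soundex(word):
    """Return the 4-character Soundex code of `word` (ASCII letters only)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "HW":  # H and W do not separate equal codes; vowels do
            previous = digit
    return (code + "000")[:4]  # pad with zeros, truncate to 4 characters


def build_index(terms):
    """Precompute the Soundex code of every term once, up front."""
    index = defaultdict(list)
    for term in terms:
        index[soundex(term)].append(term)
    return index


def lookup(index, query):
    """Return the indexed terms whose code equals the query's code."""
    return index.get(soundex(query), [])


index = build_index(["Aberystwyth", "Aberdeen", "refrigerator"])
print(lookup(index, "Aberistwith"))
```

With these rules 'Aberistwith' and 'Aberystwyth' both code to A162, so the place-name lookup works. The question's other example does not: 'refrigerator' codes to R162 and 'refridgerator' to R163, so that lookup comes back empty.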


Response by poster: Soundex looks like the way to go I think. Thanks guys!
posted by BigCalm at 8:21 AM on July 26, 2004


I wouldn't go for pure soundex; you'll get waaay too many false positives. Soundex has its uses, but this isn't one of them, I'm afraid.
posted by fvw at 10:28 AM on July 26, 2004


I've heard soundex disparaged elsewhere as well, though this is the only place I can recall and reference for you right now.
posted by weston at 3:34 PM on July 26, 2004


You can easily check it out for yourself, just do "select soundex('wordofyourchoice');" in your favourite database. Or go to dict.org and try a few lookups using soundex.
posted by fvw at 2:27 AM on July 27, 2004


Response by poster: I've done some experimenting, and soundex is frankly rubbish, but metaphone is far more promising (though this has notable failures, but far fewer than soundex).

select soundex(field) is not available in all database servers (notably Informix, though it's very easy to add there), and I doubt that select metaphone(field) is available in any.
posted by BigCalm at 4:06 AM on July 27, 2004
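
One possible way around a missing database function is to compute the key in application code and store it in its own indexed column. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in (not Informix). The phonetic_key function is a deliberately crude placeholder, not Metaphone, and the places table and its columns are hypothetical. A real Metaphone implementation could be dropped in where phonetic_key is.

```python
import sqlite3


def phonetic_key(word):
    """Crude placeholder key (NOT Metaphone): the first letter plus the
    later consonants, with adjacent repeats collapsed."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    for c in letters[1:]:
        if c in "AEIOUYHW":
            continue  # drop vowels and the weak consonants H, W, Y
        if c != key[-1]:
            key += c
    return key


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("phonetic_key", 1, phonetic_key)

# Store the key next to the name so it is computed once per row, then indexed.
conn.execute("CREATE TABLE places (name TEXT, name_key TEXT)")
conn.execute("CREATE INDEX places_key ON places (name_key)")
for name in ["Aberystwyth", "Aberdeen", "Abergavenny"]:
    conn.execute("INSERT INTO places VALUES (?, phonetic_key(?))", (name, name))


def search(term):
    """Return names whose stored key equals the key of the search term."""
    rows = conn.execute(
        "SELECT name FROM places WHERE name_key = phonetic_key(?)", (term,)
    )
    return [row[0] for row in rows]


print(search("Aberistwith"))
```

Because the key is stored per row, the query compares against an indexed column instead of calling the function on every row at search time.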

