Large, approximate string matching algorithm
February 12, 2010 11:41 AM Subscribe
I have two large (10k-100k) sets of strings. I want to match the items in the sets up in pairs based on similarity. Levenshtein looks promising but costly.
I have a set of strings. Say:
kitten55
buicks
aspirin
And another (possibly of different length):
bucks
smitten45
aspire
ice cream
I want to pair them up based on similarity:
kitten55 - smitten45
buicks - bucks
aspire - aspirin
ice cream -
The Levenshtein distance algorithm seems like a good approach but is O(NM) on string lengths in run time, and would need to run across every possible pairing (well, not exactly but still essentailly O(N^2)). This seems awful, given the 10K-100K size of my input sets.
Is there a better way? Can I compile a string into a DFA of some sort suitable for approximate match scoring that I can reuse, rather than trying to evaluate each possible pairing? I'm open to other solutions as well.
posted by chairface to technology (15 answers total) 4 users marked this as a favorite
If you only need to run it once, would it take more time for you to build a "good" solution than it would to run the naive solution once?
posted by chrillsicka at 11:47 AM on February 12, 2010