Large, approximate string matching algorithm
February 12, 2010 11:41 AM

I have two large (10k-100k item) sets of strings. I want to pair up the items across the two sets based on similarity. Levenshtein distance looks promising but costly.

I have a set of strings. Say:

kitten55
buicks
aspirin

And another (possibly of different length):

bucks
smitten45
aspire
ice cream

I want to pair them up based on similarity:

kitten55 - smitten45
buicks - bucks
aspire - aspirin
ice cream -

The Levenshtein distance algorithm seems like a good approach, but it runs in O(NM) time on the string lengths, and it would need to run across every possible pairing (well, not exactly, but still essentially O(N^2) comparisons). This seems awful, given the 10K-100K size of my input sets.

Is there a better way? Can I compile a string into a DFA of some sort suitable for approximate match scoring that I can reuse, rather than trying to evaluate each possible pairing? I'm open to other solutions as well.
posted by chairface to Technology (15 answers total) 4 users marked this as a favorite
 
How many times do you have to run it?

If you only need to run it once, would it take more time for you to build a "good" solution than it would to run the naive solution once?
posted by chrillsicka at 11:47 AM on February 12, 2010
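
For scale, the brute-force version is only a handful of lines. A rough sketch in Perl, assuming the two sets live in @set_a and @set_b (made-up names) and the CPAN module Text::Levenshtein is available:

use strict;
use warnings;
use Text::Levenshtein qw(distance);   # pure-Perl edit distance from CPAN

# Brute force: score every pairing and keep the closest candidate for each string.
for my $query (@set_a) {
    my ($best, $best_d);
    for my $cand (@set_b) {
        my $d = distance($query, $cand);
        ($best, $best_d) = ($cand, $d) if !defined $best_d || $d < $best_d;
    }
    printf "%s - %s (distance %d)\n", $query, $best, $best_d;
}

At 100K strings per side that's on the order of 10^10 distance calls, which is where the pain comes in.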


This sounds an awful lot like a spelling corrector. Perhaps you should take a look at some of the existing literature on them.
posted by brain at 11:51 AM on February 12, 2010


Check out this article: How to Write a Spelling Corrector
posted by brain at 11:52 AM on February 12, 2010


Depending on just how close the strings tend to be and their distribution in terms of length, you could save some time by sorting them by length and putting them into buckets. Then, for each input string you only have to compute the distance to each string in the nearby buckets. Just how far out you search depends on your data and how accurate you want to be (e.g., you could have a corner case where the closest string to "abcdefghijklmnopqrstuvwxyz" is actually "abcd", which would be missed if you only searched strings of length 16-36).
posted by jedicus at 12:19 PM on February 12, 2010
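
A minimal sketch of jedicus's length-bucketing idea, using the same assumed @set_a/@set_b and Text::Levenshtein as above; the window width is a made-up tuning knob:

use strict;
use warnings;
use Text::Levenshtein qw(distance);

# Bucket the second set by string length so each query only sees nearby lengths.
my %by_length;
push @{ $by_length{ length $_ } }, $_ for @set_b;

my $WINDOW = 5;   # how far from the query's length to look; tune for your data

for my $query (@set_a) {
    my ($best, $best_d);
    for my $len (length($query) - $WINDOW .. length($query) + $WINDOW) {
        for my $cand (@{ $by_length{$len} // [] }) {
            my $d = distance($query, $cand);
            ($best, $best_d) = ($cand, $d) if !defined $best_d || $d < $best_d;
        }
    }
    print "$query - ", $best // '(nothing within the length window)', "\n";
}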


Algorithms on Strings, Trees, and Sequences by Dan Gusfield has an algorithm for up to k-difference (mismatches + indels) inexact matching using a suffix tree. This runs in O(km) time, where I believe m is the size of your whole subject text. Of course you would have to multiply that by the number of strings you are looking for. Hopefully there is an even better method available.
posted by grouse at 12:31 PM on February 12, 2010


This is way-overkill, but Pairwise Document Similarity in Large Collections with MapReduce solves a much bigger version of your basic problem.
posted by xueexueg at 12:39 PM on February 12, 2010


Seconding/thirding writing a semi-naive implementation of this in a decently fast but somewhat friendly language. You could also settle for some kind of bounded or approximate answer rather than the exact best match for every string.
posted by tmcw at 12:59 PM on February 12, 2010


Response by poster: Excellent tips, thanks everyone. The matching will probably be run several times (maybe hundreds), but I suppose a slow approach is actually OK.

I have, of course, simplified the sample set for illustrative purposes. The real data will be file paths in a large project. Directories and files in that project will have release and date/time information:

/project/foo/build478234_20101002_1222/something/blah/build_jdoe/debug

I need to figure out what file that would match in a parallel tree, built under different circumstances. Its paths may look like:

/project/foo/build468244_I20090304_A/something/blah/build_qpublic/debug

Once I figure out that the two files are probably the same thing, I then need to compare those files. That's not the topic of this post though.

Heuristics would be how I would normally break this down but the size of these project trees is daunting. I may be going about it all wrong but it's still an interesting problem.
posted by chairface at 1:38 PM on February 12, 2010


From your example, I'm guessing that many (all?) of these files will have identical filenames, and will just be under different directories. In that case, you could create a hash by filename, store all of the full paths in a corresponding array, and then just run your distance metric on those. You reduce the number of comparisons dramatically that way.
posted by chrisamiller at 2:20 PM on February 12, 2010
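
A sketch of chrisamiller's grouping, assuming the paths live in @paths_a and @paths_b (made-up names); only paths sharing a basename ever get compared:

use strict;
use warnings;
use File::Basename qw(basename);
use Text::Levenshtein qw(distance);

# Group one tree's paths by filename so only same-named files get compared.
my %by_name;
push @{ $by_name{ basename($_) } }, $_ for @paths_b;

for my $path (@paths_a) {
    my $candidates = $by_name{ basename($path) }
        or next;   # no file of that name in the other tree
    # Closest candidate among the same-named paths (the sort recomputes distances;
    # fine for a sketch, track the minimum yourself if it matters).
    my ($best) = sort { distance($path, $a) <=> distance($path, $b) } @$candidates;
    print "$path -> $best\n";
}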


chrillsicka has given you wise advice, and I'll follow on from tmcw by adding another programming language to consider, Perl (the XS version of the Levenshtein module uses a compiled C core; the non-XS version is pure Perl).

I did some nosing around regarding longest common subsequence/substring, but it doesn't look like there's anything better than grouse's approach in complexity. But there are libraries and code examples there too, so you don't necessarily have to write everything from scratch.

Meanwhile, here are a couple variants on jedicus's bucketing scheme. (1) Get a letter-frequency table, and put each string in buckets corresponding to its three rarest characters (maybe four). Compare a string from your first set only to the strings in the second set with which it's indexed. Of course, strings without rare characters are still going to match pretty promiscuously, and there's the off chance that two strings would differ only in their three rarest characters. And, of course, this breaks if the distribution of characters is flat, but I've inferred from your examples that the strings are wordlike.

(2) Pull out some short substrings from each string (again, works better if they include rarer characters). From 'lemonade' we might grab 'lem' and 'mon'. Compile each into a regex that will match that subsequence or fail without (much) backtracking, i.e., l[^e]*e[^m]*m to match those letters in that order, possibly with insertions between, and unless there are a zillion lowercase Ls in the target string it will match or fail lightning-fast. If a potential pairing passes one of these simple, cheap tests you can take the time for a Levenshtein; if not, it wouldn't have gone well anyway, so forget about it. Tradeoff here is that all that if-then logic might take more time than a really optimized Levenshtein would in the first place.

On preview: If your example paths are typical, this just got a whole lot easier. You get big, juicy constant substrings by grabbing contiguous swaths of alphabetic characters; variation is either confined to numbers, or separated from constant bits by punctuation. The more of those that match, and the longer they are, the higher you score. It's still O(nm), but on number of alphabetic substrings rather than on length of string, and you can quickly match up constant strings with a hash table. Perl to follow.
posted by eritain at 2:28 PM on February 12, 2010
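
A rough sketch of the prefilter in (2), with the probe substring hard-coded rather than chosen by letter rarity; @set_b and Text::Levenshtein are the same assumptions as earlier:

use strict;
use warnings;
use Text::Levenshtein qw(distance);

# Build the cheap "these letters in this order, with gaps" regex from a short probe,
# e.g. 'lem' -> l[^e]*e[^m]*m, as described above.
sub probe_regex {
    my ($probe) = @_;
    my ($first, @rest) = split //, $probe;
    my $pat = quotemeta $first;
    $pat .= '[^' . quotemeta($_) . ']*' . quotemeta($_) for @rest;
    return qr/$pat/;
}

my $query = 'lemonade';
my $probe = probe_regex('lem');    # pick the probe by letter rarity in a real run
for my $cand (@set_b) {
    next unless $cand =~ $probe;   # cheap test failed: skip the expensive comparison
    my $d = distance($query, $cand);
    print "$query vs $cand: $d\n";
}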


Here's how I'd score a couple of paths against each other:
my %substringA;
my $score = 0;
$substringA{$_} = 1 for $pathA =~ /[a-z]+/g;                       # index pathA's alphabetic runs
$score += length for grep { $substringA{$_} } $pathB =~ /[a-z]+/g; # sum the lengths of runs shared with pathB
Explained, for non-Perlers in our studio audience: [a-z]+ grabs one or more contiguous lowercase letters. By default it will grab as many as possible. We grab a list of all such matches from $pathA and store each one as a key of the hash %substringA (with the value being 1). We grab out a comparable list from $pathB, and use grep to retain only the ones that are also found in %substringA. And we iterate over those to add their length to the score.

Now, you're going to need to wrap that in a couple of loops to step through the possible values of $pathA and $pathB. And you're going to want to store away the value of $score so you can consult it later. Up to you whether you keep all the values, or just keep track of the top score so far. But this is the gist of it.

You may also want to pre-calculate the lists of matches, rather than perpetually repeating that operation in your inner loop; but that's a memory/speed tradeoff and depends on your particular machine.

I do love Perl.
posted by eritain at 3:03 PM on February 12, 2010
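
One way that wrapping and pre-calculation might look, assuming @paths_a and @paths_b as before; the scoring lines are eritain's, just hoisted into a sub:

use strict;
use warnings;

# Pre-calculate each path's set of alphabetic substrings once,
# instead of re-running the regex inside the inner loop.
sub substring_set {
    my ($path) = @_;
    my %set;
    $set{$_} = 1 for $path =~ /[a-z]+/g;
    return \%set;
}

my %sets_a;
$sets_a{$_} = substring_set($_) for @paths_a;

for my $path_b (@paths_b) {
    my @subs_b = $path_b =~ /[a-z]+/g;
    my ($best, $best_score) = (undef, -1);
    for my $path_a (@paths_a) {
        my $set_a = $sets_a{$path_a};
        my $score = 0;
        $score += length for grep { $set_a->{$_} } @subs_b;
        ($best, $best_score) = ($path_a, $score) if $score > $best_score;
    }
    print "$path_b -> $best ($best_score)\n";
}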


Why not precluster the individual strings with themselves? Then only search within matching clusters.
posted by norabarnacl3 at 4:20 PM on February 12, 2010


The real data will be file paths in a large project

If you find a generic string-matching approach to your problem, then great. However, I would like to point out that your data has a lot more structure than generic strings, structure that you should probably exploit. Namely, whereas a string is merely a sequence of characters, your file paths are a sequence of sub-directories (which are themselves strings), each living inside a directory tree.
posted by mhum at 4:33 PM on February 12, 2010
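
One simple (and admittedly heuristic) way to use that structure: split on '/' and score component by component, so a path is a short sequence of directory names rather than one long string. A sketch, with path_distance() standing in wherever a raw string distance was used before:

use strict;
use warnings;
use List::Util qw(min);
use Text::Levenshtein qw(distance);

# Positional component-by-component score: identical components are free,
# differing ones cost their own edit distance, unmatched tails count in full.
# Assumes the two trees are roughly parallel, as in the example paths above.
sub path_distance {
    my ($p, $q) = @_;
    my @pc = grep { length } split m{/}, $p;
    my @qc = grep { length } split m{/}, $q;
    my $shared = min(scalar @pc, scalar @qc);
    my $d = 0;
    for my $i (0 .. $shared - 1) {
        $d += distance($pc[$i], $qc[$i]) if $pc[$i] ne $qc[$i];
    }
    $d += length for @pc[$shared .. $#pc], @qc[$shared .. $#qc];
    return $d;
}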


Using LD(a,b) to represent the Levenshtein distance between strings a,b then

for strings a,b,c I believe

if LD(a,b) = x and LD(a,c) = y then abs(x-y) <= LD(b,c) <= x+y (this is just the triangle inequality, since LD is a metric)

Using this, you should be able to get to an O(NlgN) [or maybe even O(N)] method by ordering your computations of LD(). For example, on your first iteration compute LD(a,b) for a selected a and all b's. Bin on this value (you will only need k bins, where k is the length of the longest string). For your next string c, look only at strings b which have LD(a,b) numerically close to LD(a,c). So you start with all of the strings in the same bin as c and then work outward until the difference between bin LD values rules out any closer match. Each iteration gains you more information for approximating LD() on the remaining strings and thus narrows your search range further.
posted by NoDef at 4:52 PM on February 12, 2010 [1 favorite]
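
A sketch of using that bound to prune, with the same assumed @set_a/@set_b and Text::Levenshtein: pick one pivot string, precompute every candidate's distance to it, and skip any candidate whose pivot distance already rules it out.

use strict;
use warnings;
use Text::Levenshtein qw(distance);

# One pivot, chosen arbitrarily; its distance to every candidate is computed once.
my $pivot = $set_b[0];
my %to_pivot;
$to_pivot{$_} = distance($pivot, $_) for @set_b;

for my $query (@set_a) {
    my $q_to_pivot = distance($query, $pivot);
    my ($best, $best_d);
    for my $cand (@set_b) {
        # Triangle inequality: distance($query, $cand) >= |q_to_pivot - to_pivot{cand}|,
        # so if that lower bound can't beat the best so far, skip the real computation.
        next if defined $best_d && abs($q_to_pivot - $to_pivot{$cand}) >= $best_d;
        my $d = distance($query, $cand);
        ($best, $best_d) = ($cand, $d) if !defined $best_d || $d < $best_d;
    }
    print "$query - $best\n";
}

More pivots (or sorting the candidates by pivot distance, as described above) prune harder; a single pivot is just the smallest version of the idea.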


seems like you might have some info (structure/constraints) that you're not making use of. If it's directory trees, there'll be many strings with the same prefixes, and if it's build trees, they'll have a similar shape. You might look at "subtree matching". You say you want to pair paths, but are you really interested in the paths, or the values at the leaves, the files?
posted by at at 10:41 PM on February 12, 2010

