What distance measure should I be using for string similarity?
November 6, 2012 3:03 PM

TextMiningFilter: Help me find a distance measure that will account for the similarity between strings of different lengths.

Here's the elevator-speech version: I'm trying to cluster sentences from about twenty Elizabethan plays together based on how similar their grammatical structures are. To that end, I've compiled a database of the sentences from an XML corpus that has each word tagged with its part of speech. For instance, the sentence "Now Faustus, what wouldst thou have me do?" has the structure "av np pu q vm p vh p vd pu".

So far, so good. The problem is that since sentences are such flexible, modular things, there's no hard-and-fast way to assign a sentence to a particular category. What I've finally settled on is clustering sentences into categories by their similarity--most likely k-medoid clustering, since my original approach, hierarchical agglomerative clustering, was hugely time-consuming. (O(n^2) in the number of sentences, for those of you into that sort of thing.)

My problem arises when trying to compute similarities and/or distances between the sentences. I originally tried Levenshtein distance, but it seems to be skewing the results for short but structurally different sentences, even after I reduced the part-of-speech tags to single alphanumeric characters to eliminate noise from different-length tags. For instance, I'm getting "Fie, Publius, fie!" (POS tags "uh pu np pu uh pu", encoded as "TPJPTP") put in the same cluster as "Once more adieu!" ("av av uh pu", "AATP"), which shouldn't really be happening--but the edit distance between the two short sentences is so much smaller than their distances to the longer sentences that they're getting dropped into the same bucket.

I've started toying around with things like cosine similarity, and to that end have reduced my sentences to n-dimensional frequency-of-occurrence vectors for each POS tag...but I'm wondering if there's a better measure out there that I just haven't heard of. Can anyone point me in the right direction? Thanks in advance, and please let me know if you need more details--I tend to type these things out and leave things out.
posted by Mr. Bad Example to Science & Nature (5 answers total)
Perhaps looking at the edit distance as a percentage of the string length would be more fruitful?
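[A minimal sketch of one reading of this suggestion--normalize the raw Levenshtein distance by the length of the longer string, so short sentence pairs aren't automatically "closer" than long ones. Function names here are illustrative, not from any library.]

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    """Edit distance as a fraction of the longer string's length, in [0, 1]."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```

On the encoded examples from the question, `normalized_distance("TPJPTP", "AATP")` comes out as 4/6, i.e. about two-thirds of the shorter-looking raw distance of 4.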

(I should point out that this is completely new to me).

Levenshtein distance doesn't strike me as the right metric here. First, it doesn't take into account that not all changes to a sentence are equal. If you have two adjectives in front of a noun then adding a third one is a minor change, but adding that first adjective is quite a big one. Switching to a completely different part of speech is also a big deal.

Can you divide sentences up into simple, compound, complex, compound/complex? Perhaps computing the distances only on the clauses and not on the sentence as a whole?
posted by It's Never Lurgi at 3:41 PM on November 6, 2012

I assume you know this, but you can adjust the cost of different operations in edit-distance algorithms -- for instance, you might say that insertion and deletion are the cheapest, transposition is slightly more expensive, and substitution is much more expensive.
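[A sketch of what unequal operation costs might look like, using the optimal-string-alignment variant so adjacent transpositions get their own cost. The specific cost values are illustrative, not recommendations; note that if substitution costs more than an insertion plus a deletion, it will never be chosen.]

```python
INS = DEL = 1.0   # insertion / deletion: cheapest
SWAP = 1.5        # adjacent transposition: slightly more
SUB = 1.8         # substitution: most expensive (kept below INS + DEL)

def weighted_edit_distance(a, b):
    """Optimal string alignment distance with per-operation costs."""
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * DEL
    for j in range(1, len(b) + 1):
        d[0][j] = j * INS
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else SUB
            d[i][j] = min(d[i - 1][j] + DEL,
                          d[i][j - 1] + INS,
                          d[i - 1][j - 1] + cost)
            # adjacent tags swapped, e.g. "AB" vs "BA"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + SWAP)
    return d[-1][-1]
```

With these costs, swapping two adjacent tags ("AB" vs "BA") scores 1.5 rather than the 2.0 an insert-plus-delete would charge.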

In general, though, if you're not getting the results you want, you might want to go about the problem from the other direction. Like, get a list of a bunch of sentences, and classify them by hand (maybe write a quick program to show you two sentences and ask you to rate them 1-5 for similarity, save the value, then repeat), and then once you have a bunch of data, look at the sentences that are similar and see if you can articulate why you think they're similar, and code that up as a comparison algorithm.

Yet another approach is to change the granularity -- maybe you could count the number of each exact grammatical structure (this scene has 11 instances of "av np pu q vm p vh p vd pu") and compare the histograms of different texts.
posted by inkyz at 3:46 PM on November 6, 2012

Consider pairwise similarity on POS bigrams. You could go to 3- or 5-grams but I've always found that simple 2-grams give me the best results I could want. E.g. http://aclweb.org/anthology-new/P/P08/P08-2067.pdf
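[One simple way to cash this out--a sketch, with illustrative names: turn each tag sequence into its set of adjacent tag pairs, then compare sets with the Dice coefficient. Counts plus cosine would also work; this is the bare-set version.]

```python
def bigrams(tags):
    """Set of adjacent POS-tag pairs in a sentence."""
    return set(zip(tags, tags[1:]))

def dice(a, b):
    """Dice coefficient on the two sentences' bigram sets, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

s1 = "uh pu np pu uh pu".split()   # "Fie, Publius, fie!"
s2 = "av av uh pu".split()         # "Once more adieu!"
```

On the two problem sentences from the question, the only shared bigram is ("uh", "pu"), so `dice(s1, s2)` is 2/7 -- quite dissimilar, which is the behavior the asker wants.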
posted by xueexueg at 3:49 PM on November 6, 2012

One of the things you should consider is what you mean by similarity. That is, there are a bunch of candidate metrics that you could explore, but the right one(s) will reflect some underlying sameness or differentness that captures what similarity means to you.

If you go with frequency-based approaches (as you are doing with cosine similarity), you lose information about order. So, one question you need to ask yourself is: does order matter?

Levenshtein distance accounts for order (but I agree that it probably is the wrong metric for you), frequency vectors don't. Does a 'bag of words' approach satisfy you, or not?

n-gram-based approaches may be a better fit, and as xueexueg mentions, you can often get away with the (relatively) computationally inexpensive bigram approach.

One approach that could be worth exploring (I've done stuff like this for other graphs, not transition probability matrices, and this is just me riffing off the top of my head) would be to construct transition probability matrices for the sentences, then use a metric like Hamming distance or Jaccard similarity on the dichotomized matrix, or another difference score on the weighted matrix as the basis for a clustering approach. (Maybe there's something in the Markov model literature).
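[Riffing further on that idea, a sketch of what it might look like in code--a per-sentence transition probability "matrix" stored sparsely, with Jaccard similarity on the dichotomized version (which transitions occur at all). All names are illustrative.]

```python
from collections import Counter

def transition_probs(tags):
    """Sparse transition probability matrix for one tag sequence:
    maps (tag_i, tag_j) to P(tag_j | tag_i) within this sentence."""
    pair_counts = Counter(zip(tags, tags[1:]))
    out_counts = Counter(tags[:-1])  # how often each tag starts a transition
    return {pair: c / out_counts[pair[0]] for pair, c in pair_counts.items()}

def jaccard_binary(p, q):
    """Jaccard similarity on the dichotomized matrices: compares only
    which transitions occur, ignoring their probabilities."""
    a, b = set(p), set(q)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A weighted alternative would be a difference score over the probabilities themselves (e.g. summed absolute differences over the union of transitions) rather than dichotomizing first.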

The strength of this approach would be that it captures the size normalization of your cosine approach but still maintains the bigram dependencies of xueexueg's approach. (Although as I think more about it, this may just be restating a straightforward bigram comparison... I'd have to give it some more thought).

There is a substantial sequence analysis body of research, and perhaps dipping a toe into that literature would be helpful.
posted by i love cheese at 6:48 PM on November 6, 2012

I'm voting with xueexueg here. Use tag bigram frequencies as your features, and then do something like cosine similarity in that feature space.

(Cosine similarity is good for this sort of thing because it ignores distance from the origin. If your features are n-grams, then longer sentences are always further from the origin than short ones, and that can skew your results if you use something like Euclidean distance for similarity.)
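[A sketch of this suggested pipeline, with illustrative names: bigram counts as the feature vectors, cosine similarity as the metric. Because cosine divides by the vector magnitudes, sentence length mostly cancels out.]

```python
import math
from collections import Counter

def bigram_counts(tags):
    """Sparse feature vector: counts of adjacent POS-tag pairs."""
    return Counter(zip(tags, tags[1:]))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

s = "av np pu q vm p vh p vd pu".split()  # the Faustus example above
```

A sentence compared to itself scores 1.0 regardless of its length, which is exactly the origin-distance insensitivity described in the parenthetical.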
posted by nebulawindphone at 9:53 PM on November 6, 2012
