: Help me find a distance measure that will account for the similarity between strings of different lengths.
Here's the elevator-speech version: I'm trying to cluster sentences from about twenty Elizabethan plays together based on how similar their grammatical structures are. To that end, I've compiled a database of the sentences from an XML corpus that has each word tagged with its part of speech. For instance, the sentence "Now Faustus, what wouldst thou have me do?" has the structure "av np pu q vm p vh p vd pu".
So far, so good. The problem is that since sentences are such flexible, modular things, there's no hard-and-fast way to assign a sentence into a particular category. What I've finally settled on is clustering to assign sentences to categories by their similarity--most likely k-medoid clustering, since my original approach, hierarchical agglomerative clustering, was hugely time-consuming. (On the order of O of n^2, for those of you into that sort of thing.)
My problem arises when trying to compute similarities and/or distances between the sentences. I originally was trying Levenshtein distance
, but it seems to be skewing the results for short but structurally-different sentences, even after I reduced the part-of-speech tags to single alphanumeric characters to eliminate noise from different-length tags. For instance, I'm getting "Fie, Publius, fie!" (POS tags "uh pu np pu uh pu", encoded as "TPJPTP") put in the same cluster as "Once more adieu!" ("av av uh pu", "AATP"), which shouldn't really be happening--but the edit distance is so much smaller between them and the longer sentences that they're getting dropped into the same bucket.
I've started toying around with things like cosine similarity, and to that end have reduced my sentences to n-dimensional frequency-of-occurrence vectors for each POS tag...but I'm wondering if there's a better measure out there that I just haven't heard of. Can anyone point me in the right direction? Thanks in advance, and please let me know if you need more details--I tend to type these things out and leave out things.