How to use KL-divergence for distributions with some non-overlapping elements? January 11, 2010 4:10 PM

How do I find the divergence between two distributions when there are elements in distributionP that are not in distributionQ, and vice versa? I would typically use KL-divergence, but if I apply it directly and just disregard the divide-by-zero terms, I can get negative values (which should be impossible).

For example, if distributionP is built from {'t', 'e', 's', 't', '1'} and distributionQ from {'t', 'e', 's', 't'}, then calculating the KL-divergence while ignoring the '1' results in a negative value, according to my calculations.
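For concreteness, here is a minimal sketch of the naive computation the asker describes (the function and variable names are my own, not from any library). Skipping the symbol that Q lacks does indeed produce a negative number for this example:

```python
import math
from collections import Counter

def naive_kl(p_counts, q_counts):
    """Naive KL(P||Q) that silently skips symbols missing from Q.
    This is exactly the shortcut that can yield a negative result."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    kl = 0.0
    for sym, c in p_counts.items():
        if sym not in q_counts:
            continue  # the problematic "disregard the divide-by-zero" step
        p = c / p_total
        q = q_counts[sym] / q_total
        kl += p * math.log(p / q)
    return kl

p = Counter('test1')  # {'t': 2, 'e': 1, 's': 1, '1': 1}
q = Counter('test')   # {'t': 2, 'e': 1, 's': 1}
print(naive_kl(p, q))  # 0.8 * log(0.8), which is negative
```

The three shared symbols each have slightly less probability under P than under Q (because P's mass is spread over an extra symbol), so every surviving term of the sum is negative and nothing compensates for the dropped '1'.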

If there are elements in distP that are not in distQ, but every element of distQ is in distP, then KL(Q,P) is finite, though KL(P,Q) is not. Maybe that can help you. (Note that when you're implementing it, you have to be a little careful to avoid a 0 * log(0) computation that could give you a NaN floating-point number; mathematically, however, you are in the clear, since that term is taken to be 0.)
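A sketch of what that careful handling might look like (my own illustration, not a standard library routine): terms with p = 0 contribute nothing by convention, and a symbol with p > 0 but q = 0 makes the divergence infinite rather than negative.

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts):
    """KL(P||Q) with the 0 * log(0) = 0 convention; returns inf
    when P puts mass on a symbol where Q has none."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    kl = 0.0
    for sym, c in p_counts.items():
        p = c / p_total
        if p == 0.0:
            continue              # 0 * log(0 / q) = 0 by convention
        q = q_counts.get(sym, 0) / q_total
        if q == 0.0:
            return math.inf       # p > 0 but q = 0: divergence is infinite
        kl += p * math.log(p / q)
    return kl

p = Counter('test1')
q = Counter('test')
print(kl_divergence(q, p))  # finite: Q's support is inside P's
print(kl_divergence(p, q))  # inf: '1' is in P but not in Q
```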

That said, what are you actually trying to do? You say 'the divergence', but there are many different divergences, such as total variation, TV(P,Q) = 1/2 * sum_x |P(x) - Q(x)|, or the Hellinger distance. And there are plenty of other functions you could write down. Without knowing what properties you want from your divergence, it's difficult to say which one is appropriate. posted by bsdfish at 4:32 PM on January 11, 2010
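Total variation sidesteps the support problem entirely: a symbol missing from one distribution simply contributes its full probability mass to the sum, so the result is always finite and between 0 and 1. A minimal sketch (again, the names are my own):

```python
from collections import Counter

def total_variation(p_counts, q_counts):
    """TV(P,Q) = 1/2 * sum_x |P(x) - Q(x)|, summed over the union
    of both supports; always finite and in [0, 1]."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    support = set(p_counts) | set(q_counts)
    return 0.5 * sum(
        abs(p_counts.get(sym, 0) / p_total - q_counts.get(sym, 0) / q_total)
        for sym in support
    )

print(total_variation(Counter('test1'), Counter('test')))  # ≈ 0.2
```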

I'm trying to capture character overlap between words, but edit distance is not exactly what I want. I changed my metric to TV(P, Q). Thanks so much! posted by tasty at 4:56 PM on January 11, 2010

You are trying to capture the overlap between words treated as bags of symbols rather than strings? That is, the order within the word is not important? How about the bag distance metric?
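The bag distance grouse mentions can be sketched in a few lines with multisets (a hypothetical illustration, not a particular library's API): it is the larger of the two multiset differences, and serves as a cheap lower bound on edit distance that ignores character order.

```python
from collections import Counter

def bag_distance(a, b):
    """Bag distance between two strings: the max of the sizes of
    the two multiset differences. Ignores character order and is
    a lower bound on edit distance."""
    bag_a, bag_b = Counter(a), Counter(b)
    extra_a = sum((bag_a - bag_b).values())  # chars of a missing from b
    extra_b = sum((bag_b - bag_a).values())  # chars of b missing from a
    return max(extra_a, extra_b)

print(bag_distance('test1', 'test'))     # 1: only '1' differs
print(bag_distance('listen', 'silent'))  # 0: same bag of characters
```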

Without knowing anything else about your problem, treating words as discrete probability distributions in the way you are doing seems bizarre. posted by grouse at 5:15 PM on January 11, 2010

