Describing an algorithm in stats speak.
December 16, 2005 11:19 AM   Subscribe

Any statistics wizards around? I have an algorithm I need to describe in the language of stats, and all the books are full of words and symbols I don't understand.

Here's what my program does. I've got two lists of variables with integer values. There could be any arbitrary number of variables of any name.

Example:
List 1: cat=3, dog=10, rock=9
List 2: cat=4, fish=3, rock=2

Now I want to calculate the degree of similarity between the two lists. So I came up with this.

Total (all variables added together)=3+10+9+3+3+2=30
Overlap total (overlapping variables added together)=3+4+9+2=18

Divide=18/30=0.6

So list1 and list2 have a 0.6 (60%) degree of overlap.

If the lists are identical, the above will give a 1.0. If they have no overlap, it's a 0.0. Anything in between will give a degree.
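
A rough Python sketch of the calculation above, with dicts standing in for the lists (the function name is just illustrative):

```python
def overlap_score(a, b):
    """Sum of the values attached to shared keys (taken from both
    lists), divided by the sum of all values in both lists."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = set(a) & set(b)
    overlap = sum(a[k] for k in shared) + sum(b[k] for k in shared)
    return overlap / total

print(overlap_score({"cat": 3}, {"cat": 3}))  # identical lists -> 1.0
print(overlap_score({"cat": 3}, {"dog": 3}))  # no overlap -> 0.0
```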

The question is, what am I doing here from a statistics perspective? Is this even a good test?

It's been a while since I took this type of thing in school, and I haven't found a low-brow enough book to get me up to speed.
posted by Leonard to Science & Nature (23 answers total)
 
I barely remember any statistics but I think you might need something like a t-test, which tells you if the means of two groups are statistically similar.
posted by EiderDuck at 11:27 AM on December 16, 2005


Response by poster: Ah ha, yup, and according to the wikipedia the alternative to the t-test for independent samples is the Mann-Whitney U test, which looks almost identical to what I'm doing and is related to some p concept used for studies of categorization, which is also what I'm doing. Excellent. All hail askmefi and the wikipedia.
posted by Leonard at 11:42 AM on December 16, 2005


Been a while since I've done much stats, but I don't think this is a very good test.

Consider these lists:
List 1: cat=10, dog=1
List 2: cat=1, dog=10

Total = 22
Overlap = 22

22/22 = 1

Then again, maybe I misunderstand what you're trying to test.
posted by sanko at 11:44 AM on December 16, 2005


If it does not matter for your purposes that

List1: var1=1, var2=1 ... var300=1
and
List2: var1=300

will be considered identical then it is a good test for what you're doing.

If not, a t-test might be a good idea, but I think you can only use it when your data sets are at least approximately normally distributed.
posted by snownoid at 11:47 AM on December 16, 2005


Response by poster: Hmm. Good points. I'm not sure if this matters. What I'm trying to do is compare lists of tags on del.icio.us for a social-science-related project. Two users have lists of tags; I want to measure the degree of similarity between them in some useful way.
posted by Leonard at 11:53 AM on December 16, 2005


This page contains a list of many similarity metrics and an open source java library that implements them.
posted by EiderDuck at 12:12 PM on December 16, 2005


Response by poster: Okay, yes, this is a bad test. What I appear to need is some form of non-parametric statistics?
posted by Leonard at 12:13 PM on December 16, 2005


Assigning integers as you've done won't make sense. The values you get for similarity will depend overwhelmingly on the values you assign to different words. If they don't share FOO and BAR, but you've assigned them values of 1 and 2, then two people will look very similar. If you assign FOO and BAR values of 1782983 and 9908450982, they will look very dissimilar.

Quick and dirty:

For any two users, you can just count the number of tags they share in common over the number of total tags between the two. This is similar to what you're doing but without the integer stuff.

A better thing to do would be to employ some sort of scaling or factor analysis on the lists and reduce the data down to something more manageable. I know people have done work along those lines with word choices, but I don't know much about it. In the end, this would tell you that List A and List B were similar on Scale 1 but not on Scale 2, or something similar.

I imagine that you could create one row for every person you're looking at, and then one column for every word that appears in anyone's tag list. Then you could treat each person-word as a dummy variable -- 1 if they used it, 0 otherwise. This would presumably be a very large dataset. Scaling / factor analysis / other data reduction would be easier since it would just be over a field of dummies. But again, I haven't read about statistical analysis of word choice except to know that it's out there.
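
The person-by-word dummy matrix described above might be built like this (a sketch in Python; the function name and layout are just one way to do it):

```python
def dummy_matrix(tag_lists):
    """One row per user, one 0/1 column per tag that appears in
    anyone's list: 1 if that user used the tag, 0 otherwise."""
    vocab = sorted(set().union(*tag_lists))
    rows = [[1 if tag in tags else 0 for tag in vocab]
            for tags in tag_lists]
    return vocab, rows

vocab, rows = dummy_matrix([{"cat", "dog", "rock"},
                            {"cat", "fish", "rock"}])
print(vocab)  # ['cat', 'dog', 'fish', 'rock']
print(rows)   # [[1, 1, 0, 1], [1, 0, 1, 1]]
```

With real del.icio.us data this matrix would be large and sparse, but it is exactly the shape that scaling or factor-analysis tools expect.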
posted by ROU_Xenophobe at 12:14 PM on December 16, 2005


For the tags you could maybe use something similar to the edit distance, i.e. determine how many insertions, deletions or substitutions of tags are needed to transform user 1's list into user 2's list.
That doesn't really help with the tag counts, though.
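
A minimal sketch of that idea: classic Levenshtein distance applied to lists of tags rather than characters. (This treats the lists as ordered sequences, which is an assumption; since tag lists are really unordered, you'd probably sort both lists first.)

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn seq_a into seq_b."""
    m, n = len(seq_a), len(seq_b)
    prev = list(range(n + 1))  # distances from the empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution (or match)
        prev = cur
    return prev[n]

print(edit_distance(["cat", "dog", "rock"],
                    ["cat", "fish", "rock"]))  # 1 (dog -> fish)
```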
posted by snownoid at 12:18 PM on December 16, 2005


Response by poster: Forget I said integers, my bad. I just meant that people couldn't use a tag a fractional amount of times. :)
posted by Leonard at 12:20 PM on December 16, 2005


You can use a Wilcoxon signed-rank test, a paired t-test or a Fisher sign test as meaningful statistics, depending on whether magnitude and/or sign are important. What you would need to do is consider how to handle the unscored variables. One way is to assign them the median of the other scores; another is to not compare non-overlapping variables.

There are other things to consider though. For example, do people assign scores the same way? Some people might only give low scores while others only high scores. So you might want to normalize the scores. You might also want to only consider comparing people with sufficient overlap who have assigned scores to enough variables.
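
One simple way to do the normalization suggested above is to rescale each user's scores to the 0-1 range before comparing (a sketch; min-max scaling is just one of several reasonable choices):

```python
def normalize(scores):
    """Rescale one user's scores to [0, 1] so habitual low-scorers
    and habitual high-scorers become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}  # all scores equal
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

print(normalize({"cat": 2, "dog": 4, "rock": 6}))
# {'cat': 0.0, 'dog': 0.5, 'rock': 1.0}
```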
posted by blueyellow at 12:22 PM on December 16, 2005


You might want an algorithm that gives you a metric: a measure of distance between two lists. In engineering circles, a lot of people would use mean squared error (MSE) for this purpose (square the differences between all of the list elements, and take the average).

For your example, I'd start by adding zeros to the lists for missing elements:
List 1: cat=3, dog=10, fish=0, rock=9
List 2: cat=4, dog=0, fish=3, rock=2
Then the MSE is ((3-4)^2 + (10-0)^2 + (0-3)^2 + (9-2)^2)/4 = 39.75.

Two matching lists would have MSE of 0.
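
The same calculation as a Python sketch (missing tags are treated as zero by taking the union of keys; the function name is illustrative):

```python
def mse(a, b):
    """Mean squared error over the union of keys, with missing
    entries counted as zero, as in the worked example above."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys) / len(keys)

list1 = {"cat": 3, "dog": 10, "rock": 9}
list2 = {"cat": 4, "fish": 3, "rock": 2}
print(mse(list1, list2))  # 39.75
```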
posted by Galvatron at 12:24 PM on December 16, 2005


Response by poster: Good point, blueyellow. I may have to rethink this whole deal. The number of times a person has used a tag probably isn't really a good weight. Whether a tag is present says something; how often it's used might not be relevant for comparison.

The MSE test is interesting, does about the right thing, but I'm not sure how to interpret it. It gives a 0 for no similarity as well as 100%.
posted by Leonard at 12:45 PM on December 16, 2005


It gives a 0 for no similarity as well as 100%.

I don't see that. Can you give an example where two lists with no similarity have MSE of 0?
posted by Galvatron at 12:54 PM on December 16, 2005


Response by poster: Sorry, my bad again. It gives zero when both users have no tags, which is correct.
posted by Leonard at 1:02 PM on December 16, 2005


Response by poster: Unscored values are indeed the problem. If you assign them a value of zero, assuming I haven't screwed this up again:

list1: v1=1 v2=0 v3=0
list2: v1=0 v2=1 v3=1

There is no overlap, but MSE (1+1+1)/3=0.3
posted by Leonard at 1:23 PM on December 16, 2005


Response by poster: Yeah, I screwed that up again. My kingdom for an edit feature.
(1+1+1)/3 is of course 1.
posted by Leonard at 1:24 PM on December 16, 2005


Blueyellow's correct. In this case, given nonparametric paired data, the Wilcoxon's definitely the right test.

God knows what you're trying to do with your sums in your "more inside," but even ignoring the fact that you typed a 3 where you meant a 4, those numbers aren't going to get you where you need to be.

I can't recommend Altman's "Practical Statistics for Medical Research" highly enough. If a doctor can understand it, anyone can understand it; and I found it to be one of the best and most lucid textbooks on any subject that I've ever read.
posted by ikkyu2 at 1:38 PM on December 16, 2005


Response by poster: To conclude: thanks everyone, this has been so much more useful than the morning I spent going over 'stats for social scientists' books. It looks like I should use an algorithm that only looks for the presence of tags, not the amount. So I can just count the number of overlapping tags and voila. No stats. :)
posted by Leonard at 1:41 PM on December 16, 2005


It seems to me you really want a pairwise similarity coefficient, such as Jaccard's, which will look at your list as a series of yes/no answers to the questions cat? dog? horse?

It will then add up all the matches to create a measure of similarity. The key here is: do you want to count cat=cat on the two lists the same as no cat = no cat on the two lists? Is a shared absence an indicator of real similarity? If you really only have two lists then this concern may be trivial; as soon as you have three or more it becomes a real concern.
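
The shared-absence question is exactly the difference between Jaccard's coefficient and the simple matching coefficient. A sketch (function names are illustrative):

```python
def jaccard(a, b):
    """Jaccard coefficient: shared tags over the union of tags.
    Shared absences don't count."""
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

def simple_matching(a, b, vocab):
    """Simple matching coefficient over a fixed vocabulary:
    a tag absent from both lists counts as a match too."""
    agree = sum((t in a) == (t in b) for t in vocab)
    return agree / len(vocab)

A = {"cat", "dog"}
B = {"cat", "horse"}
vocab = {"cat", "dog", "horse", "fish"}   # fish used by neither
print(jaccard(A, B))                      # 1 shared / 3 in union
print(simple_matching(A, B, vocab))       # cat match + fish shared absence = 2/4
```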

What you are proposing to do is like a very simple case of a cluster analysis, or the first step of one, which always involves the creation of a similarity matrix of some description.

Seconding the "you can't use a t-test" point on this kind of non-parametric data; your very simple difference-of-means test *may* have some meaning if you indeed only have two lists. I am pretty sure Mann-Whitney U requires *ranked* ordinal values, so unless a dog is worth twice a cat, it's not so applicable.

OK, I just saw your clarification. Use cluster analysis to create groups of users based on the similarity of their tags. It will compare each user to each other user, calculate a similarity between those pairs of users, then create a dendrogram representing the structure of heterogeneity within that array of similarities.
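
The first step of that cluster analysis is the pairwise similarity matrix. A sketch of what it might look like, using Jaccard similarity on tag sets (names and data are made up):

```python
def similarity_matrix(users):
    """Pairwise Jaccard similarities between every pair of users,
    keyed by (user, user) tuples -- the input to a clustering step."""
    def jac(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0
    return {(u, v): jac(users[u], users[v])
            for u in users for v in users}

users = {"ann": {"cat", "dog"}, "bob": {"cat", "fish"}}
m = similarity_matrix(users)
print(m[("ann", "bob")])  # share 'cat' out of 3 distinct tags -> 1/3
print(m[("ann", "ann")])  # every user is identical to itself -> 1.0
```

A hierarchical clustering routine would then merge the most similar pairs first to build the dendrogram.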
posted by Rumple at 3:09 PM on December 16, 2005



ABoverlap = 0;

for each of user A's tags {
    for each of user B's tags {
        if (user A's tag == user B's tag)
            increment ABoverlap;
    }
}

output ABoverlap / (# of A's tags * # of B's tags)

?
posted by sergeant sandwich at 4:27 PM on December 16, 2005


anyway, am i missing the point?
posted by sergeant sandwich at 4:29 PM on December 16, 2005


Are you making it too complicated? Example: A has 20 tags ; B has 15 tags; 6 of those tags are identical between A and B. The similarity is 6/(15+20)? [actually, there are 15+20-6 unique tags, so it should be 6/(15+20-6)]? Is it that simple?

My algorithm would be:
Combine both lists of tags into one list - the total number is N. Sort it alphabetically. Count the number of duplicates in the list - this equals 2M (M unique tags, each listed twice). The A-B similarity is M/(N-M). Identical list = 100%, no overlap = 0%.

Possible problem? If A's list is long, and B has a short list which matches part of A's perfectly, overlap is still small.
posted by mediaddict at 8:11 PM on December 16, 2005


This thread is closed to new comments.