Use statistics to prove that one kitten is cuter than another
August 13, 2010 12:58 PM

I don't know enough statistics to prove that a certain kitten at kittenwar.com is cuter than another, but I know in my heart that it is true. Any statisticians out there wanna help me out?

Okay, over at kittenwar.com, they endlessly show you two kittens and let you pick which one you think is cuter. It keeps track of how many battles each kitten has won, and you can see which kittens have won the highest percentage of battles on the "Winningest Kittens" page.

Now, in my opinion, the cutest kitten is Joe Dirt. I am not alone in this opinion. Joe Dirt has won 3345 of 4368 battles (77%). For many years, he has been in the top ten of the "Winningest Kittens" page.

Well, sometime since the last time I looked, he's been pushed out of the top ten, and you now have to go to the second page to see him.

The thing is, sure these brash young upstarts have won a higher percentage of their battles than venerable old Joe Dirt. (They've won 81%, 78%, 78%, respectively). But they've won a high percentage of a substantially smaller number of battles! (113, 487, and 461, respectively, to Joe Dirt's 4368).

So I would argue that Joe Dirt is the cutest, as shown by his "staying power".

Somehow, I got a BS in computer science without ever taking a statistics class.

Is there a way to mathematically formulate the idea of "staying power" to show that Joe Dirt is the real winner?

Thanks.
posted by Galaxor Nebulon to Science & Nature (20 answers total) 21 users marked this as a favorite
 
Some sites use a Bayesian average to try to work around the problem of different number of votes (battles) for different items.
posted by smackfu at 1:09 PM on August 13, 2010


You may want something like the True Bayesian Estimate, as used to calculate the IMDb top 250 (scroll to the very bottom to see the formula repeated there). See also this Google Answers discussion of potential alternatives to the IMDb formula that solve the same problem.
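
A minimal sketch of that kind of weighted average in Python, applied to win rates instead of star ratings. The prior win rate of 0.5 and the prior weight of 100 battles are assumptions picked for illustration, and the upstart's 92 wins is back-calculated from the rounded "81% of 113 battles" in the question, not a real count.

```python
def bayesian_average(wins, battles, prior_rate=0.5, prior_weight=100):
    """Shrink a kitten's win rate toward an assumed site-wide average.

    Equivalent to (v / (v + m)) * R + (m / (v + m)) * C with
    R = wins / battles, v = battles, m = prior_weight, C = prior_rate.
    prior_rate and prior_weight are illustrative guesses, not site data.
    """
    return (wins + prior_rate * prior_weight) / (battles + prior_weight)

joe_dirt = bayesian_average(3345, 4368)
upstart = bayesian_average(92, 113)  # assumed win count, roughly 81% of 113

print(f"Joe Dirt: {joe_dirt:.3f}")
print(f"Upstart:  {upstart:.3f}")
```

With these settings Joe Dirt scores about 0.76 and the upstart about 0.67. The result is sensitive to the prior weight: with a weight of 10 instead of 100 the upstart comes out ahead again, so the choice of weight is really a statement about how many battles a kitten needs before its record is trusted.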
posted by Partial Law at 1:11 PM on August 13, 2010


Sure, it would be possible to come up with some kind of more complicated formula that would put Joe Dirt on top--e.g., you could set an arbitrary minimum number of victories that excludes one or more of the Johnny-come-latelies, or create a metric that measures total number of victories multiplied by winning percentage multiplied by the number of David Spade title characters the kitten shares its name with. Some people would argue that these kinds of systems lack the elegant simplicity of just counting winning percentage, or that the Spade thing would unduly favor kittens named Dickie Roberts or Joe Dirt.

But by the criteria that the folks at kittenwar have decided on, Joe Dirt is not one of the top ten winningest kittens.
posted by box at 1:13 PM on August 13, 2010 [1 favorite]


You will get a more detailed answer than you ever wanted if you ask this over at the Statistical Analysis stackexchange. It's frequented by a ton of really good statisticians, and I'm sure they'd love a fun diversion like this.
posted by chrisamiller at 1:14 PM on August 13, 2010 [1 favorite]


Best answer: The correct answer to this problem is to use the lower bound of the Wilson score confidence interval for a Bernoulli parameter.

Details

Reddit uses this to sort comments, and it works really well. The above article should be forced reading for all webmasters.
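
A minimal sketch of the Wilson lower bound in Python, at roughly 95% confidence (z = 1.96). Joe Dirt's record is from the question; the three upstarts' win counts are assumptions, back-calculated from the rounded percentages and battle counts given there.

```python
import math

def wilson_lower_bound(wins, battles, z=1.96):
    """Lower end of the Wilson score interval for a win proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if battles == 0:
        return 0.0
    p = wins / battles
    z2 = z * z
    centre = p + z2 / (2 * battles)
    margin = z * math.sqrt(p * (1 - p) / battles + z2 / (4 * battles * battles))
    return (centre - margin) / (1 + z2 / battles)

kittens = {
    "Joe Dirt": (3345, 4368),
    "upstart A": (92, 113),   # assumed wins, about 81% of 113
    "upstart B": (380, 487),  # assumed wins, about 78% of 487
    "upstart C": (360, 461),  # assumed wins, about 78% of 461
}

ranking = sorted(kittens, key=lambda name: wilson_lower_bound(*kittens[name]), reverse=True)
for name in ranking:
    print(f"{name}: {wilson_lower_bound(*kittens[name]):.3f}")
```

Under these assumed counts Joe Dirt's lower bound is about 0.753, while the three upstarts land between about 0.73 and 0.74, so sorting by this score would put Joe Dirt back ahead of them.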
posted by Mwongozi at 1:19 PM on August 13, 2010 [7 favorites]


I would just like to mention that I agree that kittenwar uses flawed methods as evidenced by this outrage and thank all of the folks above who provided answers on exactly why.
posted by Maisie at 1:28 PM on August 13, 2010 [1 favorite]


I think Mwongozi is right, the Wilson score CI looks like a good answer to your question.

That said, the goal of that CI is to "balance the proportion of positive ratings with the uncertainty of a small number of observations", and I don't think (but didn't check) that the number of votes that the upstart kittens have received (113, 487, 461) is "small enough" to make a difference in the rankings...
posted by JumpW at 1:34 PM on August 13, 2010


Looking at their top kitten page it looks like all of their kittens are only at about 200-300 battles.

Now, IF there's a LOT of kittens competing and IF there's a high influx of new kittens at the site and IF the winningest kitten is supposed to be the kitten that an average kittenwar user thinks is cutest (disregarding any problems with transitive preferences).

THEN you could probably argue in good faith that all the top kittens are just outliers and that any kitten would need at least a thousand votes to be eligible for the coveted top spot. Higher if there's a LOT of new kittens on the site.
posted by Greald at 2:11 PM on August 13, 2010


That raises another issue: the chance of a kitten being chosen for a battle is only 2/n. So the more kittens get added to the system, the harder it is to get a high battle count. That makes it questionable whether it is really "fair" to use the battle count as a factor in the ratings, since it would tend to favor kittens that were added to the system earliest.
posted by smackfu at 2:15 PM on August 13, 2010 [1 favorite]


I'd say give the prize to the kitten that has proven his or her mettle and stuck it out through the lean times.

Otherwise you'd get an influx of unproven and unseasoned noob kittens, all just getting a glimpse of fame through sheer luck. Only to be relegated to the dustbin of kitten history with the rest of the old timers.

Nope, the winner should have faced a certain number of battles, and persevered through blood, sweat and pure determination, rather than blind luck.
posted by Greald at 2:36 PM on August 13, 2010


That losingest kittens site is hilarious. The losiest kittens in Loserville.
posted by yeti at 2:53 PM on August 13, 2010


Not enough info from the website to define "staying power" for our purposes imho. Like how he has fared in his battles over time. Perhaps he is losing more now as his cuteness fades?
posted by mandymanwasregistered at 2:54 PM on August 13, 2010


Best answer: What would happen if you did a t-test?

Say a win = 1, lose = 0, draw = .5

Joe Dirt. n1 = 4368 battles:
Won: 3345
Lost: 699
Drawn: 324
sample mean 1 = 0.802884615 (yay infinite significant figures)
std dev 1 = 0.3738

Larry. n2 = 118 battles:
Won: 95
Lost: 11
Drawn: 12
sample mean 2 = 0.855932203
std dev 2 = 0.3142

Enter those values into a calculator like this one.

So, assuming equal variance, the test of [Joe Dirt group's mean] > [Larry group's mean] comes out at t of about -1.53, with a one-sided p-value of about .94

So there's that. A bit of a weird way to apply the t-test, I think. In this case we're comparing the Population of folks exposed to Joe Dirt, vs Population of folks exposed to Larry. Then we do the t-test to compare the two populations' average cuteness detection. (So there's a logical leap to which is the cuter kitten.) Also the groups aren't really independent...
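
A rough sketch of that calculation in plain Python, scoring win = 1, draw = .5, lose = 0 as above. It uses the pooled (equal-variance) t statistic and, as a simplifying assumption, a normal approximation for the p-value, which should be close with several thousand degrees of freedom. Computed this way from the raw counts, the sample standard deviations are about 0.37 and 0.31.

```python
import math
from statistics import NormalDist

def summarize(wins, losses, draws):
    """Return (n, mean, sample std dev) for scores win=1, draw=0.5, loss=0."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    squared_dev = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * (0.0 - mean) ** 2
    )
    return n, mean, math.sqrt(squared_dev / (n - 1))

def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances."""
    n1, m1, s1 = a
    n2, m2, s2 = b
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

joe = summarize(3345, 699, 324)
larry = summarize(95, 11, 12)
t = pooled_t(joe, larry)

# One-sided p-value for "Joe Dirt's mean is greater" (normal approximation).
p_joe_greater = 1 - NormalDist().cdf(t)

print(f"std devs: {joe[2]:.4f}, {larry[2]:.4f}")
print(f"t = {t:.2f}, one-sided p = {p_joe_greater:.2f}")
```

A p-value this large for "Joe Dirt's mean is greater" just means the data give no support for that direction; it is not by itself evidence that Larry is cuter.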
posted by sentient at 3:56 PM on August 13, 2010


I would view this as a classic hypothesis test question. The proper procedure in this case is a test for comparing two proportions.

Of critical importance is stating the null hypothesis and the alternative hypothesis. In your case, the null is that Joe Dirt is no cuter than some other kitten, say Larry. The alternative hypothesis is that Joe Dirt is cuter than Larry.

To use this test, I will ignore the draws and consider only wins and losses. Thus Joe Dirt's win/loss proportion is p1 = 3345/(3345+699) = .827 and Larry's is p2 = 95/(95+11) = .896. So we can state our hypotheses as H0: p1 less than or equal to p2 and H1: p1 greater than p2.

Plugging these values into the equation for the test statistic, we get a z score of -1.86. Because p2 > p1, we clearly cannot reject the null here. In other words, there is not sufficient evidence to conclude that Joe Dirt is significantly cuter than Larry.

What if the null and alternative had been swapped? In other words, is there sufficient evidence to conclude that Larry is cuter than Joe Dirt? The same test statistic gives a p-value of .0311. So at a standard significance level, we would conclude that, yes, there is sufficient evidence to conclude that Larry is cuter than Joe Dirt.

Intuitively, what this is saying is that Larry has been in enough battles to allow us to conclude that it is quite likely that his proportion is actually higher than Joe Dirt's.

The bottom line is that the stats are not on your side on this one!
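
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two-proportion z-test with draws ignored. It assumes the usual pooled-proportion standard error, which is the version that reproduces the z score of -1.86.

```python
import math
from statistics import NormalDist

def two_proportion_z(wins1, n1, wins2, n2):
    """z statistic for p1 - p2 using the pooled proportion under the null."""
    p1 = wins1 / n1
    p2 = wins2 / n2
    pooled = (wins1 + wins2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_err

# Wins and decided battles (wins + losses) for Joe Dirt and Larry.
z = two_proportion_z(3345, 3345 + 699, 95, 95 + 11)

# One-sided p-value for the alternative "Larry's proportion is higher".
p_larry_higher = NormalDist().cdf(z)

print(f"z = {z:.2f}, p = {p_larry_higher:.4f}")
```

Note that the standard error is dominated by Larry's 106 decided battles, so a modest change in his record would move the p-value across the usual .05 line.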
posted by notme at 6:59 PM on August 13, 2010 [1 favorite]


Joe Dirt might be cuter in every way, but at some point, users will tire of his antics and start choosing other kittens just for the variety of it.

Also: kittens are cutest when people are ugliest: at the awkward adolescent stage. Their ears have gotten bigger and moved to the top of their heads; their eyes are still proportionally larger, but they *have* changed from their birth blue-and-round to their adult color and iris shape, mostly; their limbs are very close to adult proportions; but they maintain some of that kitteny fuzziness.
posted by gjc at 7:38 PM on August 13, 2010


Forgot to mention: always cuter when they are contained in things, or in some kind of action shot.
posted by gjc at 7:41 PM on August 13, 2010


I recall this statistical problem well, from when I was ten years old, and the top 40 AM radio station I listened to would have a listener's poll of the greatest songs of all time, and some banal generic now-forgotten ballad from ((then) last-week) would be hangin' out in the top ten with Hey Jude and Stairway to Heaven. Man that sucks.
posted by ovvl at 8:28 PM on August 13, 2010


The best way to determine whether Cat A is cuter than Cat B is to see the results of their head-to-head matches.

The problem with other methods is that you're not comparing apples to apples.

Hypothetical example of how this could go wrong: Joe Dirt had most of his matches against the elite original batch of cats, when it was harder to win. Over time, the site got flooded with people uploading pictures of their mediocre cats, thus making the pool less competitive over time. At this point, it's easy for a brash upstart who's decently cute to join and rack up a super high percentage of wins.

One could test this out by re-uploading a picture of Joe Dirt, and seeing what percentage of matches he wins when he's pitted against today's talent pool.
posted by lunchbox at 10:15 PM on August 13, 2010


Best answer: Hi! I'm the person who coded Kittenwar and I thought I'd clear up some of the technical details behind how the site works, in case anyone's interested. Before I do that I'd like to point out to any real computer scientists and statisticians here that a) I am a self-taught codemonkey and I am painfully aware that my knowledge is full of gaps and b) I don't have the same amount of free time now that I had when I originally wrote the site, so changes and improvements to the site don't happen very fast nowadays.

There isn't any complicated statistical analysis going on with the winningest or losingest lists. We order all kittens who have competed in more than a hundred battles by the proportion of battles they have won (or lost) out of their total battles. That's it. Because of this you're more likely to see recent kittens at the top of the list. The more battles that a winning kitten competes in, the more their score tends downwards, so even the cutest kitten's score drops slowly over time. I'm not clever enough to give this phenomenon a name or explain why it happens in detail. It suits the purposes of the site, though, giving new kittens a chance to show up near the top of the list, and making the lists themselves more interesting. It'd be dull if the list stayed the same all the time.
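
The downward drift is usually called regression toward the mean: a kitten that tops the list after only a hundred or so battles has probably been lucky as well as cute, and the luck washes out with more battles. A toy simulation in Python shows the effect. Everything in it is an assumption for illustration (2,000 equally cute kittens that each win exactly half the time, with no draws), not how the site actually behaves.

```python
import random

def simulate(n_kittens=2000, early=100, later=5000, p_win=0.5, seed=0):
    """Find the leader after `early` battles, then keep battling it.

    Every kitten has the same true win probability, so any early lead
    is pure luck. Returns (early win rate, win rate after more battles).
    """
    rng = random.Random(seed)
    early_wins = [
        sum(rng.random() < p_win for _ in range(early))
        for _ in range(n_kittens)
    ]
    leader = max(range(n_kittens), key=early_wins.__getitem__)
    more_wins = sum(rng.random() < p_win for _ in range(later))
    early_rate = early_wins[leader] / early
    long_run_rate = (early_wins[leader] + more_wins) / (early + later)
    return early_rate, long_run_rate

early_rate, long_run_rate = simulate()
print(f"leader after 100 battles: {early_rate:.2f}")
print(f"same kitten after 5100 battles: {long_run_rate:.2f}")
```

The early leader's rate starts well above 0.5 and falls back toward it, even though no kitten in this toy world is cuter than any other.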

One other factor affects the lists: how we select the kittens for each battle. We used to just pick a pair at random, but with almost 110,000 kittens on the site now that would have meant that recent kittens would take months or years to get anywhere near the 100-battle qualifying criterion for the winningest and losingest lists. So now we select the kittens using a weighted random number generation algorithm based on a Box-Muller transform which really made my head hurt when I put it together. What that actually means is that the more recently a kitten has been added to the site, the more likely it is to get picked to fight. There's still a small chance for all the kittens on the site to compete, though, so the older kittens don't lose out completely. This system also reinforces the tendency of the winningest and losingest lists to contain more recent kittens.
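
A hypothetical sketch of recency-weighted selection built on a Box-Muller transform, in Python. This is not the site's actual code: the half-normal offset from the newest kitten, the `spread` and `uniform_chance` parameters, and the function names are all assumptions chosen to illustrate the idea.

```python
import math
import random

def box_muller(rng):
    """One standard normal sample via the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def pick_kitten(n_kittens, rng, spread=0.1, uniform_chance=0.05):
    """Return an index in [0, n_kittens); n_kittens - 1 is the newest kitten.

    spread and uniform_chance are made-up tuning knobs for this sketch.
    """
    if rng.random() < uniform_chance:
        # Occasional fair draw so that old kittens still get to fight.
        return rng.randrange(n_kittens)
    # Half-normal distance back from the newest kitten.
    offset = int(abs(box_muller(rng)) * spread * n_kittens)
    return n_kittens - 1 - min(offset, n_kittens - 1)

rng = random.Random(42)
n_kittens = 110_000
picks = [pick_kitten(n_kittens, rng) for _ in range(10_000)]
recent_share = sum(i >= 0.8 * n_kittens for i in picks) / len(picks)
print(f"share of picks from the newest 20% of kittens: {recent_share:.2f}")
```

A real battle would also need two distinct kittens, which this sketch leaves out; drawing again whenever the second pick matches the first would be one simple way to handle it.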

So, yes, the system we use to compile the winning list doesn't actually allow you to find out the all-time cutest kitten, basically because we think that would be boring. Perhaps I should look at adding an All-time Cutest Kittens list to the site. I'll stick it on the end of my to do list, watch this space...
posted by tomsk at 4:38 AM on August 16, 2010 [7 favorites]


Response by poster: Hi. Thanks, everybody. I don't know much about statistics, so I hope that the ones I marked make sense.

And thanks for the reasoning on the current model. It makes a lot of sense, even if it means that I have to go to page 2 for Joe Dirt. I may feel differently if Joe Dirt slips to page 3, though :)

Maybe one day, a "hall of fame" page with the Wilson/Bernoulli method mentioned above?
posted by Galaxor Nebulon at 7:34 AM on September 14, 2010


This thread is closed to new comments.