Are there statistical formulas that, unlike averages, account for skew?
April 1, 2013 10:31 PM
I've been thinking about product ratings online. Product A and B both have an average rating of 4 stars. Product A is universally liked: every reviewer gave it 4 stars. Product B is the Twilight Series: lots of people love it (5 stars), but many 1 star reviews drag the average down to 4 stars.
Other than displaying the rating distribution (# of 1 through 5 star reviews), are there well-known formulas that would give Product A a higher rating?
I think what I'm asking about are weighted means, or some sort of formula that takes into account variance or skew.
But rather than re-invent the statistical wheel, I was hoping some of you may be able to point out well-known examples of good weighted formulas, or research related to this question.
Hope this is clear! Thank you!
Well, averages are more sensitive to outliers than other measures of central tendency. In a case like the one you describe, the median would give a more realistic view of the midpoint of the ratings, so in the case of Product A it would be 4, whereas for Product B the exact number would depend on exactly how many ratings were given overall, and at each level. The mode, on the other hand, would show the most frequent rating: again, for Product A it would be 4, but Product B's would depend on other factors.
I don't think there's any way to say up front, without knowing how many ratings or their general distribution, which measure would rate Product A higher. It would be nice if rating sites routinely gave more complete stats (at least mean, median, mode) for each product, but I suspect most people would find that more confusing than helpful. Could be nice as an opt-in feature or a separate "more info" screen, though.
An actual statistician could probably give you more sophisticated advice here. I use statistics, but not many of the kind that would help you here.
posted by Superplin at 10:40 PM on April 1, 2013
Best answer: I think a geometric average rather than an arithmetic average would be helpful.
5, 5, 5, 1, 1, 1 are the votes.
The mean (arithmetic average) is 3.
The geometric average is 2.23.
("Geometric average": for n numbers, multiply them together, then take the nth root of the product.) The only thing you have to watch out for is that you don't have any zeros.
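A minimal sketch of that comparison in Python, using the standard library's `statistics.geometric_mean` (Python 3.8+). The `product_a` and `product_b` lists are hypothetical ratings made up to match the question (both average 4 stars); only the 5, 5, 5, 1, 1, 1 set comes from this answer.

```python
from statistics import geometric_mean, mean

# The votes quoted in this answer.
votes = [5, 5, 5, 1, 1, 1]

# Hypothetical ratings matching the question: both average exactly 4 stars.
product_a = [4] * 8            # everyone gave 4 stars
product_b = [5] * 6 + [1] * 2  # mostly 5s, dragged down by two 1s

for name, ratings in [("votes", votes), ("A", product_a), ("B", product_b)]:
    # The geometric mean needs strictly positive values, so 1-5 stars are fine.
    print(name, round(mean(ratings), 2), round(geometric_mean(ratings), 2))
```

With these made-up lists the arithmetic means of A and B tie, while the geometric mean ranks A above B, which is the behavior the question asks for.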
posted by Chocolate Pickle at 10:57 PM on April 1, 2013 [3 favorites]
Best answer: What about checking higher moments, like the third standardized moment, aptly named skewness? Maybe more to the point of assessing ratings, especially considering that people's opinions often polarize and clearly separate between a group of 5s and 4s and a group of 1 stars, would be a statistic of bimodality, e.g. the bimodality coefficient which combines both skewness and kurtosis into a very simple formula.
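As a rough sketch, skewness and the bimodality coefficient can be computed from central moments. This uses the simple large-sample form b = (skewness² + 1) / kurtosis with non-excess kurtosis, which is an assumption on my part; small-sample versions add a correction term. The helper names and the example ratings are hypothetical.

```python
from statistics import fmean

def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mu = fmean(data)
    return fmean([(x - mu) ** k for x in data])

def skewness(data):
    """Third standardized moment."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth standardized moment (not excess kurtosis)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

def bimodality_coefficient(data):
    """(skewness^2 + 1) / kurtosis, or None when every value is identical."""
    if central_moment(data, 2) == 0:
        return None  # zero variance: skewness and kurtosis are undefined
    return (skewness(data) ** 2 + 1) / kurtosis(data)

# Hypothetical ratings: a polarized product and a universally liked one.
polarized = [5] * 6 + [1] * 2
unanimous = [4] * 8

print(skewness(polarized))                # negative: the tail points at the 1s
print(bimodality_coefficient(polarized))  # close to 1, the maximum
print(bimodality_coefficient(unanimous))  # None
```

Values above roughly 5/9 (the value for a uniform distribution) are often read as a hint of bimodality. Note that a unanimous product has no variance at all, so it needs special handling.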
posted by phphph at 11:01 PM on April 1, 2013 [4 favorites]
A different example:
5 5 5 5 5 5 1 1
Mean is 4.
Geometric average is 3.34.
posted by Chocolate Pickle at 11:03 PM on April 1, 2013
Computing the standard deviation should give you an idea of how controversial the rating is -- if it's all 5's and 1s, it's going to be much higher than a rating of all 4s and 3s.
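A quick sketch with Python's `statistics.pstdev` (population standard deviation). The two rating lists are hypothetical examples of "all 5s and 1s" versus "all 4s and 3s".

```python
from statistics import mean, pstdev

# Hypothetical ratings for a controversial and an uncontroversial product.
controversial = [5, 5, 5, 1, 1, 1]
agreeable = [4, 4, 4, 3, 3, 3]

for name, ratings in [("controversial", controversial), ("agreeable", agreeable)]:
    # pstdev treats the list as the whole population of ratings.
    print(name, "mean:", mean(ratings), "std dev:", pstdev(ratings))
```

Here the spread is 2.0 for the first list and 0.5 for the second, so the standard deviation separates them even though it says nothing about which one is "better".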
posted by empath at 11:13 PM on April 1, 2013 [3 favorites]
Product A is universally liked: every reviewer gave it 4 stars. Product B is the Twilight Series: lots of people love it (5 stars), but many 1 star reviews drags down the average to 4 stars.
Also, something else to keep in mind -- you don't have any particular reason to privilege 5s over 1s. Neither one is an outlier if there are enough 1s to significantly drag down the rating. This is why Netflix spent so much money coming up with a recommendation engine that isn't based on averages, but rather on similarities between the raters and other criteria.
posted by empath at 11:15 PM on April 1, 2013 [2 favorites]
Best answer: Have a look at "How Not To Sort By Average Rating"
posted by richb at 12:58 AM on April 2, 2013 [11 favorites]
Best answer: You haven't clearly expressed what your goal is. Every statistic has particular properties that make it special, and suitable for a particular goal. It may be that what you want can't be expressed in a single number, but in order to understand whether that's the case we have to know what it is you want to know.
But to answer your question, it may help you to know that the mean is the special number that minimizes the sum of *squared* distances between all the values and itself. You can imagine changing how much weight the low and high ratings get in several ways:
1. Change the power, so you're not talking about *squared* distance, but maybe absolute distance. The median, for instance, is the number that minimizes absolute distance. If you change to the median, that product with the 1 rating will get rated higher. If you increase the power to cubing, instead of squaring, the opposite will happen. The 1 rating will drag the "average" rating down more.
2. You can transform before you take the mean. The geometric mean (mentioned above) is equivalent to taking the logarithm of each value, averaging, and then exponentiating the result. But you're not limited to the logarithm: imagine the transformation f(x) where all values that are 1 will get changed to -100, and all other values remain the same. Then you take the mean. This is a perfectly valid transformation, and does exactly what you want: it penalizes the 1 rating. You can imagine penalizing all the ratings less than five in the same way, but each subsequently less: for instance, 1 -> -100, 2 -> -50, 3 -> -20, 4 -> 0. That's extreme, but it demonstrates the point. What you're applying here is an extreme penalty for low scores. You can make up whatever penalty you like, and apply it.
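Both ideas can be sketched in a few lines of Python. `power_center` is a hypothetical brute-force helper that scans a grid from 1 to 5 for the value minimizing the sum of |x - c| ** p. The `PENALTY` table follows the example mapping above, with 5 -> 5 added as my own assumption since no value for 5 is given. The two rating lists are made up to match the question.

```python
from statistics import fmean

def power_center(ratings, p):
    """Grid-search the c in [1, 5] that minimizes sum(|x - c| ** p)."""
    candidates = [i / 100 for i in range(100, 501)]
    return min(candidates, key=lambda c: sum(abs(x - c) ** p for x in ratings))

# Example penalty table; the entry for 5 is an assumption.
PENALTY = {1: -100, 2: -50, 3: -20, 4: 0, 5: 5}

def penalized_mean(ratings, penalty=PENALTY):
    """Transform each rating through the penalty table, then average."""
    return fmean([penalty[r] for r in ratings])

# Hypothetical ratings: both have an arithmetic mean of 4.
product_a = [4] * 8
product_b = [5] * 6 + [1] * 2

print(power_center(product_b, 1))  # median-like: ignores how far away the 1s are
print(power_center(product_b, 2))  # the ordinary mean
print(power_center(product_b, 3))  # pulled further toward the 1s
print(penalized_mean(product_a), penalized_mean(product_b))
```

For these lists the power-1 center is 5, the power-2 center is 4, and the power-3 center lands near 3.5, while the penalty table scores A at 0 and B well below zero.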
posted by Philosopher Dirtbike at 2:03 AM on April 2, 2013 [4 favorites]
Best answer: I'm not sure that Product A should have a higher rating than Product B in all cases. If someone sells a film that people either love or hate (i.e., 1 or 5 stars), I'd be more interested in watching it than a film that everyone thinks is OK (all 3 stars).
One way of displaying the difference in spread of distributions is standard deviation. Maybe this could be displayed alongside the mean, or combined with it in a slightly hacky way if you wanted to be able to search? I'd love it if you could search IMDB ratings by standard deviation alone.
Even this, though, doesn't give you all of the information about the distribution. One website that does is Amazon. If you give the customer one of those little graphs of stars, then they're free to make their own decision about what kind of distribution they want.
posted by Ned G at 2:17 AM on April 2, 2013
A violin or box plot will show you at a glance how your data are distributed. Skewness will show up as asymmetry in the distribution, relative to the median.
posted by Blazecock Pileon at 2:19 AM on April 2, 2013 [1 favorite]
Best answer: Wilson score confidence?
posted by A Terrible Llama at 2:45 AM on April 2, 2013 [1 favorite]
This is why we look at all of the basic properties of any set of numbers: mean, median, and mode.
Mean is just the average, and as you've found, that has its drawbacks.
Median is the middle number in the set. This is helpful in revealing skewed data. Consider the sets {1, 10, 10} and {6, 7, 8}. The mean of both is 7, but the medians are different: 10 and 7. This gets even more dramatic in things like income distributions. Take newly-graduated attorneys, for example. Mean starting salary? Even today, it's something like $80-90k. Median starting salary? Closer to $60k. This suggests that there are a few really big numbers in the set, but a lot of much smaller ones.
Mode is also significant. It's the number that appears most frequently in the set. It's a bit less useful than median, because it only tells you about one number, but in combination with the others it can give you a pretty fair sense of what's going on in your data set, particularly in large ones.
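For anyone who wants to check the two example sets, here is a small sketch using Python's `statistics` module. `multimode` (Python 3.8+) returns every most-frequent value, which matters for a set like {6, 7, 8} where nothing repeats.

```python
from statistics import mean, median, multimode

skewed = [1, 10, 10]
symmetric = [6, 7, 8]

for data in (skewed, symmetric):
    # multimode returns a list, because a set can have several modes.
    print(data, "mean:", mean(data), "median:", median(data),
          "mode(s):", multimode(data))
```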
Of course, all of these are basically useless when it comes to analyzing something like user ratings. The data in that set doesn't tell you anything at all, except perhaps whether more fans or haters have discovered a particular product and bothered to review it. This is especially true when n is small, but even when it's not, user ratings are not, in general, a good way of obtaining useful information about the world, as they're based largely on self-selection and frequently motivated by either the need to air a grudge or (in Amazon's case) self-interested self-promotion. So you see lots of fours and fives, lots of ones, and almost no twos or threes. This is not what one would expect out of a meaningful rating metric. It also winds up rating instant coffee, comic books, guitars, and movies on the same metric, i.e., completely and totally disparate things.
posted by valkyryn at 3:47 AM on April 2, 2013
I agree with empath: it seems like you are looking for the standard deviation.
Consider two sets of numbers:
1 1 5 5 5 5
3 3 4 4 4 4
Both have an average of about 3.67. But the (population) standard deviation of the first set is about 1.89, and the standard deviation of the second set is about 0.47. High standard deviation means the results are more chaotic, low means they are tightly grouped.
What this tells you, in the case of product reviews (assuming the ratings are more or less honest and randomly distributed), is that in the first case, you can't be sure whether you will like or hate the product. In the second case, you can be more sure that you'll find the product to be around a 3 or 4.
If it's opinion based, then it's telling you that opinions vary widely and you can't draw much information from the average. Or they don't, and you can.
But yes, the point about ratings being highly subjective and self-selected is quite valid. You have to have good data that asks the question you want answered and isn't poisoned by noise.
posted by gjc at 6:28 AM on April 2, 2013
Something to think about regarding geometric mean: it's not invariant to a linear transformation of the ratings.
E.g., if the ratings are 5, 5, 5, 1, 1, 1 the geometric mean is 2.23.
Now imagine that the site decides, for some bizarre reason, that it's going to have users rate products on a scale of 101-105, and the old ratings are converted to the new scale.
Ratings: 105, 105, 105, 101, 101, 101. Geometric mean: 102.98 (very close to arithmetic mean of 103).
Not saying you can't use the geometric mean, but I'd be aware of its limitations.
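A small sketch of that effect, using `statistics.geometric_mean` from Python's standard library (3.8+). The shift of 100 mirrors the hypothetical 101-105 scale described above.

```python
from statistics import geometric_mean, mean

old_scale = [5, 5, 5, 1, 1, 1]
new_scale = [r + 100 for r in old_scale]  # same opinions, shifted scale

for ratings in (old_scale, new_scale):
    gap = mean(ratings) - geometric_mean(ratings)
    # The gap is the "penalty" the geometric mean applies for disagreement.
    print(ratings, round(geometric_mean(ratings), 2), round(gap, 2))
```

The gap between the two means shrinks from about 0.76 to about 0.02 after the shift, even though the reviewers' opinions are unchanged.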
posted by DevilsAdvocate at 7:03 AM on April 2, 2013
Skewness (third moment) has already been mentioned; you should also look into kurtosis (fourth moment) and the method of moments in general.
Mean = 1st moment
Variance = 2nd moment
Skewness = 3rd moment
Kurtosis = 4th moment
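One detail worth spelling out, sketched below: skewness and kurtosis are usually quoted as standardized moments, meaning each deviation from the mean is divided by the standard deviation before being raised to the k-th power. `standardized_moment` is a hypothetical helper and the ratings are made up.

```python
from statistics import fmean, pstdev

def standardized_moment(data, k):
    """Average of ((x - mean) / std dev) ** k; needs a nonzero spread."""
    mu = fmean(data)
    sigma = pstdev(data)
    return fmean([((x - mu) / sigma) ** k for x in data])

# Hypothetical polarized ratings with a mean of 4.
ratings = [5] * 6 + [1] * 2

print("mean:", fmean(ratings))
print("variance:", pstdev(ratings) ** 2)
print("skewness:", standardized_moment(ratings, 3))
print("kurtosis:", standardized_moment(ratings, 4))
```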
posted by 445supermag at 7:33 AM on April 2, 2013
For data sets like this, rather than going with a straight median, it can be useful to look at the interpolated median.
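Python's standard library has this as `statistics.median_grouped`, which treats each star value as the midpoint of an interval (here width 1) and interpolates within it. A minimal sketch with hypothetical ratings:

```python
from statistics import median, median_grouped

# Hypothetical ratings: both have an arithmetic mean of 4.
product_a = [4] * 8
product_b = [5] * 6 + [1] * 2

for name, ratings in [("A", product_a), ("B", product_b)]:
    # interval=1 because star ratings are one unit apart.
    print(name, median(ratings), round(median_grouped(ratings, interval=1), 2))
```

Unlike the plain median, the interpolated version moves when the balance of ratings around the median changes. Note that for these particular lists it still ranks B above A, so it does not by itself favor the uncontroversial product.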
posted by solotoro at 8:18 AM on April 2, 2013
Came in here to say "Lower bound of Wilson score confidence interval for a Bernoulli parameter" but richb and A Terrible Llama beat me to the exact link I had in mind. I think I just saw it on Hacker News some time last week?
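For reference, a minimal sketch of that lower bound in Python. It works on positive/negative counts, so star ratings have to be collapsed first; counting 4 and 5 stars as "positive" is my own assumption, as are the function names and the example ratings. z = 1.96 corresponds to roughly 95% confidence.

```python
from math import sqrt

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a Bernoulli parameter."""
    if total == 0:
        return 0.0  # no ratings yet: nothing to be confident about
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * sqrt((p_hat * (1 - p_hat) + z2 / (4 * total)) / total)
    return (centre - margin) / (1 + z2 / total)

def score(ratings, threshold=4):
    """Assumption: a rating of `threshold` stars or more counts as positive."""
    positive = sum(1 for r in ratings if r >= threshold)
    return wilson_lower_bound(positive, len(ratings))

# Hypothetical ratings: both have an arithmetic mean of 4.
product_a = [4] * 8
product_b = [5] * 6 + [1] * 2

print(round(score(product_a), 3), round(score(product_b), 3))
```

Under these assumptions A scores higher than B, and a product with the same proportions but far more reviews scores higher still, since the bound tightens as the number of ratings grows.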
posted by RedOrGreen at 11:48 AM on April 2, 2013
This thread is closed to new comments.
posted by animalrainbow at 10:40 PM on April 1, 2013 [3 favorites]