Excel Formula to Weight Ratings By Number of Ratings?
July 5, 2016 9:14 AM   Subscribe

How do I write an Excel formula to weight a rating by the number of ratings, so that a 4 with tens of thousands of ratings weighs more than a 5 with three ratings?

I have a list of books with rating and number of ratings. For example, a line of data might be, in columns A, B and C, respectively: "Firestarter, 4.8, 42573" or "Funny Anemones, 5.0, 3".

I'd like to sort these by rating, but sort them in a way where a higher number of ratings makes a rating more valuable. For example, a five-star rating from two ratings would mean a lot less than a four-star rating from a thousand ratings.

The closest answer I could find via Google was this, but the math in that answer is above my understanding, so that I don't know how to translate it into a formula.
posted by WCityMike to Science & Nature (13 answers total) 6 users marked this as a favorite
Why not calculate a confidence interval around the mean? Then you could sort on the lower bound of the confidence interval (you could also sort on the upper bound, but since you want ot weight down fewer ratings, the lower bound is the most relevant).

The confidence interval would take into account both the number and variability of ratings. So the fewer ratings and the more variable the ratings, the larger the confidence interval. Two things rated 4 but one with few very variable ratings and one with many consistent ratings could have confidence intervals of say 1.3 to 6.7 (yeah, they're symmetric, so it could go above 5, but you don't care about the upper bound, anyway) and 3.8 to 4.2.

To calculate a 95% confidence interval, calculate the standard deviation of each measure (excel does this), divide by the square root of the number of ratings, which gives you the standard error. Multiply the standard error by 1.96 and then add (for the upper bound) and subtract (for the lower bound) that number from the mean. 1.96 is for a 95% confidence interval. You can use other numbers for other levels of confidence.

Confidence intervals.
posted by If only I had a penguin... at 9:24 AM on July 5, 2016 [4 favorites]

It sounds like you are looking for the weighted average of your ratings.

Assuming your columns "Funny Anemones, 5.0, 3" are columns A, B and C,

for cell D1, enter "=B1*C1".

Now, copy and paste that formula through the entire column of D.

Finally, assuming you have 342 entries, then for one cell (say E1) enter "=sum(D1:D342)/sum(c1:c342)". The answer will be your weighted average.
posted by Dashy at 9:32 AM on July 5, 2016 [1 favorite]

For reference, the reason you're finding it hard to locate a formula is that there is no "right" way to do this: you have to decide how much each factor contributes to the overall score.

For example, should a ★★★ with 10000 ratings beat a ★★★★★ with one? Etc, etc.

I think the confidence intervals formula above is nice.
posted by katrielalex at 9:36 AM on July 5, 2016 [4 favorites]

I'm not a whiz at this, but I think that formula from your link should look like this in Excel:

That's assuming you've got your rows/columns like listed above:
A, B, C,
Firestarter, 4.8, 42573,

Running those two data points you list above gives the following with the last column being the weighted rating:
Firestarter, 4.8, 42573, 4.9
Funny Anemones, 5, 3, 2.5
posted by gregr at 9:37 AM on July 5, 2016

One easy and standard way to do this is to pretend every movie has some fixed number of average reviews in addition to the ones you know about.

Say you add 10 reviews with a rating of 3 to every movie. Then Firestarter, instead of getting (4.8 * 42573) / 42573 = 4.8, would get (4.8 * 42573 + 3 * 10) / (42573 + 10) = 4.7996. Meanwhile, Funny Anemones, instead of getting (5.0 * 3) / 3 = 5.00, would get (5.0 * 3 + 3.0 * 10) / (3 + 10) = 3.46.

The more "fake reviews" you add, the more real reviews you will need to drag it away from the mediocre rating that the fake reviews are weighting you down with.
posted by dfan at 9:53 AM on July 5, 2016 [5 favorites]

Best answer: Here's a great article about this problem: How Not To Sort By Average Rating.
posted by miyabo at 9:56 AM on July 5, 2016 [6 favorites]

The article miyabo links to is good, but note that it only applies to Bernoulli processes (where there are only two outcomes, in this case positive and negative reviews). I don't know if there's a generalization to the multinomial case (where ratings can take on 5 different values, say).
posted by dfan at 10:52 AM on July 5, 2016 [2 favorites]

The article miyabo links suggests calculating a confidence interval and using the lower bound. The generalization is what I posted above.

A weighted average is wrong because the weighted average gets you ONE NUMBER, not one number per item. So rather than knowing what the weighted-rating for each book on Amazon was, you would know that the weighted average of all book ratings on Amazon is 3.4 (or whatever). It wouldn't tell you anything about ratings for individual books.
posted by If only I had a penguin... at 11:07 AM on July 5, 2016 [1 favorite]

I just realized that the confidence interval approach wouldn't work with only one rating. You can't calculate a proper standard error. I would substitute in the max standard error for anything with one rating.
posted by If only I had a penguin... at 11:20 AM on July 5, 2016

Though I would agree that confidence intervals would be a good approach if the required data were available, they don't seem possible with the level of data that the OP has: a book's mean rating, and the number of individual ratings that contribute to the book's mean rating.

Without the individual rating-level data (e.g., 42573 rows of data for "Firestarter", 3 rows of data for "Funny Anenomes", etc.), or a summary measure of the variability of those individual ratings (standard deviation), I do not see how confidence intervals could be calculated.
posted by mean square error at 11:49 AM on July 5, 2016 [3 favorites]

One easy and standard way to do this is to pretend every movie has some fixed number of average reviews in addition to the ones you know about.

This is known as a Bayesian average. I believe standard procedure is to give the "fake reviews" the average value of all reviews across all items. But as to how many "fake reviews" to include, I'm not sure there's a universally accepted answer to that. This page suggests the number should be the average number of reviews of all items, but I feel that may be too high.

As for the Excel formula: let's say you have names in A1-A100, average ratings in B1-B100, and number of ratings in C1-C100. Put the value for your fake rating in F1 — if the average of all ratings, that's =SUMPRODUCT(B1:B100,C1:C100)/SUM(C1:C100) — and put the number of fake ratings in F2 — if the average number of ratings, that's =AVERAGE(C1:C100), or just spitball what seems like a good value.

Now your Bayesian average for the first row is =(B1*C1+F$1*F$2)/(C1+F$2). Enter that in D1, then copy and paste into D2-D200.
posted by DevilsAdvocate at 11:50 AM on July 5, 2016 [4 favorites]

Best answer: Came here to suggest the same link as miyabo. ( I might have submitted that 2016 update with the Excel formula :)
posted by Spumante at 12:47 PM on July 5, 2016

Response by poster: Thank you, everyone. Revisiting this to let you know what I did, for any future Googlers.

I re-visited the Amazon page of each book on my reading list (112 of 'em!) and copied the breakdown of ratings from 1-5 (how many were 5, how many were 4, etc.). Then, I applied the math here to each rating to tally them into positive and negative values, and then used Spumante's Excel variant. (The construction of it like [@[Up Votes]] threw me ... )

So, placing it in descending order with number of 5 ratings in F2, these were my formulas:

Positive (K2):

Negative (L2):

Rating (M2):
=IFERROR(((K2+1.9208)/(K2+L2)-1.96*SQRT((K2* L2)/(K2+L2)+0.9604)/(K2+L2))/(1+3.8416/(K2+L2)),0)

Thank you!
posted by WCityMike at 8:45 PM on September 21, 2016

« Older What sort of snake is this?   |   How best to sell a boat? Newer »
This thread is closed to new comments.