Ask a Stats Nerd
December 15, 2009 8:18 AM

I need to ruin everyone's fun by adding a rigorous mathematical scoring system to the company bake-off. Li'l help?

We just had an employee bake-off and while the outcome was satisfactory to all, I think the scoring methodology (average of ratings on a 1-10 scale) was too arbitrary to be statistically meaningful.

How would you build an objective and mathematically sound ranking system based on the following criteria?

* Entries will be judged by all participating employees in each of three categories: taste, presentation, and creativity
* There will be a winner within each category, as well as an overall winner
* Not everyone has to vote on every entry

We make software, so yes, the methodology by which we judge pie is quite critical.
posted by sonofslim to Grab Bag (17 answers total) 1 user marked this as a favorite
It'd be nice to hear more about what you believe to be the defects with the current system.

But off the top of my head-- could you give every judge a certain total number of points to allot in each category (say, 100 taste points, 100 presentation, 100 creativity) and let them divvy the points up among the entries as they choose (up to 10pts per entry per category)? Then first place goes to the entry with the most points overall, and the category winners would be the entries with the most points in any given category.

Not sure how the logistics would work (shoeboxes for each entry, and monopoly money to represent points?), but such a system would provide a rough estimate of the overall popularity of each dish, as well as removing the outcome where one entry with a single vote of 10 wins over the entry with 10x 10-votes and one vote of 9.
posted by Bardolph at 8:38 AM on December 15, 2009

Bean Plate Alert

Taste:
-overall taste experience
-usage of complementary tastes
-moisture (too wet, too dry, etc.)
-quality of ingredients

Presentation:
-color (even? burned? undercooked?)
-palate-ability (idk how to spell that word. Does it look delicious?)
-embellishments (too much? just right?)

Creativity:
-taking something familiar and owning it (apple pie, or Suzie M. Flagbender's Apple Pie?)
-name (seriously, nobody ever gets bonus points for clever names.)

And I'm done. Too much thinking about pie. Time to go eat lunch.
posted by TomMelee at 8:38 AM on December 15, 2009

I would start by deciding which category is the most important. I personally think taste is more important than presentation when it comes to baked goods.

And what is "creativity" exactly? Is it unusual ingredients? My old fuddy-duddy office would treat this as the least important criterion, but again this may differ for you.

In addition to coming up with the formulae, you could also gather tons of data on each submission and use that to present interesting metrics. For example, if you subcategorize the goods into cookies, cakes, and pies, you could then present metrics on average scores of cookies v. pies and stuff like that. Perhaps even put demographic information together with it ("men provided 90% of cookies, but only 10% of pies"). Maybe get each submitter to fill out a short questionnaire on their item.

yes I work with data, why do you ask?
posted by cabingirl at 8:38 AM on December 15, 2009

Oops, I got so excited by the data I left out my first point, which was to give different weights to the three categories when determining an overall winner. So 40% taste, 30% presentation, 30% creativity. Or whatever.
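
A minimal sketch of that weighting, assuming each entry already has one average score per category. The 40/30/30 split comes from the example above; the entry names and scores are made up for illustration.

```python
# Hypothetical weights, taken from the 40/30/30 example above.
WEIGHTS = {"taste": 0.4, "presentation": 0.3, "creativity": 0.3}

def overall_score(category_scores, weights=WEIGHTS):
    """Weighted sum of one entry's per-category scores."""
    return sum(weights[cat] * category_scores[cat] for cat in weights)

# Made-up category averages for two entries.
entries = {
    "apple pie": {"taste": 9.0, "presentation": 6.0, "creativity": 5.0},
    "muffin": {"taste": 7.0, "presentation": 8.0, "creativity": 8.0},
}

# Entries ordered from best to worst overall score.
ranked = sorted(entries, key=lambda name: overall_score(entries[name]), reverse=True)
```

Changing the weights changes the winner, so it is probably worth agreeing on them before anyone tastes anything.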
posted by cabingirl at 8:45 AM on December 15, 2009

Response by poster: Clarification: I'm looking for a formula to normalize subjective scales. What we did was rate each entry on a scale of 1-10 and take the average, which is flawed for several reasons. Not least of these: it's possible for an entry to receive exactly one vote; if that vote is a 10 for taste, it ensures a bogus victory at the category level.

So I'm looking for the methodology and math that will control for a) subjective scales, and b) uneven numbers of votes for each entry. I was thinking of asking people to order the entries they tried by preference, but I'm not sure how to extract a score from that info nor am I convinced it's any less flawed than the sum-of-averages approach.
posted by sonofslim at 8:52 AM on December 15, 2009

Is the problem that people are more or less generous with their numbers? Like, one person gives out all 8s and 9s, while some cranky guy doesn't give anything higher than a 5? If everybody rated everything, it would all average out and this wouldn't be a problem. But since not everyone votes on every entry, you could try to normalize votes to account for this. Take each person and scale their votes to span the entire 1-10 range.

Sometimes I've wondered about constructing a voting system based on binary comparisons. As a judge, it's hard to rate things reliably on a 1-10 scale if you haven't tasted the full range of entries yet, but it's pretty easy to just look at two things and say that one is better than the other. So instead of collecting numeric ratings for items, you could collect comparisons: pie A is more creative than cake B, cupcake C is tastier than pie A, etc. Then you feed those into some ridiculous linear program or something and compute the set of ratings for the items that violates the fewest comparisons.
posted by equalpants at 8:54 AM on December 15, 2009

If you get a high enough number of judges, your average rating system could become statistically meaningful. You don't need like 10^9 or something, just 20 or so would work.

Maybe give everyone a vote on every entry to remedy the low-n statistics.

If you can't get n to be 20 or 30, you may want to look up how certain scientific groups perform statistics on low-n (like n=9 or so) experiments. Biology? Neuroscience? Assign a p-value to every judgment?

or maybe you could do this...

Treat every vote like a Poisson distribution about some mean value (5/10=0.5, say. i think the average of all votes would be a good value for this). Treat every vote as a random event with expectation value of 0.5 (or whatever you decide to use as your mean value. but remember to normalize, so, a vote of 8 would be 0.8). Then, take the probability that your vote is not a mean (P(no-mean) = 1 - Sum{value of the Poisson distribution of each judgement, centered around your mean value}). The cake with the lowest value of P(no-mean) is the most significantly different from the mean, and therefore is either a clear winner or a clear loser. Check the average value to find out which is which.

This works better if you tell your judges to use all 10 points of their 10-point scale.
posted by chicago2penn at 8:59 AM on December 15, 2009

Maybe it's because I'm an elections nerd, but this is a great use of ranked choice voting. So instead of trying to find a way in which you rate to some objective scale, everything is ranked relative to the others.
posted by advicepig at 9:02 AM on December 15, 2009 [2 favorites]

Whoops, should've previewed. I think rescaling each person's votes so that they cover the full range should work well enough. That is, if P's votes go from 2-8, he has a range of 6, so multiply by 9/6 = 1.5 to expand up to the full range, then subtract 2 to recenter.
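
Here is a rough sketch of that rescaling, assuming each judge's votes are stored as a dict from entry name to score. The function name and the handling of a judge who gives every entry the same score are my own assumptions.

```python
def rescale_votes(votes, low=1.0, high=10.0):
    """Stretch one judge's votes linearly so they span [low, high]."""
    lo, hi = min(votes.values()), max(votes.values())
    if hi == lo:
        # A judge who gave every entry the same score carries no ranking
        # information; mapping them to the midpoint is an assumption.
        mid = (low + high) / 2
        return {entry: mid for entry in votes}
    scale = (high - low) / (hi - lo)
    return {entry: low + (v - lo) * scale for entry, v in votes.items()}

# P's votes run from 2 to 8, so the scale factor is 9/6 = 1.5.
p_votes = {"pie A": 2, "cake B": 5, "cupcake C": 8}
rescaled = rescale_votes(p_votes)
```

For P this gives the same numbers as "multiply by 1.5, then subtract 2": the 2 becomes 1, the 5 becomes 5.5, and the 8 becomes 10.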

Some people are going to have mostly high votes and just a couple low ones, etc. But you don't want to correct too much for that, since they may well have tasted only the best items. So I'd probably just forget about that particular quirk and stick to the rescaling.

I don't think you can do anything about uneven numbers of votes, though. If not enough people sample an item, there just plain isn't enough data to effectively compare it to the others.
posted by equalpants at 9:03 AM on December 15, 2009

A group of friends and I have an annual pie competition. We have tried out many different methods but eventually fell back to weighted categories. Each category has a weight out of 100 and each participant assigns a score out of that weight. In the end, everyone will have assigned a score out of 100 for each pie. As you can imagine, arguing over the system is half the fun of the competition in the first place. So for example,

Crust (20):
Presentation on Plate (15)
Presentation in pie dish (15)
Filling taste: (30)
Seasonality/Originality (20).
TOTAL Possible points: 100

You can use TomMelee's list to pick from for your list. Then what you do with the scores is important. We just assign a rank from each participant to each pie. Then use an Olympic Gold (3), Silver (2), Bronze (1) system to assign the points that count to each pie. The advantage of this over just summing the total of each person's score for each pie is that some people give all pies marks in the 90s while others have wide variances.

The pie that wins the most medal points wins. If you don't want to do the medal round, just add up the ranks of each pie from each participant and the lowest score wins. This is the golf version. However it allows a middling pie with all 3rd place finishes to beat pies that come in 8th on a couple score cards while first in many.
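
Both tallies can be sketched in a few lines, assuming every participant has scored every pie out of 100. The scorecards below are invented, and breaking ties alphabetically is an assumption, not something our group actually does.

```python
from collections import defaultdict

MEDAL_POINTS = {1: 3, 2: 2, 3: 1}  # gold, silver, bronze

def ranks_for_judge(scores):
    """Rank one judge's pies, 1 = best. Ties are broken by name (an assumption)."""
    ordered = sorted(scores, key=lambda pie: (-scores[pie], pie))
    return {pie: place for place, pie in enumerate(ordered, start=1)}

def medal_totals(scorecards):
    """Medal version: highest total wins."""
    totals = defaultdict(int)
    for scores in scorecards.values():
        for pie, place in ranks_for_judge(scores).items():
            totals[pie] += MEDAL_POINTS.get(place, 0)
    return dict(totals)

def rank_sums(scorecards):
    """Golf version: lowest total wins."""
    totals = defaultdict(int)
    for scores in scorecards.values():
        for pie, place in ranks_for_judge(scores).items():
            totals[pie] += place
    return dict(totals)

# Made-up scorecards: ann marks everything in the 90s, bob spreads widely.
scorecards = {
    "ann": {"apple": 95, "pecan": 92, "cherry": 90, "lime": 88},
    "bob": {"apple": 40, "pecan": 75, "cherry": 60, "lime": 20},
    "cat": {"apple": 80, "pecan": 65, "cherry": 70, "lime": 50},
}
```

Because only the order on each scorecard matters, ann's narrow 88-95 range counts exactly as much as bob's 20-75 range.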

If you really want to make everyone's life difficult, give each person a choice of their top 5 criteria from a super list and then allow them to submit their personal weightings a day before the contest, when they haven't seen the pies. Then print out their personal score card.

If you want some academic foundation to base your pie judging on, search for Multi-Criteria Decision Making (MCDM). While it is a wide field, its most basic form is basically weighted categories as described above.

A completely insane method you could employ is pair-wise comparisons. Look it up! A participant chooses one baked good over another in a "pair-wise comparison" and then moves to a new comparison. Eventually a winner emerges. It is useful because people can't keep many criteria and comparisons in their mind at once but they can say which of two items they prefer. The issue with this is that it is pretty insane to organize for a large group. However, survey companies have software for this so it may fall into something you like. If you don't have too many items to compare, it is good because you get to eat loads. There are some theories about how to combine pair-wise comparisons from many people so that not everyone has to do every single pair-wise comparison. For some reason I feel this has parallels to a bubble sort, which may please your software-centric audience.

Warning: Only about 1/3 of people find the haggling over the method interesting. The 2/3 people will think you are an idiot and just want to eat pie and don't care about who wins. So the more insane the method, the stronger your leadership needs to be. Otherwise, the whole system falls apart with people ignoring it.
posted by FastGorilla at 9:08 AM on December 15, 2009

Response by poster: 2/3 people will think you are an idiot and just want to eat pie

Oh, my coworkers have already established this. But did Galileo remain silent when they told him to shut up and pass the pie? These are some great suggestions everyone, thanks!
posted by sonofslim at 9:58 AM on December 15, 2009

My roommate did something like this for multiple-judge scoring of a Rock Band 2 competition on style. The third post in this forum thread has a decent method of normalizing scores from multiple judges with subjective criteria. The basic concept is that, for each judge, you calculate the mean and the standard deviation of all the scores given (which Excel does easily). Then for a given score, you calculate ([individual score]-[judge's mean])/[judge's standard deviation]. See the Wikipedia entry on Standard Score for more on the formula.

At that point you will have normalized votes from each judge such that the mean of the judge's votes is 0 and the standard deviation of the judge's votes is 1. Those votes can then be meaningfully compared with each other across judges for a given baked good. This method should be resilient even with judges scoring different numbers of goods, although a judge who only scores one or two will still probably have some disproportionate effect.
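
A small sketch of the calculation, assuming each judge's votes are a dict from entry to raw score. I use the population standard deviation here so that each judge's standardized votes come out with a standard deviation of exactly 1; a spreadsheet's default may be the sample version instead. The judges and scores are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize(votes):
    """Convert one judge's raw scores to standard scores (z-scores)."""
    values = list(votes.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A judge with identical scores has no spread; treating every vote
        # as exactly average is an assumption.
        return {entry: 0.0 for entry in votes}
    return {entry: (v - mu) / sigma for entry, v in votes.items()}

def average_standard_scores(ballots):
    """Average each entry's z-scores over the judges who scored it."""
    collected = defaultdict(list)
    for votes in ballots.values():
        for entry, z in standardize(votes).items():
            collected[entry].append(z)
    return {entry: mean(zs) for entry, zs in collected.items()}

# A generous judge and a cranky judge who agree on the ordering.
ballots = {
    "generous": {"pie A": 8, "cake B": 9, "cupcake C": 10},
    "cranky": {"pie A": 1, "cake B": 3, "cupcake C": 5},
}
result = average_standard_scores(ballots)
```

In this toy data the two judges end up with identical standard scores, even though their raw numbers never overlap.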

It's been a while since I've had any statistics, so I welcome corrections if I did something wrong here.
posted by Partial Law at 10:08 AM on December 15, 2009

Metacomment: there is, provably, no such thing as an objective ranking system for matters of taste like this. See Arrow's impossibility theorem. Likewise, there is no ranking system such that a sneaky so-and-so couldn't lie on their ballot to try to make someone win or lose. See the Gibbard-Satterthwaite theorem.

Real comment: These all seem needlessly complicated to me. I would just do a set of Borda counts.

Each voter gets 4 "ballots": taste, presentation, creativity, overall.

Each voter ranks the top 5 pies (or 10, or whatever an appropriate number would be given the size of your contest).

On each ballot, the top-ranked gets five (or N) points, the second ranked gets 4, etc. Pies that are not ranked on the ballot get no points from that voter. The winner in each category is just the pie with the most points. Break ties with coin flips.

If you want, ditch the overall category and the overall winner is the pie with the most points overall.
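
As a sketch, one category's tally could look like this, assuming each ballot is a list of entries from best to worst. The ballots are invented, and the coin-flip tie-break is left out.

```python
from collections import Counter

def borda_count(ballots, top_n=5):
    """Tally ballots where each ballot lists entries from best to worst.

    The top-ranked entry gets top_n points, the next gets top_n - 1, and so
    on; entries a voter leaves off get nothing from that voter.
    """
    points = Counter()
    for ranking in ballots:
        for position, entry in enumerate(ranking[:top_n]):
            points[entry] += top_n - position
    return points

# Made-up taste ballots, ranking up to 3 entries each.
taste_ballots = [
    ["apple pie", "muffin", "pancakes"],
    ["muffin", "apple pie"],
    ["apple pie", "pancakes", "muffin"],
]
taste_points = borda_count(taste_ballots, top_n=3)
```

Run it once per ballot type; if you ditch the overall ballot, add the three category Counters together instead.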
posted by ROU_Xenophobe at 10:13 AM on December 15, 2009

Strongly seconding ranked choice (or instant runoff) voting!

Have each voter list their favorites for each category, in order of preference. They can list as few or as many as they like, as long as each item is one they could support for victory. The results are then compared and the winner is the entrant that received the greatest number of top spots. The overall winner can be determined either by having a separate voting category for "overall excellence", or by combining the top spots in each of the three individual categories.

Instant runoff voting is less vulnerable to political maneuvering and election tactics than many other methods since you are not lessening your influence by casting a top spot vote for an unlikely winner. Normally, if you know that the Garlic and Blue Cheese Muffin has only the slimmest chance of victory against Blueberry Pancakes or Apple Pie, you'd be inclined to give your vote to a popular favorite you find at least somewhat acceptable. Otherwise your vote is lost, or in the case of Borda count, has less weight behind it. In instant runoff voting, each spot on the list matters, and if it so happens that the Muffin doesn't get enough top spots, your other choices still have full effect in determining the victor.
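
A minimal sketch of the usual instant runoff procedure for one category, assuming each ballot is a list from most to least preferred and that voters may list as few entries as they like. Breaking elimination ties alphabetically is my assumption; the ballots are invented.

```python
from collections import Counter

def instant_runoff(ballots):
    """Return the IRV winner, or None if there are no ballots."""
    remaining = {entry for ballot in ballots for entry in ballot}
    while remaining:
        # Count each ballot for its highest-ranked entry still in the race.
        tally = Counter({entry: 0 for entry in remaining})
        for ballot in ballots:
            for entry in ballot:
                if entry in remaining:
                    tally[entry] += 1
                    break
        active = sum(tally.values())
        leader, leader_votes = max(tally.items(), key=lambda kv: (kv[1], kv[0]))
        # A majority of the ballots still counting wins outright.
        if leader_votes * 2 > active or len(remaining) == 1:
            return leader
        # Otherwise drop the entry with the fewest top spots and recount
        # (ties broken by name, which is an assumption).
        loser = min(tally.items(), key=lambda kv: (kv[1], kv[0]))[0]
        remaining.discard(loser)
    return None

# The muffin trails, so its voters' second choices decide the contest.
ballots = [
    ["muffin", "apple pie"],
    ["muffin", "apple pie"],
    ["apple pie", "pancakes"],
    ["apple pie", "pancakes"],
    ["apple pie"],
    ["pancakes", "apple pie"],
    ["pancakes", "muffin"],
    ["pancakes"],
    ["pancakes"],
]
winner = instant_runoff(ballots)
```

In the first round pancakes lead 4 to 3 to 2, nobody has a majority, the muffin is eliminated, and its two ballots transfer to apple pie, which then wins 5 to 4.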
posted by Orchestra at 11:21 AM on December 15, 2009

Pooh-pooh to those suggesting IRV or ranked voting or such.

Range voting, which is what the asker already uses, has been shown to be superior to all other major voting methods. It makes sense; the more information people are allowed to provide about their preferences, the better an intelligent voting system can provide a true winner.

Splitting into subcategories is disadvantageous in that people may have different weights they would assign to those categories, but you could add a "personal tilt" category to accommodate this to a degree.

If you wanted to apply some intelligence to the voting system, you could use the mathematical formula given in the article "How Not To Sort By Average Rating". What it's basically doing is taking the number of votes into account; a score of 9 with 1 vote is not necessarily better than a score of 8.8 with 10 votes. This won't matter much if all items receive the same number of votes or if all items receive a large number of votes, but it seems like that won't be the case for you.
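
As far as I recall, that article's formula is the lower bound of the Wilson score interval, and it is written for up/down votes rather than 1-10 scores. The sketch below adapts it by counting any vote at or above a threshold as "positive"; that threshold, and the function names, are my assumptions and not part of the article.

```python
from math import sqrt

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if total == 0:
        return 0.0
    phat = positive / total
    denom = 1 + z * z / total
    centre = phat + z * z / (2 * total)
    margin = z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - margin) / denom

def confident_score(votes, threshold=8):
    """Treat votes >= threshold as positive (an assumption), then apply Wilson."""
    positive = sum(1 for v in votes if v >= threshold)
    return wilson_lower_bound(positive, len(votes))

one_vote = [9]                               # average 9.0 from 1 vote
ten_votes = [9, 9, 9, 9, 9, 9, 9, 9, 9, 7]   # average 8.8 from 10 votes
```

With these made-up votes the ten-vote entry scores higher than the single-vote entry, because one vote leaves a lot of uncertainty about how good the entry really is.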
posted by Earl the Polliwog at 3:07 PM on December 15, 2009

If I understand correctly, range voting wouldn't prevent election tactics all that well. If the winner is simply the entrant with the highest average score, wouldn't I be inclined to give a full hundred points to my favorite, the Garlic and Blue Cheese Muffin, and a big fat zero to everyone else? I would have to assume that everyone else was pushing their candidates with equal ruthlessness, since if that were the case, being fair and balanced would work strongly against my interests.

As we all know, people are conniving, soulless monsters when it comes to bake-offs.
posted by Orchestra at 12:14 PM on December 17, 2009

The range voting page goes into that - it's all the cases marked "strategic". In the worst case, range voting degrades into approval voting, which is still better than IRV and the others under strategic voting. In fact, range voting with strategic voters is better than IRV with honest voters.
posted by Earl the Polliwog at 5:38 PM on December 17, 2009

This thread is closed to new comments.