Statistics about differently-lengthed linear rankings
May 11, 2021 8:31 AM

Suppose I have a list of people's personal rankings of fruits, from best to worst. People skip the fruit they are unfamiliar with, so different rankings may be different lengths. What interesting data can I generate from my list?

Some example rankings, from four different people with different opinions from one another:
  • Banana > Cherry (read: this person thinks bananas are better than cherries, but this person included only these two fruit in their ranking, no others)
  • Grapefruit > Mango > Cherry (read: grapefruits are better than mangoes, and both are better than cherries)
  • Apple > Cherry > Banana > Grapefruit > Mango
  • Pitaya > Breadfruit
People are not allowed to write in their own fruit: I have full access to the set of all possible fruit they are allowed to include in their rankings. There is no limit on how many different rankings may exist (the four above are just an example), though at any time I know how many rankings there are. A given fruit may appear no more than once in a ranking: "apple > banana > apple" is invalid.

Bonus question: do I stop being able to generate any data if people are allowed to rank two or more fruit equally, e.g. "apple > [mango = guava] > banana" or even "[cherry = breadfruit]"?
posted by one for the books to Grab Bag (7 answers total) 3 users marked this as a favorite
Well, you could pick the Best Fruit in a number of ways, but they are all flawed in some manner: (skip the first paragraph unless you like math jargon, the second bit is much better)
Even if the Best Fruit is not the stat you are interested in, this is a good start at understanding the types of contradictions you might run into.

I think there are lots of interesting stats you might potentially have, but are not guaranteed to have for any given dataset. E.g.: nobody likes cherries best; there seems to be a divide in the types of fruit that people are familiar with (3 US grocery-store people, 1 "exotics" person); nobody likes mangoes better than grapefruit; cherries are the most familiar fruit. You can come up with datasets where none of these can be stated for any one fruit, but you can usually find at least a few along these lines.
posted by february at 8:56 AM on May 11, 2021

Best answer: A two-axis grid of pair-wise ratios for all possible pairs, with a color scale for the ratio, is an easy visual representation. A third dimension, perhaps represented as a second 2D plot, showing the significance of each result would be nice. (You'll need to make some assumptions about the distribution. If more detailed info on how one might go about doing that would be useful, do ask.)
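
As a rough sketch of the numbers behind such a grid (plain Python, using the four example rankings from the question; the names `wins`, `preference_ratio` and `grid` are made up for illustration), each cell holds the fraction of responses containing both fruits that put the row fruit above the column fruit:

```python
from collections import Counter
from itertools import combinations

# The four example rankings from the question, best first.
rankings = [
    ["Banana", "Cherry"],
    ["Grapefruit", "Mango", "Cherry"],
    ["Apple", "Cherry", "Banana", "Grapefruit", "Mango"],
    ["Pitaya", "Breadfruit"],
]

# wins[(a, b)] = number of rankings that place a above b.
wins = Counter()
for ranking in rankings:
    # combinations() keeps list order, so each pair is (better, worse).
    for better, worse in combinations(ranking, 2):
        wins[(better, worse)] += 1

def preference_ratio(a, b):
    """Fraction of rankings containing both a and b that put a first.

    Returns None when no ranking contains both fruits.
    """
    total = wins[(a, b)] + wins[(b, a)]
    return wins[(a, b)] / total if total else None

fruits = sorted({fruit for ranking in rankings for fruit in ranking})
grid = {a: {b: preference_ratio(a, b) for b in fruits if b != a} for a in fruits}

print(grid["Banana"]["Cherry"])  # 0.5
```

Cells that come out as None are pairs nobody ranked together, which is worth showing as its own color rather than as 0.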

Looking for correlations among larger groups could be fun. E.g., do people who rank apples above cherries also rank cherries above bananas in a statistically significant way? One could automate that test without too much work: take every set of three and find all the responses that have all three, then count how many rank them in each of the six possible orders. You'll need to make some assumptions about the distribution to get a confidence value, then look for the ordered sets that exceed some criterion. This is a pain in the neck to do by hand, but not that hard to automate with some simple programming.
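
A minimal sketch of the counting step only (no significance test), assuming each response is stored as a Python list from best to worst; `triple_order_counts` is a made-up helper name:

```python
from collections import Counter
from itertools import combinations

rankings = [
    ["Banana", "Cherry"],
    ["Grapefruit", "Mango", "Cherry"],
    ["Apple", "Cherry", "Banana", "Grapefruit", "Mango"],
    ["Pitaya", "Breadfruit"],
]

def triple_order_counts(rankings):
    """Map each alphabetically sorted triple of fruits to a Counter of the
    orders (best to worst) in which responses actually ranked that triple."""
    counts = {}
    for ranking in rankings:
        # combinations() preserves ranking order, so `ordered` is best-to-worst.
        for ordered in combinations(ranking, 3):
            key = tuple(sorted(ordered))
            counts.setdefault(key, Counter())[ordered] += 1
    return counts

counts = triple_order_counts(rankings)
for order, n in counts[("Cherry", "Grapefruit", "Mango")].items():
    print(" > ".join(order), n)
```

With the example data, the Cherry/Grapefruit/Mango triple shows up twice, in two different orders. Rankings shorter than three fruits contribute nothing here.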

(Don't forget to take into account the *number* of tests you're trying when calculating significance. One test with a 95% confidence is very different from twenty tests in which one yields a single-test 95% confidence. If you're not going to repeat the experiment, it's important to take into account how many tests you're conducting. That is, assuming quantitatively accurate results are important. If it's just for fun and nobody cares about the actual result, it may be less important.)
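
To put rough numbers on that, here is a small sketch. It assumes the tests are independent, which pairwise fruit comparisons drawn from the same responses usually are not, and it uses the Bonferroni correction as one common (conservative) fix, not necessarily the best one for this data:

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent
    tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall false-positive
    rate at or below alpha (Bonferroni correction)."""
    return alpha / n_tests

# Twenty tests, each at "95% confidence" (alpha = 0.05):
print(round(familywise_error_rate(0.05, 20), 2))  # 0.64
print(bonferroni_threshold(0.05, 20))
```

So with twenty tests there is roughly a 64% chance that at least one looks significant by luck alone, and each individual test would need to clear about p < 0.0025 instead of p < 0.05.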

Regarding the bonus question, I'd argue that you can generate three sets of useful data from that response: apple>mango>banana, apple>guava>banana, mango = guava. And you can use them all as data. (Don't double-count the apple>banana result, though! It's only one data point.)
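
One way to sketch that, assuming a tied ranking is stored as a list of tiers (each tier a list of fruits ranked equally); the representation and the name `pairs_with_ties` are my own invention:

```python
from itertools import combinations

def pairs_with_ties(tiers):
    """Split a ranking with ties into strict pairs and tied pairs.

    tiers: list of lists, best tier first, e.g.
           [["apple"], ["mango", "guava"], ["banana"]]
    Returns (strict, ties): strict holds (better, worse) pairs,
    ties holds alphabetically ordered pairs of fruits ranked equal.
    Sets are used so that no pair is counted twice.
    """
    strict, ties = set(), set()
    for tier in tiers:
        for a, b in combinations(sorted(tier), 2):
            ties.add((a, b))
    # Every fruit in an earlier tier beats every fruit in a later tier.
    for upper, lower in combinations(tiers, 2):
        for better in upper:
            for worse in lower:
                strict.add((better, worse))
    return strict, ties

strict, ties = pairs_with_ties([["apple"], ["mango", "guava"], ["banana"]])
print(sorted(strict))
print(sorted(ties))  # [('guava', 'mango')]
```

The apple > banana pair comes out exactly once, and a response like "[cherry = breadfruit]" still yields one tie data point.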
posted by eotvos at 9:04 AM on May 11, 2021

I've always been a fan of looking at the opposite of your data.

Looking at the absences in the data might be illustrative, or at least identifying what's *not* a favorite. What fruits are folks most unfamiliar with? What is regularly not being chosen as favorite but is still present in folks' lists?

In certain situations the best fit may be the item that shows up the most but is not anyone's favorite, or that including fruits that everyone is unfamiliar with may bring fewer preconceived notions.
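
A quick sketch of those absence counts. The `all_fruits` set is an assumption: the question says the full allowed set is known but never lists it, so I've used the fruits mentioned in the thread.

```python
from collections import Counter

# Assumed allowed set (not given in the question): fruits named in the thread.
all_fruits = {"Apple", "Banana", "Breadfruit", "Cherry",
              "Grapefruit", "Guava", "Mango", "Pitaya"}

rankings = [
    ["Banana", "Cherry"],
    ["Grapefruit", "Mango", "Cherry"],
    ["Apple", "Cherry", "Banana", "Grapefruit", "Mango"],
    ["Pitaya", "Breadfruit"],
]

appearances = Counter(fruit for ranking in rankings for fruit in ranking)
favorites = Counter(ranking[0] for ranking in rankings)

# How many people left each fruit out entirely (a proxy for unfamiliarity).
omitted = {fruit: len(rankings) - appearances[fruit] for fruit in all_fruits}

# Fruits that show up in lists but are never anyone's top pick.
never_favorite = sorted(f for f in appearances if favorites[f] == 0)

print(omitted["Guava"])  # 4
print(never_favorite)    # ['Breadfruit', 'Cherry', 'Mango']
```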
posted by justnathan at 9:57 AM on May 11, 2021 [2 favorites]

The first step I would suggest is to denormalise the data and populate a database, so a single line like
Apple > Cherry > Banana > Grapefruit > Mango
will become:
Apple > Cherry
Cherry > Banana
Banana > Grapefruit
Grapefruit > Mango
Apple > Banana
Apple > Grapefruit
Apple > Mango
Cherry > Grapefruit
Cherry > Mango
Banana > Mango
That may seem like a lot of work, but store it in a database along with the person's name/id and you can start to perform all kinds of delicious queries, like: what proportion of people thought Banana > Cherry vs. what proportion thought Cherry > Banana, including people who stated that indirectly as e.g. Cherry > Apple > Banana?
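
A minimal sketch of that expansion using Python's built-in sqlite3 module and an in-memory database. The person ids 1 to 4, the table name `prefers` and its column names are all made up for illustration:

```python
import sqlite3
from itertools import combinations

# Hypothetical person ids mapped to the example rankings, best first.
rankings = {
    1: ["Banana", "Cherry"],
    2: ["Grapefruit", "Mango", "Cherry"],
    3: ["Apple", "Cherry", "Banana", "Grapefruit", "Mango"],
    4: ["Pitaya", "Breadfruit"],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prefers (person INTEGER, better TEXT, worse TEXT)")
for person, ranking in rankings.items():
    # One row per implied pair, direct or indirect.
    conn.executemany(
        "INSERT INTO prefers VALUES (?, ?, ?)",
        [(person, better, worse) for better, worse in combinations(ranking, 2)],
    )

def count_prefers(better, worse):
    """Number of people whose ranking puts `better` above `worse`."""
    row = conn.execute(
        "SELECT COUNT(*) FROM prefers WHERE better = ? AND worse = ?",
        (better, worse),
    ).fetchone()
    return row[0]

print(count_prefers("Banana", "Cherry"), count_prefers("Cherry", "Banana"))  # 1 1
```

A ranking of n fruits produces n(n-1)/2 rows, so the five-fruit list above becomes ten rows.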
Other queries to consider: what is the largest number of fruits in anyone's list?
For people with the longest list (so the greatest fruit-eating experience), what is their favourite?
How does that compare with people who have the shortest list?
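
For the list-length questions, something along these lines would do (a sketch in plain Python rather than SQL, with made-up variable names):

```python
rankings = [
    ["Banana", "Cherry"],
    ["Grapefruit", "Mango", "Cherry"],
    ["Apple", "Cherry", "Banana", "Grapefruit", "Mango"],
    ["Pitaya", "Breadfruit"],
]

longest = max(len(ranking) for ranking in rankings)
shortest = min(len(ranking) for ranking in rankings)

# Top pick of everyone whose list has the longest / shortest length.
favourites_of_longest = [r[0] for r in rankings if len(r) == longest]
favourites_of_shortest = [r[0] for r in rankings if len(r) == shortest]

print(longest, favourites_of_longest)    # 5 ['Apple']
print(shortest, favourites_of_shortest)  # 2 ['Banana', 'Pitaya']
```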
posted by Lanark at 9:57 AM on May 11, 2021 [3 favorites]

I'd focus on the people making the choices: a) Are the data reproducible? [Ask punters to rank the fruit twice, a week apart.] b) People's reactions to things depend on whether it is before or after lunch [as in the classic study of parole-board decisions, although there has been push-back on that finding]. Checking this out for food prefs would be deliciously self-referential.
posted by BobTheScientist at 10:07 AM on May 11, 2021

I would start by getting the big picture of the data. How many fruits are mentioned? How many times does each fruit appear in a compare? How many times is it preferred, and how many times is it not preferred? Go on to how many times is it #1, #2 or #3 in a sequence?
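
Those big-picture tallies might look something like this (a sketch with made-up names; "preferred" here means ranked above another fruit in the same list, counting indirect comparisons too):

```python
from collections import Counter, defaultdict

rankings = [
    ["Banana", "Cherry"],
    ["Grapefruit", "Mango", "Cherry"],
    ["Apple", "Cherry", "Banana", "Grapefruit", "Mango"],
    ["Pitaya", "Breadfruit"],
]

position_counts = defaultdict(Counter)  # fruit -> {position: how often}
preferred = Counter()      # comparisons the fruit wins
not_preferred = Counter()  # comparisons the fruit loses

for ranking in rankings:
    for position, fruit in enumerate(ranking, start=1):
        position_counts[fruit][position] += 1
        not_preferred[fruit] += position - 1         # fruits ranked above it
        preferred[fruit] += len(ranking) - position  # fruits ranked below it

print(len(position_counts))  # 7 distinct fruits mentioned
print(dict(position_counts["Cherry"]))
print(preferred["Cherry"], not_preferred["Cherry"])  # 3 4
```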
posted by SemiSalt at 12:46 PM on May 11, 2021

I would throw it into a graph (the network sort) program like Graphviz and let it lay things out. Then, assuming you have a fixed number of items in total, I'd spread each response's items out to cover the full range. E.g., if there are 5 items and one response is X>Y, that's X(0), Y(10); A>B>C is A(0), B(5), C(10). Run stats on those, just assuming that not ordering every possible element is unimportant to the raw A>B>C. Count the chains of orderings by length, and do the A>B, A>B>C, A>B>C>D ones on their own.
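
A sketch of that spreading-out step, assuming a 0 (best) to 10 (worst) scale as in the X>Y example. Putting a one-item response at the midpoint is my own assumption; the function and variable names are made up too.

```python
from collections import defaultdict
from statistics import mean

rankings = [
    ["Banana", "Cherry"],
    ["Grapefruit", "Mango", "Cherry"],
    ["Apple", "Cherry", "Banana", "Grapefruit", "Mango"],
    ["Pitaya", "Breadfruit"],
]

def scaled_scores(ranking, top=0.0, bottom=10.0):
    """Spread one ranking evenly over [top, bottom], best item at `top`."""
    if len(ranking) == 1:
        # Assumption: a lone item carries no ordering info, so use the midpoint.
        return {ranking[0]: (top + bottom) / 2}
    step = (bottom - top) / (len(ranking) - 1)
    return {fruit: top + i * step for i, fruit in enumerate(ranking)}

print(scaled_scores(["A", "B", "C"]))  # {'A': 0.0, 'B': 5.0, 'C': 10.0}

# Collect every score a fruit received, then average (lower = better liked).
scores = defaultdict(list)
for ranking in rankings:
    for fruit, score in scaled_scores(ranking).items():
        scores[fruit].append(score)
mean_score = {fruit: mean(values) for fruit, values in scores.items()}

print(mean_score["Cherry"])  # 7.5
```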

Run a simulation: generate every possible full ordering of the N items and test each one for compliance with the actual input. For each generated ordering, count how many real answers agree with it, where a partial answer like A>X>Y agrees if A, X and Y appear in that same relative order in the generated ordering.
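
A brute-force sketch of that idea, only practical for a handful of fruits since the number of full orderings grows factorially (7 fruits is 5,040 orderings, 12 is already about 479 million). The helper names are made up:

```python
from itertools import permutations

rankings = [
    ["Banana", "Cherry"],
    ["Grapefruit", "Mango", "Cherry"],
    ["Apple", "Cherry", "Banana", "Grapefruit", "Mango"],
    ["Pitaya", "Breadfruit"],
]

def agrees(ranking, position):
    """True if the partial ranking keeps its order within the full ordering.
    Checking adjacent pairs is enough because order is transitive."""
    return all(position[a] < position[b] for a, b in zip(ranking, ranking[1:]))

def best_full_orderings(rankings):
    """Return (score, orderings): the highest number of responses any full
    ordering agrees with, and every full ordering that reaches that score."""
    fruits = sorted({fruit for ranking in rankings for fruit in ranking})
    best_score, best = -1, []
    for order in permutations(fruits):
        position = {fruit: i for i, fruit in enumerate(order)}
        score = sum(agrees(ranking, position) for ranking in rankings)
        if score > best_score:
            best_score, best = score, [order]
        elif score == best_score:
            best.append(order)
    return best_score, best

score, orderings = best_full_orderings(rankings)
print(score, "of", len(rankings), "responses agree with the best orderings")
```

With the example data no full ordering can satisfy all four people, because the first and third responses disagree about Banana vs. Cherry.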
posted by zengargoyle at 7:38 PM on May 11, 2021
