Is it a rabbit or is it a duck?
March 26, 2019 8:51 AM

Please help us interpret what happened in this poll: two groups, one with three people (set 1), one with one person (set 2). Each person was shown the same picture and had to decide if what they were seeing was a duck or a rabbit. Exercise was repeated with 100 duck/rabbit pictures. We ended up with four answers for each picture, but now we are struggling to figure out what the answers tell us.

Things that seem meaningful:

1. number of times people in set 1 agreed vs disagreed with each other
2. number of times each individual in set 1 agreed/ disagreed with the solo individual from set 2
3. number of times the mean of set 1 agreed/ disagreed with individual from set 2 (so, for example, if you get 2 ducks and 1 rabbit in set 1 and individual from set 2 says ‘duck’, that is an ‘agree’, but if individual from set 2 says ‘rabbit’, that is ‘disagree’ between set 1 and set 2)

Now we are trying to figure out how to best express what happened in this poll via percentages as well as words (note: FWIW, we are not the same as the 4 individuals who answered the duck/ rabbit question).

One of us insists that overall agreement between the two sets should be calculated by taking the number of times all three people in set 1 agreed with each other and seeing how often they also agreed with set 2; this would be our total agreement percentage. So, for example, if set 1 reached internal agreement 15 times out of a hundred, and in 12 of those cases there was also agreement with the set 2 individual, then the final calculation is 12 out of the 100 pictures, which in our case would give a 12% agreement rate between set 1 and set 2.

Some others insist that this gives a skewed view of what is going on, since it implies that there was perfect agreement between individuals in set 1 in all 100 cases, which is incorrect. The way to deal with this, they say, is to take the 15 cases in which people in set 1 agreed with each other as the new 100% (since this is the total number of times set 1 agreed internally), and calculate from there: 12 out of 15, for an 80% agreement rate.

Others say that you should count not only total agreement between all involved, but also partial agreement (like when two people in set 1 say duck and set 2 also says duck), but then someone else insists that that should be counted as ‘partial agreement’ and given half a point, and that we should also have a ‘partial disagreement’ when only one person from set 1 agrees with the set 2 individual (this, they say, should be counted as 0.25). Let’s say that if we allow for partial agreement and disagreement we get something like 49 points, aka 49% agreement.

Yet others say that what matters is the number of times individuals in set 1 agree with the individual in set 2 (so in this view the mean for set 1 would be meaningless, as would talk of partial agreement/ disagreement); so, for example, if we had 210 individual agreements out of 300 possibles, that gives us an agreement percentage of 70%.
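
For anyone who wants to see the competing calculations side by side, here is a minimal Python sketch. It assumes each picture is recorded as the three set 1 answers plus the set 2 answer, with "D" for duck and "R" for rabbit. The function name `agreement_rates`, the five sample rows, and the weights (1 point for three matches, 0.5 for two, 0.25 for one, 0 for none) are illustrative assumptions, not the poll's real data or an agreed scoring rule.

```python
from typing import Dict, List, Optional, Tuple

# One row per picture: (the three set 1 answers, the set 2 answer).
# "D" = duck, "R" = rabbit. The sample rows below are invented.
Row = Tuple[Tuple[str, str, str], str]

# Assumed scoring for the "partial agreement" approach, keyed by how many
# set 1 raters gave the same answer as the set 2 rater.
PARTIAL_WEIGHTS = {3: 1.0, 2: 0.5, 1: 0.25, 0: 0.0}


def agreement_rates(rows: List[Row]) -> Dict[str, Optional[float]]:
    """Compute several candidate 'agreement rates' from the same answers."""
    if not rows:
        raise ValueError("need at least one picture")
    n = len(rows)
    unanimous = 0        # pictures where set 1 agreed internally
    unanimous_match = 0  # ...and set 2 gave the same answer
    majority_match = 0   # set 1 majority equals the set 2 answer
    weighted = 0.0       # running total of partial-agreement points
    pairwise = 0         # individual set 1 answers equal to set 2's

    for set1, set2 in rows:
        matches = sum(1 for answer in set1 if answer == set2)
        pairwise += matches
        weighted += PARTIAL_WEIGHTS[matches]
        if matches >= 2:
            majority_match += 1
        if len(set(set1)) == 1:
            unanimous += 1
            if matches == 3:
                unanimous_match += 1

    return {
        # all four agree, as a share of every picture (the "12%" style)
        "unanimous_of_all": unanimous_match / n,
        # all four agree, as a share of pictures where set 1 was unanimous
        "unanimous_of_unanimous": unanimous_match / unanimous if unanimous else None,
        # set 1 majority vs set 2
        "majority": majority_match / n,
        # partial-credit points per picture
        "weighted": weighted / n,
        # individual set 1 answers vs set 2, out of 3 per picture
        "pairwise": pairwise / (3 * n),
    }


sample_rows: List[Row] = [
    (("D", "D", "D"), "D"),
    (("D", "D", "R"), "D"),
    (("D", "R", "R"), "D"),
    (("R", "R", "R"), "D"),
    (("R", "R", "R"), "R"),
]
for name, value in agreement_rates(sample_rows).items():
    print(name, value)
```

With the numbers in the question (15 unanimous pictures, 12 of them matching set 2), `unanimous_of_all` would be 12/100 = 12% and `unanimous_of_unanimous` would be 12/15 = 80%. The two figures answer different questions rather than contradicting each other.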

Metafilter, if you’ve stuck with us so far, can you help us wrap our heads around this? Which approach is good? Do individual approaches illustrate different things? If we wanted to arrive at something that could go by the name of ‘agreement rate between set 1 and 2’, which approach would be best?

I realize that this is all too piddly for proper statistics, but any input from that perspective would, nonetheless, be appreciated.

Thanks a lot!
posted by doggod to Grab Bag (16 answers total)
What's the end goal of calculating the agreement rate? What's it going to be used for? And perhaps more importantly why can't you just have multiple agreement rates with different names which are calculated differently?
posted by XMLicious at 9:00 AM on March 26, 2019 [2 favorites]

What was your hypothesis entering into the experiment?
posted by humboldt32 at 9:03 AM on March 26, 2019 [3 favorites]

A couple of questions that might help people sort out what you're aiming for:

Why are set 1 and set 2 being differentiated such that determining whether they agree with each other is meaningful, versus considering the 4 individuals as individuals?

What are you trying to learn about in this study? Animal recognition skills among the subjects? Clarity of the pictures?
posted by jacquilynne at 9:19 AM on March 26, 2019 [1 favorite]

Humph, I was hoping to avoid this question.

The thing is that I think no one had a clear idea when they set this up what precisely they were after, since it is apparent that they hadn't even figured out, going in, what precisely constitutes 'agreement'. I think there was the hazy notion that you can get an agreement percentage between set 1 and 2 (there are some differences in background between the two sets) and that this percentage would tell you something important. But no one involved in this knows anything about this kind of research - as though that needs pointing out!

The best approximation of what they want is to get a percentage for agreement that can (sort of) be defended. One possible statement would be 'people from background 1 disagree more frequently amongst themselves when it comes to identifying duck vs rabbit than they disagree with person from background 2'. Or some such.

So maybe my question can be rephrased as 'What are true statements that can be made based on the info from the poll?'

Obviously, I don't know anything re. what to do with this info, either, hence why I'm asking Metafilter. For a full confession, I'll just say that I instinctively recoil from the '12%' interpretation, though I'm not sure that I have objective reasons to do so.
posted by doggod at 9:21 AM on March 26, 2019

jacquilynne, I hope that this doesn't amount to thread-sitting, but I want to answer because those are good points. I think this started as a picture clarity test and then it became apparent that there is quite a lot of subjectivity involved to the point that they decided to check if background was a factor.

Hence the two groups - as mentioned, individual 2 has a somewhat different background, so they wanted to see if this has an impact on recognition.

I'd rather not go into why they decided to only have one individual in set 2; I hope that the info provided is nonetheless enough for a few pointers.

Thanks for your answers so far!
posted by doggod at 9:27 AM on March 26, 2019

It seems like the best approach might be to write defenses for each of the different calculation methods and pick the one that furnishes the best-sounding defense. In fact I would go so far as to say that four out of five dentists agree you should do that.
posted by XMLicious at 9:35 AM on March 26, 2019 [1 favorite]

To ask: these really were people looking at ambiguous rabbit/duck pictures? It's not that they were really reading a passage of text and saying whether they thought the text was/wasn't [some quality], or looking at photos of people and stating their perceived [some quality], or some other thing? I ask because it's often the case that people who've been curious about some specific thing have come up with pre-existing ways to look at and deal with that thing and canned statistical software that makes life easy.

Picture clarity perspective: look at percent agreement for each picture. Compare each to the distribution of agreement you'd get if everyone was just flipping coins, which is a doesn't-fit-in-my-head probability problem.
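
The coin-flipping baseline is small enough to enumerate directly. The sketch below is only an illustration under the assumption that every rater answers independently and at random; the function name `chance_split_distribution` and the `p_duck` parameter are made up for this example.

```python
from math import comb
from typing import Dict


def chance_split_distribution(raters: int = 4, p_duck: float = 0.5) -> Dict[int, float]:
    """Probability of each majority size if every rater answers at random.

    The key is the size of the larger camp (4 means everyone agreed,
    2 means an even split among four raters).
    """
    dist: Dict[int, float] = {}
    for ducks in range(raters + 1):
        # Binomial probability of exactly `ducks` duck answers.
        prob = comb(raters, ducks) * p_duck**ducks * (1 - p_duck) ** (raters - ducks)
        majority = max(ducks, raters - ducks)
        dist[majority] = dist.get(majority, 0.0) + prob
    return dist


print(chance_split_distribution(4))
```

For four raters and a fair coin this gives 0.125 for a 4-0 result, 0.5 for a 3-1 split and 0.375 for a 2-2 split, so about 12.5 of 100 pictures would show all four raters agreeing by luck alone.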

Background: you can't say anything about the effect of background. No matter what differences you observe between Group 1 and Person 4, it's fundamentally impossible for you to know what about Person 4 triggered that difference. Maybe it was their different background. Maybe it was that they hate rabbits. Maybe it was an undiagnosed color blindness. Maybe they're Ralphie Wiggum and were running around in a circle saying "Duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck..."
posted by GCU Sweet and Full of Grace at 9:46 AM on March 26, 2019 [1 favorite]

I think your sample size, as described, is obviously ridiculously too small to draw any conclusions from. But I think given what you've said, what you need is not to generalize group 1 and then compare the generalized group 1 response to a generalized group 2 response, but to look at a sort of pair-wise agreement. Is any given member of group 1 more likely to agree with any given member of group 2 or with another member of group 1?
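
A rough Python sketch of that pair-wise comparison, assuming each rater's answers are stored as a sequence of "D"/"R" values in picture order. The names `pair_rate` and `within_vs_between` and the five-picture sample answers are invented for illustration.

```python
from itertools import combinations
from typing import Sequence, Tuple


def pair_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of pictures on which two raters gave the same answer."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must answer the same, non-empty set of pictures")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def within_vs_between(set1: Sequence[Sequence[str]], set2: Sequence[str]) -> Tuple[float, float]:
    """Mean agreement among set 1 pairs, and mean agreement of set 1 raters with set 2."""
    within = [pair_rate(a, b) for a, b in combinations(set1, 2)]
    between = [pair_rate(a, set2) for a in set1]
    return sum(within) / len(within), sum(between) / len(between)


# Invented answers for five pictures.
set1_answers = ["DDRRD", "DDRDD", "DRRRD"]
set2_answers = "DDRRR"
print(within_vs_between(set1_answers, set2_answers))
```

If the first number is not clearly higher than the second, the set 2 person is no more of an outlier than the set 1 people are to each other, at least in this sample.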
posted by jacquilynne at 9:52 AM on March 26, 2019 [1 favorite]

I'd consider tossing Set 2's data entirely if it turned out that, say, they were colorblind and you weren't intending to be testing colorblindness but were accidentally doing so. In other words, if poor experimental design led to this second set, I'd mention that the data is only relevant to whatever group Set 1 belongs to and you didn't have sufficient data to extrapolate further.

But if, say, Set 2 is an expert in the field of duck/rabbit hybrids and Set 1 are citizen scientists being trained to become so, I'd frame it in the context of Set 2's answers.

"When Set 2 ID's a duck, Set 1 unanimously agrees with Set 2 65% of the time, unanimously disagrees 5% of the time, and has a mixed result 30% of the time." (or whatever numbers they are.) You could split it out by %; "a majority of Set 1 agrees 82% of the time." But I think that's part of deciding the story you want to tell--is it the agreements or the disagreements that are interesting?
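
As a sketch of how that kind of statement could be tabulated, assuming each picture is stored as the three Set 1 answers plus the Set 2 answer ("D" or "R"); the function name `breakdown_given_set2` and the sample rows are hypothetical:

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[Tuple[str, str, str], str]  # (Set 1 answers, Set 2 answer)


def breakdown_given_set2(rows: List[Row], answer: str = "D") -> Optional[Dict[str, float]]:
    """How Set 1 responded on the pictures where Set 2 gave `answer`."""
    subset = [set1 for set1, set2 in rows if set2 == answer]
    if not subset:
        return None  # Set 2 never gave this answer
    n = len(subset)
    agree = sum(all(a == answer for a in s) for s in subset)
    disagree = sum(all(a != answer for a in s) for s in subset)
    majority = sum(sum(a == answer for a in s) >= 2 for s in subset)
    return {
        "unanimous_agree": agree / n,
        "unanimous_disagree": disagree / n,
        "mixed": (n - agree - disagree) / n,
        "majority_agree": majority / n,
    }


# Invented rows for illustration only.
rows: List[Row] = [
    (("D", "D", "D"), "D"),
    (("D", "D", "R"), "D"),
    (("D", "R", "R"), "D"),
    (("R", "R", "R"), "D"),
    (("R", "R", "R"), "R"),
]
print(breakdown_given_set2(rows, "D"))
```

Calling it once with "D" and once with "R" gives the two halves of the story, framed around Set 2's answers.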

>I'll just say that I instinctively recoil from the '12%' interpretation, though I'm not sure that I have objective reasons to do so.

It's because there is universal agreement 12% of the time, but that has nothing to do with Set 2 specifically. 85% of the time, someone in Set 1 is the problem. 80% of the time that Set 1 agrees, Set 2 votes in line with them.
posted by tchemgrrl at 9:53 AM on March 26, 2019 [1 favorite]

I think what you're looking for, statistically, is inter-rater reliability, specifically Cohen's/Fleiss' kappa. I do think it makes more sense to treat all 4 as individuals than have an unbalanced number in Set 1 and Set 2 (so, Fleiss rather than Cohen).
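
For reference, a minimal hand-rolled sketch of Fleiss' kappa in Python, following the standard definition (observed pairwise agreement per item, corrected for the agreement expected from the overall category proportions). The function name `fleiss_kappa`, the input layout (one sequence of answers per picture) and the sample ratings are assumptions for illustration; an established statistics package would be the safer choice for real analysis.

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(ratings: List[Sequence[str]]) -> float:
    """Fleiss' kappa for items each rated by the same number of raters."""
    if not ratings:
        raise ValueError("need at least one picture")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every picture needs the same number (>= 2) of answers")
    n_items = len(ratings)

    category_totals: Counter = Counter()
    per_item_agreement = []
    for answers in ratings:
        counts = Counter(answers)
        category_totals.update(counts)
        # Ordered pairs of raters who agree, out of all ordered pairs.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (observed - expected) / (1.0 - expected)


# Invented answers: four raters, four pictures.
print(fleiss_kappa(["DDDD", "RRRR", "DDRR", "DDRR"]))
```

A value near 1 means the raters agree far more than chance would predict, and a value near 0 means their agreement is about what the overall duck/rabbit proportions would produce by themselves.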
posted by basalganglia at 1:01 PM on March 26, 2019 [1 favorite]

Thanks a lot for your answers.

It becomes clear that if this is to be done in earnest, we need outside input from someone who knows what the heck they are doing. If it comes to that, who are we looking for? A statistician? A sociologist with knowledge of statistics? Is there a professional who could help here? Including with things such as firming up some research questions and concept definitions prior to running the poll test, so we don't just end up throwing things at the wall in the hope that something sticks.

Thanks again for any pointers re. this issue.
posted by doggod at 1:27 PM on March 26, 2019

A statistician would help define your analysis plan and sample size (both of which should always be defined before you start), but they may or may not be helpful in defining your actual research question. What is the knowledge gap you are trying to bridge?

The fact that you mention "background" (educational? ethnic? religious?) suggests either sociology or maybe anthropology would be helpful, but the picture clarity thing sounds more like either something with vision or with photo processing techniques. Define your question as a question, then find your collaborators.
posted by basalganglia at 5:25 PM on March 26, 2019

>It becomes clear that if this is to be done in earnest, we need outside input from someone who knows what the heck they are doing

This is kind of why PhDs exist - so you can learn how to carry out research projects properly under the tutelage of somebody who has done it before. Not trying to put you off (there’s a long history of self-taught clinician researchers in medicine, for example). But yes, the learning curve is going to be pretty steep if a) this really is rabbits and not “correct diagnosis rates in radiology trainees” or something, and b) you have no research background or domain knowledge whatsoever yourself.

The stats are not particularly hard - you probably wouldn’t need a professional statistician, assuming you have a reasonable level of stats literacy. It depends on what exactly you want to do with the results - if you want to publish this in a peer-reviewed journal, running things past a statistician will ensure no major flaws that a reviewer will pick up on. If this is just for fun, or for something low-stakes like a student project, you can probably do the stats yourself.

Finally, did you get/will you need ethics approval for this poll? And if you are collecting and storing background data on the participants, did you get written consent for this and are you storing the data appropriately? If you want to do anything with the data (like publish) you will need to demonstrate the study was carried out properly.
posted by tinkletown at 6:59 PM on March 26, 2019 [1 favorite]

This 538 article, "Which Justices Were BFFs This Supreme Court Term" (2018), has a 9-by-9 chart showing the agreement percentages between all pairs of Justices, with color as a visual aid. It's a pretty well-designed graphic IMO -- it compactly reports a lot of numbers, but you can still draw quick conclusions from it about ideological clusters on the court. You could make something like that with your four individuals' pairwise agreement rates.

However, GCU Sweet and Full of Grace is 100% correct that your data can't tell you what about the fourth person accounts for any differences in their perceptions. Not to be discouraging, but the answer to

>What are true statements that can be made based on the info from the poll?

is "statements reporting the raw results of the poll". Any broader inferences/generalizations are not true statements but at best probable ones, and someone with literacy in stats can help you design a new poll that is potentially able to support such statements (the one you described almost certainly isn't).
posted by aws17576 at 9:52 PM on March 26, 2019 [3 favorites]

aws17576, that's a cool chart. I think the clustering is definitely a thing, and I bet doggod would see something similar in the duck/rabbit poll, but what that chart doesn't quite highlight enough is that the Supremes are more often in agreement with each other than in disagreement. The lowest percentage agreement on that chart is 51% agreement between Sotomayor and Alito, who are ideologically very distinct but who vote together as often as they vote apart. All the other pairs vote together substantially more often than they vote apart. That's ... curious.

Anyway, this isn't the US politics thread, but I wanted to point it out as an example of the way stats and graphics can be used to illuminate but also to obscure. Lies and damned lies, after all.
posted by basalganglia at 2:34 PM on March 27, 2019

If you gave me your data and asked me to turn it into a chart, here’s what I would do:

1. Create an X axis, with “duck” on the left side and “rabbit” on the right.

2. Place five points along the axis, “100%/0%,” “75%/25%,” “50%/50%,” “25%/75%,” and “0%/100%.” These represent how many of the four people agreed it was a duck (first number) or a rabbit (second number).

3. Sort the 100 pictures among the five points. So if, say, there were 50 pictures that everyone agreed were ducks, I would put 50 icons of a duck/rabbit in the 100/0 column. If there were 37 pictures where 25% of people thought it was a duck and 75% thought it was a rabbit, I’d put 37 icons in the 25/75 column. Et cetera.

4. Then, for each of the 100 icons representing the 100 photos, I’d color the icon red if the person from Set 2 thought it was a duck. So all of the icons in the first column would be red, none of the icons in the last column would be red, and the three columns in between would presumably be a mix of red and not-red.
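
The four steps above can be sketched in Python as a simple tally, assuming each picture is stored as the three Set 1 answers plus the Set 2 answer ("D" or "R"). The function name `chart_columns`, the sample rows, and the text rendering (`#` for a red icon, `o` for a not-red one) are invented stand-ins for a real chart.

```python
from collections import Counter
from typing import List, Tuple

Row = Tuple[Tuple[str, str, str], str]  # (Set 1 answers, Set 2 answer)


def chart_columns(rows: List[Row]) -> List[Tuple[int, int, int]]:
    """Return (duck votes out of 4, pictures in column, red pictures) per column.

    Columns run from 4 duck votes (100%/0%) down to 0 duck votes (0%/100%).
    A picture is "red" when the Set 2 person called it a duck.
    """
    total: Counter = Counter()
    red: Counter = Counter()
    for set1, set2 in rows:
        ducks = sum(a == "D" for a in set1) + (set2 == "D")
        total[ducks] += 1
        if set2 == "D":
            red[ducks] += 1
    return [(ducks, total[ducks], red[ducks]) for ducks in range(4, -1, -1)]


# Invented rows for illustration only.
rows: List[Row] = [
    (("D", "D", "D"), "D"),
    (("D", "D", "R"), "D"),
    (("D", "R", "R"), "D"),
    (("R", "R", "R"), "D"),
    (("R", "R", "R"), "R"),
]
for ducks, count, red_count in chart_columns(rows):
    label = f"{ducks * 25}%/{100 - ducks * 25}%"
    print(f"{label:>9} " + "#" * red_count + "o" * (count - red_count))
```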

What would this chart show you?

A. Generally how ambiguous your collection of pictures is. If it looks like a bell curve, with very few icons in the 100% duck or 100% rabbit columns, then you have a collection of ambiguous images. If it’s a valley, with high points on the left and right and low points in the middle, then a lot of your images were very duck-like or rabbit-like and few were ambiguous. If the chart is pretty even, then it looks like a uniform distribution.

B. It would show how much the person from Set 2 is in agreement with the people from set 1 as to what’s a duck or rabbit. If most of the red is to the left of the three inner points, he or she is in agreement with Set 1 as to what’s a duck. If most of the red is to the right, they’re at odds with Set 1.

I don’t think this chart would be terribly informative, but I think that’s the best information that can be squeezed from the data.
posted by ejs at 5:49 AM on March 29, 2019
