# Statistical Question
June 16, 2007 7:19 PM

(1) When doing an inter-rater reliability study, does Rater A always have to be paired with Rater B, and Rater C with Rater D, etc? Or can Rater A be paired with Rater B sometimes and Rater C other times? (2) When trying to determine the appropriate sample size for this kind of study, is the N based on the total number of subjects rated, or on the number of subjects each pair of raters rates?

This is needed to determine the number of subjects to be rated in a study that is currently being designed.
posted by kenberkun to Science & Nature (2 answers total)

In general, it does not seem to me that using different pairings of raters would lead to an improper conclusion.

However, there is a big caveat: I can't give good direct advice without knowing more about the ratings you are comparing and the precise method you are using to generate your inter-rater reliability measure.

It may be that the exact test you are using assumes all of the inter-rater pair difference scores are independent. If your statistical test does not account for reused raters, you may be violating its underlying assumptions (probably only slightly, but more so if you have fewer pairs). A test that explicitly allows the same rater to be paired against multiple other raters, and accounts for that repetition, might be better.

Just as an example, say you have four raters each grade the same 5 items. You can compare the discrepancies of scores between A-B, A-C, A-D, B-C, B-D, and C-D, and have six measures.

Or, you could do the same thing with twelve people grading 5 things apiece. Your pairs would be A-B, C-D, E-F, G-H, I-J, and K-L, yielding six measures.

Both experiments yield six measurements, and you have saved a lot of work in the first case. The problem is that the first design uses the same rater more than once: if rater A gives horrible ratings, he will throw off half of the measurements in the first case, but only 1/6 of the measurements in the second.
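To make the difference concrete, here is a small sketch of the first design. The ratings below are made up, with rater A deliberately biased high, just to show that A's bias sits inside half of the six pairwise discrepancy measures:

```python
import itertools

# Hypothetical data: 4 raters (A-D) each score the same 5 items on a 1-10 scale.
# Rater A is deliberately biased high.
ratings = {
    "A": [9, 10, 9, 10, 9],   # the "horrible" rater
    "B": [5, 6, 5, 4, 5],
    "C": [5, 5, 6, 5, 4],
    "D": [4, 5, 5, 6, 5],
}

def mean_discrepancy(r1, r2):
    """Mean absolute difference between two raters' scores on the same items."""
    return sum(abs(a - b) for a, b in zip(ratings[r1], ratings[r2])) / len(ratings[r1])

# First design: every pairing of the 4 raters -> 6 measures,
# but rater A appears in 3 of them (half).
all_pairs = list(itertools.combinations("ABCD", 2))
pairs_with_A = [p for p in all_pairs if "A" in p]

for p in all_pairs:
    print(p, mean_discrepancy(*p))
```

Running this, the three pairs containing A show much larger discrepancies than the three pairs that don't, so half the measures are dragged around by one rater. In the twelve-rater design, A would appear in only one of the six pairs.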

I hope this is a little helpful, and again, I can only give advice of limited usefulness without knowing more. Any more hints you can give?
posted by Maxwell_Smart at 8:50 PM on June 16, 2007 [1 favorite]

If this is at all serious, there seems to be a substantial literature on different approaches to measuring intercoder reliability under different circumstances, and you should consult that to see what methods best fit your circumstances. Googling "measuring intercoder reliability" turned up a stable of references.
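For orientation, one of the most common measures in that literature is Cohen's kappa, which corrects two raters' raw agreement for chance. A minimal sketch with made-up categorical codes (this may or may not be the right statistic for your design):

```python
from collections import Counter

def cohens_kappa(codes1, codes2):
    """Cohen's kappa: chance-corrected agreement between two raters' categorical codes."""
    n = len(codes1)
    # Observed proportion of items the two raters coded identically.
    observed = sum(a == b for a, b in zip(codes1, codes2)) / n
    # Agreement expected by chance from each rater's marginal code frequencies.
    c1, c2 = Counter(codes1), Counter(codes2)
    expected = sum(c1[k] * c2.get(k, 0) for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two raters on the same 8 items.
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(rater1, rater2))  # 0.5: observed 0.75, chance 0.5
```

Kappa is 1.0 for perfect agreement and 0.0 for agreement no better than chance; variants like Fleiss' kappa extend the idea to more than two raters.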

There are certainly methods to estimate the extent to which different people say the same thing in response to the same stimulus -- vote scaling is a prominent example.

But whether you can do it, and should do it, is a different question. What you want to do is whatever the standard is for your type of study. Don't go off inventing something new (unless the whole point of your study is to assess different methods of ICR measurement).
posted by ROU_Xenophobe at 11:51 PM on June 16, 2007 [1 favorite]