Statistical Question
June 16, 2007 7:19 PM

(1) When doing an inter-rater reliability study, does Rater A always have to be paired with Rater B, and Rater C with Rater D, etc? Or can Rater A be paired with Rater B sometimes and Rater C other times? (2) When trying to determine the appropriate sample size for this kind of study, is the N based on the total number of subjects rated, or on the number of subjects each pair of raters rates?

This is required for determining the number of subjects to be rated in a study that is currently being designed.
posted by kenberkun to Science & Nature (2 answers total)
In general, it does not seem to me that using different pairings of raters would lead to an improper conclusion.

However, there is a huge caveat: I can't really give good direct advice because I don't know enough about the ratings you are comparing or the precise method you are using to generate your inter-rater reliability measure.

When people think about statistics, they tend to think very hard and carefully about the math involved in the test they are doing. Unfortunately, this is a good way to miss the big picture. Performing a statistical test is like asking a very precise question, subject to a large number of agreed-upon assumptions. To make sense of the answer, you need to make sure that you are asking the exact question you think you are asking, and that the underlying assumptions behind that question are for the most part valid. And when you get your answer, you need to understand how it relates to the distribution of all possible answers you might have gotten in order to judge its significance.

It may be that the exact test you are trying to use has as an assumption that all of the inter-rater pair difference scores are independent. If your statistical test does not consider this, you may be (probably slightly, but more so if you have fewer pairs) violating the underlying assumptions of the test. Maybe another test that explicitly allows you to pair the same rater against multiple other raters, and accounts for this repetition, would be better.
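To make the pairwise-measure idea concrete, here is a minimal, self-contained sketch of Cohen's kappa, one common agreement statistic for a single pair of raters. This may well not be the exact measure you are using, and the rating values are made up purely for illustration:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels on the same subjects."""
    n = len(a)
    # observed proportion of agreement
    po = sum(x == y for x, y in zip(a, b)) / n
    # expected chance agreement, from each rater's marginal label frequencies
    labels = set(a) | set(b)
    pe = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    return (po - pe) / (1 - pe)

rater_a = ["yes", "no", "yes", "yes", "no"]
rater_b = ["yes", "no", "no", "yes", "no"]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.615
```

Note that if you reuse rater A in several pairs (A-B, A-C, ...), the resulting kappas are not independent of one another, which is exactly the assumption issue described above.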

Just as an example, say you have four people each grade the same 5 things. You can compare the discrepancies in scores between A-B, A-C, A-D, B-C, B-D, and C-D, giving six measures.

Or, you could do the same thing with twelve people grading 5 things apiece. Your pairs would be A-B, C-D, E-F, G-H, I-J, and K-L, yielding six measures.

Both experiments apparently give six measurements, and you have saved a lot of work in the first case. The problem is that the first design uses the same rater more than once. If rater A gives horrible ratings, he will throw off half of the measurements in the first case, but only 1/6 of the measurements in the second.
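The two designs above can be enumerated directly to see how often a single rater gets reused (the rater names A through L are the hypothetical ones from the example):

```python
from itertools import combinations

# Design 1: four raters, all possible pairings
pairs_all = list(combinations("ABCD", 2))       # A-B, A-C, A-D, B-C, B-D, C-D
reused = [p for p in pairs_all if "A" in p]     # pairs that depend on rater A

# Design 2: twelve raters, disjoint pairings
pairs_disjoint = [("A", "B"), ("C", "D"), ("E", "F"),
                  ("G", "H"), ("I", "J"), ("K", "L")]
disjoint_with_a = [p for p in pairs_disjoint if "A" in p]

print(len(pairs_all), len(reused))                # 6 3 -> A affects half the measures
print(len(pairs_disjoint), len(disjoint_with_a))  # 6 1 -> A affects only one
```

Both designs yield six measures, but in the first, any one rater's quirks propagate into half of them.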

I hope this is a little helpful, and again, I can only give advice of limited usefulness without knowing more. Any more hints you can give?
posted by Maxwell_Smart at 8:50 PM on June 16, 2007 [1 favorite]

If this is at all serious, there seems to be a substantial literature on different approaches to measuring intercoder reliability under different circumstances, and you should consult that to see what methods best fit your circumstances. Googling "measuring intercoder reliability" turned up a stable of references.

There are certainly methods to estimate the extent to which different people say the same thing in response to the same stimulus -- vote scaling is a prominent example.

But whether you can do it, and should do it, is a different story. What you want to do is whatever the standard is for your type of study. Don't go off inventing something new (unless the whole point of your study is to assess different methods of ICR measurement).
posted by ROU_Xenophobe at 11:51 PM on June 16, 2007 [1 favorite]
