Statistical Question
June 16, 2007 7:19 PM
(1) When doing an inter-rater reliability study, does Rater A always have to be paired with Rater B, and Rater C with Rater D, etc? Or can Rater A be paired with Rater B sometimes and Rater C other times?
(2) When trying to determine the appropriate sample size for this kind of study, is the N based on the total number of subjects rated, or on the number of subjects each pair of raters rates?
This is required for determining the number of subjects to be rated in a study that is currently being designed.
posted by kenberkun to science & nature (2 comments total)
However, there is a huge caveat here: I can't really give good direct advice because I do not know enough about the ratings you are comparing or the precise method you are using to generate your inter-rater reliability measure.
When people think about statistics, they tend to think very hard and carefully about the math involved in the test that they are doing. Unfortunately, this is a good way to miss the big picture. Performing a statistical test is like asking a very precise question, subject to a large number of agreed-upon assumptions. To make sense of the answer, you need to make sure that you are asking the exact question you think you are asking, and that the underlying assumptions of your question are for the most part valid. Also, once you get your answer, you need to understand how it relates to the distribution of all the answers you could have gotten in order to judge its significance.
It may be that the exact test you are trying to use assumes that all of the inter-rater pair difference scores are independent. If your test does not account for the same rater appearing in more than one pair, you may be violating that assumption (probably slightly, but more so if you have fewer pairs). Another test that explicitly allows you to pair the same rater against multiple other raters, and accounts for this repetition, might be better.
Just as an example, say you have four people grade the same 5 things apiece. You can compare the discrepancies of scores between A-B, A-C, A-D, B-C, B-D, and C-D, and have six measures.
Or, you could do the same thing with twelve people grading 5 things apiece. Your pairs would be A-B, C-D, E-F, G-H, I-J, and K-L, yielding six measures.
Both experiments apparently give six measurements. You have saved a lot of work in the first case, but the problem is that the first design uses the same rater more than once. If rater A gives horrible ratings, he will throw off half of the measurements in the first case (A-B, A-C, and A-D), but only one of the six measurements in the second case (A-B).
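To make the counting in that example concrete, here is a rough Python sketch. It only enumerates the rater pairs in the two designs and reports what share of them involve rater A. It is not a reliability statistic, and the helper name share_of_pairs_with is made up for this illustration.

```python
from itertools import combinations


def share_of_pairs_with(rater, pairs):
    """Fraction of rater pairs that include the given rater (hypothetical helper)."""
    involved = [p for p in pairs if rater in p]
    return len(involved) / len(pairs)


# Design 1: four raters, every rater paired with every other rater.
crossed_pairs = list(combinations("ABCD", 2))

# Design 2: twelve raters split into six pairs that share no raters.
raters = "ABCDEFGHIJKL"
disjoint_pairs = [(raters[i], raters[i + 1]) for i in range(0, len(raters), 2)]

print(len(crossed_pairs), share_of_pairs_with("A", crossed_pairs))
print(len(disjoint_pairs), share_of_pairs_with("A", disjoint_pairs))
```

Both designs produce six pairs, but rater A sits in half of the pairs in the first design and in one of six in the second, which is the sense in which the six measurements in the first design are not independent of one another.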
I hope this is a little helpful, and again, I can only give advice of limited usefulness without knowing more. Any more hints you can give?
posted by Maxwell_Smart at 8:50 PM on June 16, 2007