# How really really really reliable is this test?

November 26, 2007 2:48 PM

[statsfilter] How to elegantly measure test-retest reliability after multiple (i.e. 4) repetitions of the same test?

In order to examine the temporal stability of a neuropsychological test, we thought it would be wayyyy smart to give subjects said test four times over a period of 2 weeks. It was our opinion that it's a good test with little to no practice effects, so scores should be stable not just at test-retest but at test-retest-retest-retest. You get the idea. The problem is that the standard Spearman-Brown reliability coefficient is designed only for one-retest scenarios. From what I can gather from this page, I can use a two-way random ICC, treating the tests as different raters, and get a sense of the reliability over all 4 measurements. Anyone know if this is indeed the proper test to be using? Any examples of a published psychology paper doing test-retest reliability for more than 2 tests?
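[The two-way random ICC the poster describes can be computed directly from the ANOVA mean squares. A minimal numpy sketch, assuming the single-measures, absolute-agreement form (often written ICC(2,1)) and treating the 4 administrations as "raters"; the function name and layout are illustrative, not from the thread:]

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.

    scores: (n_subjects, k_tests) array, one column per test administration.
    """
    y = np.asarray(scores, dtype=float)
    n, k = y.shape
    grand = y.mean()
    row_means = y.mean(axis=1)   # per-subject means
    col_means = y.mean(axis=0)   # per-administration means

    # ANOVA mean squares
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)   # administrations
    ss_err = (np.sum((y - grand) ** 2)
              - k * np.sum((row_means - grand) ** 2)
              - n * np.sum((col_means - grand) ** 2))
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
```

[Perfectly reproducible scores give 1.0; systematic shifts between administrations (practice effects) pull the value down, because this form of the ICC penalizes absolute disagreement.]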


I think the ICC is certainly a good way to go for this type of reliability analysis. I can't think of any references off the top of my head for papers that report reliabilities this way, but they shouldn't be too hard to find. You want to be looking for "interrater reliability" and "multiple raters" as keywords in your search for supporting references, but really the approach you're suggesting isn't controversial at all (at least in my field).

Also if I remember correctly, Cohen's Kappa can be used with more than 2 raters -- so that might be worth looking into. All of the methods of calculating reliability should yield comparable results, unless there's something really funky with your data.

posted by nixxon at 6:31 PM on November 26, 2007


Simple regression R², or an ANOVA result.

Your idea is that the jth score for the ith person (y_ij) is equal to some underlying number specific to that person (a_i) plus an error (e_ij). The estimator for a_i is going to be the average over j of y_ij. You will then be looking at the ratio of the variance of the residuals from this model to the variance of all the data. The null hypothesis is that there is no a_i, i.e. the test is completely unreliable, which is the same as there being just one mean for everybody's tests.

If you don't like to think of it as a regression, it's the same as a one-way ANOVA with the hypothesis that the means are not equal. Super-duper power to detect that the means aren't the same is equivalent to the test being reliable (it tends very strongly to return a value close to the expected value every time).

posted by a robot made out of meat at 8:19 PM on November 26, 2007
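[The variance-ratio view above can be sketched in a few lines of numpy. This is my own illustration of the one-way random-effects version (subjects as the only factor), not code from the thread:]

```python
import numpy as np

def icc_one_way(scores):
    """One-way ICC: share of variance due to between-subject differences.

    scores: (n_subjects, k_tests) array, modeling y_ij = a_i + e_ij.
    """
    y = np.asarray(scores, dtype=float)
    n, k = y.shape
    grand = y.mean()
    subj_means = y.mean(axis=1)   # estimates of a_i

    # Between-subject and within-subject (residual) mean squares.
    ms_between = k * np.sum((subj_means - grand) ** 2) / (n - 1)
    ms_within = np.sum((y - subj_means[:, None]) ** 2) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

[If every subject's scores are identical across administrations, the residual variance is zero and the statistic is 1; if subjects are indistinguishable, it falls to zero or below.]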


I suppose I should add that the Pearson correlation is this in two variables. If you want to do the regression, you add a dummy variable for each person but one. That variable is equal to one if the test belongs to that person, zero otherwise. Regression would be a nice place to start, because you can add a coefficient for alternative testing conditions, or test whether people are just getting better (or worse) at the test over time.

posted by a robot made out of meat at 10:22 AM on November 27, 2007
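[A toy version of that dummy-variable setup, assuming a linear practice effect (my own example, not from the thread): regress each score on per-person dummies plus the administration number, and the time coefficient estimates how much people improve per sitting.]

```python
import numpy as np

def practice_effect(scores):
    """Fit y_ij = a_i + b * j by least squares; return the slope b.

    scores: (n_subjects, k_tests) array; columns are administrations 0..k-1.
    """
    y = np.asarray(scores, dtype=float)
    n, k = y.shape
    rows = np.repeat(np.arange(n), k)   # subject index per observation
    time = np.tile(np.arange(k), n)     # administration number

    # Design matrix: one dummy per subject (absorbing the intercept) + time.
    X = np.zeros((n * k, n + 1))
    X[np.arange(n * k), rows] = 1.0
    X[:, -1] = time

    coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
    return coef[-1]                     # slope on administration number
```

[A slope near zero supports temporal stability; a clearly positive slope is the practice effect the poster was worried about.]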


Response by poster: Thanks for the input.

Nixxon: It's rare to find the ICC used to examine test-retest reliability (that, or my PubMed search skills are appalling), but I did find a few cases. I also found a measure called Fleiss's kappa, which is the multi-rater version of Cohen's kappa. Might try that out, though at the moment the ICC seems fine.

Robot: I hadn't thought of it as a regression. We tried simple one-way ANOVAs to look at scores across testing periods; in one test they improve slightly (not all that surprising) and in another they're stable. The only issue I have with looking at this as an ANOVA is that reviewers will want a more standard reliability statistic. Looking at this from a GLM perspective is interesting. I'm going to try that out.

Thanks!

posted by Smegoid at 1:18 PM on November 27, 2007
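[For completeness, Fleiss's kappa mentioned above applies to categorical ratings, so it only fits this design if scores are binned into categories first. A minimal sketch of the standard formula, assuming a count matrix as input (my own implementation):]

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for categorical ratings by multiple raters.

    counts: (n_subjects, n_categories) array; counts[i, j] is how many of
    the n raters put subject i into category j (every row sums to n).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                            # raters per subject
    # Per-subject agreement: fraction of rater pairs that agree.
    p_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)
```

[Perfect agreement gives 1; agreement at chance level gives 0, which is why it can come out negative when raters disagree more than chance would predict.]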



posted by Smegoid at 2:49 PM on November 26, 2007