Is my reviewer an idiot?
November 8, 2008 8:03 AM

I have a within-subject experiment. I have five subjects; the effect was significant at the 0.05 level in four of the five (p=0.04, 0.02, 0.02, 0.09, and 0.01). A reviewer says that because of that 0.09, I need to run another five subjects (which would cost about $3000). Is he an idiot? If so, what argument can I make to convince him? If not, why not?
posted by anonymous to science & nature (18 comments total)
I take it you've done something like a within-subjects ANOVA for the group and that as a whole, they show a significant effect? I do think it's a little weird for your reviewer to harp on one result (which is still marginally significant) when everything else is kosher, but then, 5 is kind of a low n. How many subjects are standard in your field/sub-field?
posted by shaun uh at 8:20 AM on November 8, 2008


This question is impossible to answer with the information given (at least for me). What test are you using? Was there something different about the 0.09 outlier? Have you spoken to a statistician? I've found doing this incredibly helpful. You may not be using the best test for this experiment. Or, if you are, you may not have performed it correctly.

A comment like this always makes me a little suspicious. You've designed an experiment, carried it out and these are the results. It reminds me of a time I was at a research station and a very well respected scientist told his student to keep running an experiment until the results are significant. Your data are what they are.
posted by a22lamia at 8:24 AM on November 8, 2008


I'd take a look at the explanation and Table 13.1 here to see if this need to adjust your level of significance applies in your situation. IANABS. (BS=biostatistician :))
posted by NikitaNikita at 8:25 AM on November 8, 2008


To clarify/highlight what sticks out from that site I linked to, I interpret it as meaning that you need to adjust your significance to 0.0102, given the n. As a22lamia points out, we have no idea what you did to arrive at the individual p values for your subjects, so that will surely factor in to the issue.
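
For what it's worth, 0.0102 is the number the Šidák correction gives for five tests at an overall level of 0.05. Whether the linked table uses that formula is an assumption on my part, and the function names below are made up for illustration. A minimal sketch, next to the slightly stricter Bonferroni threshold:

```python
def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test threshold under the Bonferroni correction: alpha / m."""
    return alpha / m


def sidak_alpha(alpha: float, m: int) -> float:
    """Per-test threshold under the Sidak correction: 1 - (1 - alpha)^(1/m).

    Assumes the m tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


if __name__ == "__main__":
    # Five subjects, overall level 0.05 (the numbers from the question).
    print("Bonferroni:", round(bonferroni_alpha(0.05, 5), 4))
    print("Sidak:     ", round(sidak_alpha(0.05, 5), 4))
```

Either correction only matters if the five per-subject p-values are being treated as five separate tests, which is the thing several people here are questioning.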
posted by NikitaNikita at 8:27 AM on November 8, 2008


er, adjust what you INTERPRET AS significance, I should say.
posted by NikitaNikita at 8:28 AM on November 8, 2008


You probably need to come up with a pooled analysis. It seems very strange to me that you'd be reporting p-values for the five subjects as if you were testing five different hypotheses. I bet that you really have one hypothesis which is the same for each subject (maybe with per-subject parameters). MeMail me and I'll give you 10 minutes free of statistical thinking. IAABS grad student.
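
As a rough illustration only: one generic way to pool per-subject p-values is Fisher's combined probability test. This is a swapped-in technique, not necessarily the analysis meant above, and it assumes the five tests are independent and the effects all point the same way. The helper names are hypothetical; the chi-square tail is computed with its closed form for even degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k: sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p_i) is chi-square with 2m df under the null."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(stat, 2 * len(p_values))


if __name__ == "__main__":
    # Per-subject p-values quoted in the question.
    print(fisher_combined_p([0.04, 0.02, 0.02, 0.09, 0.01]))
```

A proper group-level model of the raw data would be preferable to combining p-values after the fact; this only shows that one p=0.09 does not sink the pooled evidence.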
posted by a robot made out of meat at 8:37 AM on November 8, 2008 [1 favorite]


I'm assuming from context that this came back with a revise and resubmit from the editor.

In which case, look at the editor's letter for hints about what to do.
posted by ROU_Xenophobe at 8:57 AM on November 8, 2008


Follow-up from the OP: "This is an fMRI experiment. 5 subjects is pretty standard for the type of experiment we are doing. Group analysis gives p=0.0003."
posted by jessamyn at 9:15 AM on November 8, 2008


I agree with a robot made out of meat. It seems odd that you are reporting multiple p-values here.
posted by grouse at 9:16 AM on November 8, 2008


Your reviewer does not understand his job, which is to assess whether the research you have done makes a useful contribution to your field. Asking an author to do additional costly data collection is ridiculous. If he thinks the paper's findings cannot be trusted because of this one result, then he should have recommended reject, not R&R. Otherwise he should let it go.

In this situation I would have a conversation with the journal editor for guidance.
posted by shadow vector at 9:52 AM on November 8, 2008


5 subjects for an fMRI experiment? In cognitive neuroscience, which is my sub-field, we very rarely see published studies with less than ten subjects, and it's usually closer to 20. (In the fMRI studies I've worked on, we've had 30-40.) To be honest, the lack of standardization in fMRI as a whole would make me skeptical about such a small experiment, even though you do seem to be getting a pretty strong effect.

Of course, you're probably working in a different sub-field with different standards, so what I just said might not apply at all. But not knowing the details...
posted by shaun uh at 10:13 AM on November 8, 2008


I agree with shadow vector, but it's possible we're not hearing the whole story. The reviewer may be upset with a claim of importance or statistical significance that they feel isn't supported by the results. The author may be interpreting this as a demand to run more experiments.

Is he an idiot? If so, what argument can I make to convince him?

If you based the statistical analysis on a standard technique, reference the previous literature. If you derived it yourself, show the derivation. If your analysis includes a bunch of p-values that were calculated without understanding their meaning, then you shouldn't be publishing it.
posted by Mapes at 10:16 AM on November 8, 2008 [1 favorite]


I'm guessing that you have something like { {mean activity@voxels/region[t=0]} != {mean activity@voxels/region[t=1]} }. If you already have the pooled analysis, it seems like that's what you want to report. I report subset analyses with NS p-values all the time; that's why you have the sample size you have instead of trying to do it off one subject. No matter how many subjects you have, the distribution of the per-subject test statistic shouldn't change. If you doubled the subject count, you would probably have another p>0.05.

It's probably just a matter of presentation. If you have a table (or equivalent figure) somewhere like
subject mean[t=1] mean[t=2] p-value

Then you should replace it with a figure like http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=509423&rendertype=figure&id=F2 where you show the group means at times and report the trends result. Also see the figures here http://www.sciencemag.org/cgi/content/abstract/293/5537/2105

On the other hand, if you have some quantitative measure of something happening which you are correlating to activity somewhere, see the figures here: http://www.sciencemag.org/cgi/content/abstract/302/5643/290
posted by a robot made out of meat at 11:44 AM on November 8, 2008


Oh, and disclaimer: I don't work on any radiology projects. Mapes is right: look for an fMRI study in Science with a similar experimental setup, and report your results the way they do. As long as you're making changes, sound conciliatory in your reply letter and say "We have changed figure/table X to a more standard presentation citation citation citation."
posted by a robot made out of meat at 11:58 AM on November 8, 2008


So what does everyone think of this idea:

Anon has a regression coefficient per-experiment, and wants to test the omnibus hypothesis that betas!=0. I suggest a serial anova on a combined model with dummy variables and interaction terms for the different experiments, looking at the output row for the first relevant coefficient (it and its interactions going last in the model).

For presentation I suggest a forest plot of the betas ± SE from the different experiments to show that it's not being driven by one or two experiments.
posted by a robot made out of meat at 8:58 AM on November 9, 2008


The Cult of Statistical Significance
posted by milkrate at 12:16 AM on November 10, 2008


If the null hypothesis is true: you should get a statistically significant result only one out of 20 times. If the null hypothesis is false: you should get one with probability p, where p is the power of your (within-individual) test. Considering that 80% of your tests had statistically significant results, that's pretty good intuitive evidence that there really is a difference.

Caveats: I don't know what the error rate on my intuitive method is. Plus, statistical significance is overrated.
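
One way to put a number on that intuition, as a sketch under stated assumptions: if the null were true for every subject and the five tests were independent, the count of results below 0.05 would be binomial with n=5 and p=0.05. The function name is hypothetical, and the 0.09 result is counted as not significant.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Chance of 4 or more "significant" subjects out of 5 if there were no effect.
    print(prob_at_least(4, 5, 0.05))
```

This treats each subject as a simple hit or miss at 0.05, so it throws away how small the individual p-values are; it is a sanity check, not a replacement for the group analysis.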
posted by parudox at 12:59 PM on November 10, 2008


Anon has a regression coefficient per-experiment, and wants to test the omnibus hypothesis that betas!=0. I suggest a serial anova on a combined model with dummy variables and interaction terms for the different experiments, looking at the output row for the first relevant coefficient (it and its interactions going last in the model).

Wouldn't it be easier just to run a multilevel model?
posted by ROU_Xenophobe at 5:02 PM on November 10, 2008

