# Statistical power and the validity of experimental results

July 27, 2014 6:03 AM

What is the relationship between statistical power and the validity of an experiment in general? If I have low power, but a very low p value, am I still OK?

I work for a dietary supplement company that also makes skin care products, and some of those products are tested clinically. Now they are talking about repeating some of the clinical tests in another region of the world in which the products will be sold because the marketing department thinks that would be good, and the question came up: How do you decide how many subjects to include?

The answer, I learn, is statistical power. You include as many subjects as are needed to give you an appropriate power, say 80%, given your chosen significance level alpha and expected effect size, for the type of statistical test you are performing. Power is 1 - beta, beta being the chance of making a type II error. So if your study is underpowered, your beta is large, and the chance of missing an effect that is actually there is high. Meanwhile, alpha is the chance of making a type I error, or seeing an effect when there is really nothing there. You compare your calculated p value to alpha, and if p is lower, your results are statistically significant and you can conclude that your observed effect is unlikely to be due to chance. But what if p is small, lower than alpha, but power is also low? Does that invalidate the results?

For example, one of the studies they want to repeat used a two-tailed paired t test to compare before and after treatment means for a measurement. Alpha was 0.05 and sample size was 30. After the fact, I calculated a Cohen's d of 0.49. All this gives a power of less than 50%, which means the study was underpowered. At the same time, p was 0.000004, much lower than alpha. I can tell the people at work that, when the study is repeated, we are going to need more subjects or we risk missing the effect that we saw in the first trial, but what can we conclude about that first trial? Power was low, but p was much lower than alpha. Are the results no good? Or can we still trust that p? Or, what is also possible, am I completely confused and all this doesn't work the way I think?

Thanks for any help you can provide!
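
For readers who want to see how the "how many subjects" question turns into a number, here is a minimal sketch using only the Python standard library. It assumes a two-sided paired test and uses a normal approximation, so it will come out slightly lower than an exact t-based calculation. The function name `approx_n_for_power` is made up for illustration, and `d` is taken to be the standardized mean of the paired differences.

```python
import math
from statistics import NormalDist

def approx_n_for_power(d, alpha=0.05, power=0.80):
    """Approximate number of pairs for a two-sided paired test.

    Normal approximation (an assumption of this sketch): an exact
    t-based calculation would ask for a few more subjects.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return math.ceil(((z_alpha + z_beta) / d) ** 2)

# Smaller expected effects need many more subjects.
for d in (0.3, 0.49, 0.8):
    print(f"d={d}: about {approx_n_for_power(d)} pairs")
```

The point of the sketch is the shape of the relationship: the required sample size grows with the inverse square of the expected effect size, so the effect size you assume going in matters a great deal.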

Best answer: The results from the first study are still fine as far as the N goes, though as metasarah notes you always have to worry about the study's design and execution. If the first study had shown no effect or a marginal effect, you would worry about whether that was because there really is no difference between before and after use or because the study couldn't distinguish a real effect from no effect. But it didn't.

"when the study is repeated, we are going to need more subjects or we risk missing the effect that we saw in the first trial" is a pretty good way to think about it.

posted by ROU_Xenophobe at 6:32 AM on July 27, 2014 [1 favorite]

Supposed to be working, so although I try to skip these types of questions, I am going to give this a go and this might appear to be over the top, but I have written papers for medications (although someone pays me to do so, which is in itself a bias, but let's leave that aside for now).

To me, a study that is underpowered does not mean anything, even if you have spectacular P values. It either is or is not <0>

It would not throw your study out the window, though, because you did a small pilot study, saw activity in (endpoint 1), and want to explore that now with a sufficiently powered study.

This is what I look for in a well-designed study. If this is eventually published, it should include these things, but I look for things in this order:

• A statement in the paper along the lines of "To be sufficiently powered to (I usually see 90%, but whatever), XXX patients needed to be enrolled" in the methods section. Almost every paper that I have ever seen has numbers in the hundreds, not 30, not 100.* But before even doing the study, do the calculation and aim to enroll a bit over that number (ie, people drop out).

• What are the predetermined primary and secondary endpoints? This should also be in the methods, because you don't want to start citing all these other benefits of the drug if the study was not designed to evaluate them.

• Characteristics of the study - double-blind, placebo, this is a given if you do a study of a certain size.

• Study population - ideally, people who would be representative of the people who will use this. But unfortunately, you will often see a study with phenomenal results (improved survival, etc.) and the users are intended to be males and females of all races, but the study population has 90% white males (you still see this). So ask in advance: Who should be your study population? Since this is a supplement, my guess is that this is not applicable, but in a clinical setting, you also want the study population to be as sick as the people who would normally take the medication (ie, you might see a study have "exclusion criteria" to knock out some of the sicker people).

• Efficacy endpoints reported with at least some reported to be <0>

• Adverse events. List them. Are there any serious ones - report this in the study. Ideally, report P values for this, but you don't always see this done.

• Is this a local population or an international, multisite study? Ideally, from all over the world. Believe it or not, some subsets might respond differently for unidentified reasons, such as a gene/mutation, the list goes on and on.

• Ethics of the study - so pay your subjects for their time to be poked and prodded, but not gifts above and beyond (don't want to bias people). If this is conducted in developing countries, follow local guidelines (does everyone agree and know what they are signing). There is usually a sentence or two stating this in the methods, too.

*Exceptions to this might be a rare disorder where it is difficult to enroll enough patients with the disease, OR a disease with a high need to treat (nothing else exists as therapy), where a phase I or II study with a few patients found benefit ...but the next step is often to do more studies with a larger population.

To go above and beyond:

• Are the endpoints clinically relevant? So it can be statistically significant, but not necessarily clinically relevant.

• Are the endpoints of value to the population who will be receiving it and/or clinicians? So what would be helpful would be to figure out what these things are in advance. Have a few meetings with clinicians and ask what they would want to see/what is important to them. Have a separate one with patients. Then these things can be added to the protocol and be considered as secondary endpoints, or even exploratory endpoints, but not something thrown on 5 years later.

• Compare your medication to... better than placebo, an active control group (ie, something already indicated for that therapeutic area and acknowledged by guidelines to be an up-to-date choice (not historic from 20 years ago) and the first recommended drug). Some people balk at this and you can imagine why.

If I were in your shoes, seriously, get a statistician to work on this before it is done.

posted by Wolfster at 8:02 AM on July 27, 2014 [5 favorites]

You should also consider asking your question at Cross Validated, a StackExchange site specifically for statistics questions.

posted by number9dream at 8:47 AM on July 27, 2014

Best answer: With respect to what to tell people at work, I think you just repeat what the statistics say more or less as is:

The probability of seeing data as extreme as what was observed given that the null is true is low. (That's your p-value. Note that this is NOT the same as saying that the null itself is unlikely to be true.)

The probability of rejecting the null with your sample size given that the null is false (and that the truth is very similar to what you actually observed) was also low.

Two things on this second bit. First, you only get the result you report for the power of your test if you assume that the truth is close to what you actually observed. For all you know, the truth is more extreme than the observation indicates, and the *real* effect size is much larger than d=0.49. If d is actually large enough, then the study isn't under-powered. (For example, suppose you measure x1_bar = 0.99, x2_bar = 0.5, and pooled_sd_hat = 1, but the true values are mu_1 = 1.2, mu_2 = 0.4, and pooled_sd = 1.1. Then you have an observed effect size of 0.49 but a true effect size of 0.7272, with a corresponding power of 0.79 for N=30 subjects per condition.) So, you really want to say that the study has power relative to a specific assumed effect size. And that effect size comes from your pre-existing background knowledge (or beliefs). In your follow-up study, you have some reason to think the true effect size is around d=0.49 based on the results of this study. But if you literally had no expectations going in, then you couldn't do a power analysis except to lay out a range of possibilities like, "Given this many subjects, we have such and so much power to detect an effect of size BLAH, such and so much power to detect an effect of size BLERG, and so on." Second, a bit of a nitpick, I'm not sure from your post whether you had N=30 *for each condition* in your study, in which case your power was about 0.46, or whether you had N=30 *total*, in which case your power was only about 0.25.

Anyway, after laying out what the statistics literally say, you can say something interpretive along the following lines. It is possible that although the null is true, we got an unlucky sample. However, if the sample accurately represents the population and the null is really false *in this specific way*, then we are unlikely to *detect* that the null is false in a future experiment unless we have a lot more subjects.

Then you can do a power analysis *before collecting more data* to make sure that you are going to have a good shot at rejecting the null again if the null is, in fact, false.

Having done all that, you should stop doing frequentist statistics and become a Bayesian. ;)

posted by Jonathan Livengood at 9:49 AM on July 27, 2014 [1 favorite]
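
The "lay out a range of possibilities" idea above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: it treats the design as two independent groups (as the N-per-condition figures above do), it uses a normal approximation rather than the exact noncentral t, and the function name `approx_power_two_sample` is hypothetical. The approximate values land close to, but not exactly on, the 0.46, 0.25 and 0.79 quoted above.

```python
import math
from statistics import NormalDist

def approx_power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)  # expected test statistic under effect d
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Effect sizes from the worked example in the answer above.
observed_d = (0.99 - 0.5) / 1.0   # sample means over sample pooled SD
true_d = (1.2 - 0.4) / 1.1        # hypothetical true means over true pooled SD

# Power depends on which effect size you assume, not on the data alone.
for d in (0.2, observed_d, true_d, 1.0):
    print(f"d={d:.2f}  n=30/group  power={approx_power_two_sample(d, 30):.2f}")

# Same observed effect, but only 30 subjects in total (15 per group).
print(f"d={observed_d:.2f}  n=15/group  power={approx_power_two_sample(observed_d, 15):.2f}")
```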

Forgive me, but firstly a glance at the data you provide suggests to me that something doesn't add up with your math. For a d=0.49, and a sample size of 30 (ie degrees of freedom=29), there's no way you should have a p-value that low. My rough back-envelope calculation suggests that that is nowhere near a large enough sample size to reach statistical significance at alpha=0.05.

Secondly, post-hoc power calculations are pointless. They mean essentially nothing with respect to the interpretation of results *already* obtained, and if you use the outcomes from the study you are post-hoc analyzing to estimate power, you are just restating what the study already demonstrated in a different form. This is another reason why your calculations seem off. To wit, if a study yielded a p exactly = 0.05, then for an alpha = 0.05, a post-hoc power calculation for that study using the obtained effect size and sample number should yield a power estimate of about 50% (and exactly 0.50 as the degrees of freedom approach infinity). From there, the lower the p-value, the higher the post-hoc power.

posted by drpynchon at 10:49 AM on July 27, 2014
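
The claim that post-hoc power just restates the p-value can be illustrated with a short sketch. It assumes the large-sample (z-test) limit mentioned above, so it is not the exact t-based figure for a small study, and the function name `posthoc_power_from_p` is made up for this example.

```python
from statistics import NormalDist

def posthoc_power_from_p(p, alpha=0.05):
    """Post-hoc ("observed") power implied by a two-sided p-value.

    Large-sample z approximation: the observed effect is plugged back in
    as if it were the true effect, so the result depends only on p and alpha.
    """
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p / 2)        # |z| that would produce this p-value
    z_crit = z.inv_cdf(1 - alpha / 2)   # rejection threshold
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)

# p exactly at alpha gives roughly 50%; smaller p gives higher post-hoc power.
for p in (0.05, 0.01, 0.001):
    print(f"p={p}: post-hoc power about {posthoc_power_from_p(p):.2f}")
```

Because the function takes nothing but p and alpha, it adds no information beyond what the p-value already says, which is the argument being made in the post.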

Sorry error on my part. In this scenario Cohen's d = t/sqrt(N) where t is the t-score and N is the number of pairs. Running the math for a *paired* t-test, if your d=0.49 and you have 30 pairs, that's a t-score of 2.68, yielding a p of 0.012, not 0.000004. Where did that number come from? By my math your post-hoc calculation of power at alpha = 0.05 should be 73%. To get to 80% power you would need to design a trial with 35 patients. But don't trust my math. I'm really hungover.

You really should be consulting a professional statistician.

posted by drpynchon at 1:31 PM on July 27, 2014 [1 favorite]
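
The arithmetic in that last post can be checked with the standard library alone. The sketch below assumes the relationship stated in the post (t = d times the square root of the number of pairs) and gets the two-sided p-value by numerically integrating the Student t density with Simpson's rule. The helper names `t_pdf` and `two_sided_p` are hypothetical, and a statistics package would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=20000):
    """Two-sided p-value: 1 minus the area under the density on [-|t|, |t|].

    Uses Simpson's rule on [0, |t|] (steps must be even) and doubles it,
    relying on the symmetry of the t distribution.
    """
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3

n_pairs = 30                   # number of before/after pairs from the question
d = 0.49                       # Cohen's d reported in the question
t = d * math.sqrt(n_pairs)     # paired design: t = d * sqrt(N)
p = two_sided_p(t, n_pairs - 1)
print(f"t = {t:.2f}, two-sided p = {p:.3f}")
```

Under these assumptions the result is a p-value on the order of 0.01, nowhere near 0.000004, so either the reported d or the reported p in the original study write-up deserves a second look.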

That said, I personally don't trust sample sizes of under 100 because just a few people can make too big a difference.

posted by metasarah at 6:29 AM on July 27, 2014 [1 favorite]