# Help me not screw up teaching my students about p-values

March 25, 2015 10:47 AM

I am a TA in an introductory college biology lab. Soon, I have to teach a class of mostly freshmen about t-tests and p-values in conjunction with a field experiment that they are conducting and on which they will have to write a report. I want to make sure I do this right. Help me!

So, I know that the venerable p-value is one of the most abused and misunderstood statistical tools in the sciences. I even wonder, sometimes, if I myself really have a strong grasp on what exactly p-values can and cannot say about data. I've been taught this stuff several times, but it tends to get a bit mushy after seeing the technique used so blithely in hundreds and hundreds of papers. So I have a sort of generic skepticism of p-values, but I feel like I'm no longer totally clear on *why* I am so skeptical. I want to improve my grasp of the subject, both so that I can be a better researcher and so that I can make sure not to fill the heads of my students with a bunch of nonsense.

I have about fifteen minutes to present this. The context is a project in which the students were asked to set pitfall traps for invertebrates in two different habitat types, gather and identify the specimens they caught, and then analyze their counts using t-tests (built into a provided Excel sheet) to look for significant differences in abundance and diversity. I want to be able to make absolutely clear to the students what it is that is happening in this test, and what the results can and cannot tell them about their data.

I would be happy to use an online video or other pre-built presentation, if it's a good one. Regardless, I want to make sure that I myself really know what's going on. So please, educate me. Straighten out for me, once and for all, what a t-test is for, what it does, and what the resulting p-value does and doesn't mean. Assume I know nothing, or that everything I know is wrong. Help me get my head on straight so that I can knock this one out of the park and avoid miseducating yet another group of future doctors, nurses, psychologists, and biologists.

I was going to blah blah blah at length never letting anyone get a word in edgewise until I foamed at the mouth and fell over backwards, but that Khan video is okay.

If you're going to beat something into their heads, it should probably be that people who say "A p-value is the probability of getting your results by chance" are WRONG and BAD people who will be quickly against the wall when the revolution comes and who will go to the special hell. The counterfactual nature of p-values is hard to wrap your head around, but it's part of the larger counterfactual-ness of classical(ish) hypothesis testing.

posted by ROU_Xenophobe at 11:27 AM on March 25, 2015 [2 favorites]

The current instantiation of the Wikipedia article on p-value is not bad. I also like an article in *Science*, "Mission Improbable: A Concise and Precise Definition of P-Value", which has an interview with someone "passionate about his p-values".

P-values are a case where I find using mathematical notation actually helps understand a complex concept. You can define *p* as the probability that, if you were given that the null hypothesis (*H*_{0}) were true, you would get a result *X* as extreme or more extreme than the one you actually observed, *x*. In other words, you can usually represent the p-value as *p* = *P*(*X* ≥ *x* | *H*_{0}).

The problem is most people don't really want to know the probability of their data given the null hypothesis. They want to know how likely one hypothesis or another is given the data, which would be something more like *P*(*H*_{A} | *X* ≥ *x*).

Unfortunately there's not a way to define the probability of a hypothesis that everyone would actually agree on. The idea of the probability of a *hypothesis* (rather than of some specific observations) does not fit into an orthodox frequentist framework. A Bayesian could define it in terms of Bayes' rule, but this requires defining a prior probability, and not everyone will agree on that.

posted by grouse at 11:30 AM on March 25, 2015 [4 favorites]
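[Ed.: grouse's formula can be made concrete with a small simulation. The sketch below estimates *p* = *P*(*X* ≥ *x* | *H*_{0}) by permutation for a two-habitat comparison like the one in the lab; the trap counts and the choice of difference-in-means as the test statistic are illustrative assumptions, not anything from the thread.]

```python
import random

# Hypothetical pitfall-trap counts per trap in two habitat types
# (made up for illustration).
forest = [12, 15, 9, 14, 11, 13]
meadow = [8, 10, 7, 11, 9, 6]

observed = abs(sum(forest) / len(forest) - sum(meadow) / len(meadow))

# Under H0 the habitat labels are exchangeable, so shuffle the labels
# and count how often the shuffled difference in means is as extreme
# or more extreme than the observed one.
random.seed(42)
pooled = forest + meadow
n, extreme, trials = len(forest), 0, 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / n)
    if diff >= observed:
        extreme += 1

p_value = extreme / trials  # estimates P(X >= x | H0)
print(f"observed difference = {observed:.2f}, permutation p = {p_value:.3f}")
```

The shuffling plays the role of the null hypothesis: if habitat really made no difference, the labels would be arbitrary, so reshuffling them shows what "as extreme or more extreme" results look like by chance alone.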

There are some interesting thoughts (and experiments!) on p-values here, especially in terms of p-values and experimental replication. (Spoiler: Repeating the same experiment with random samples on the same data can give a pretty shocking range of p-values.)

posted by clawsoon at 12:56 PM on March 25, 2015
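[Ed.: clawsoon's spoiler is easy to demonstrate. The sketch below reruns one and the same experiment many times; the effect size, sample size, and the use of a z-test with known variance (rather than a t-test, to keep the example dependency-free) are my simplifying assumptions.]

```python
import math
import random

random.seed(0)

def replicate_once(n=20, effect=0.5):
    # One simulated replication: two normal samples with a real 0.5-sd
    # difference, compared with a two-sided z-test (sigma known = 1).
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(effect, 1.0) for _ in range(n)]
    z = (sum(b) / n - sum(a) / n) / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value

# Run the identical experiment 1000 times; only the random sampling varies.
ps = sorted(replicate_once() for _ in range(1000))
print(f"p-values range from {ps[0]:.4f} to {ps[-1]:.2f} (median {ps[500]:.3f})")
```

The experiment never changes, yet the p-values sprawl from far below 0.05 to well above it, which is the "shocking range" the comment mentions.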

It's a bit of a tangent from your exact question, but it's worth noting that not everyone loves the idea of P-values. If I were teaching undergrads about P-values, I would make sure they understand their limitations and where their application is warranted.

posted by Betelgeuse at 1:43 PM on March 25, 2015

*If you use p=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.*

A highly readable summary of the problems with false positives, p-values, and t-tests.

posted by a lungful of dragon at 2:27 PM on March 25, 2015
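[Ed.: the quoted "at least 30%" claim rests on false-discovery arithmetic of roughly the following shape. The inputs below (1000 tests, 10% of tested hypotheses true, 80% power) are illustrative assumptions commonly used when presenting this argument, not figures taken from the linked summary.]

```python
# Of 1000 hypotheses tested, suppose 10% are real effects, the tests
# have 80% power, and the significance cutoff is 0.05.
n_tests = 1000
prior_real = 0.10
power = 0.80
alpha = 0.05

true_positives = n_tests * prior_real * power          # 80 real discoveries
false_positives = n_tests * (1 - prior_real) * alpha   # 45 false alarms
fdr = false_positives / (false_positives + true_positives)
print(f"{fdr:.0%} of 'discoveries' are false")  # 45 / (45 + 80) = 36%
```

The point is that "p < 0.05" caps the false-alarm rate among *null* hypotheses, not the error rate among your *discoveries*; the latter depends on how many tested hypotheses were real to begin with.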

The basics are, of course, that a p-value tells you what the probability is that, if the null hypothesis were true, you would have observed the data that you did. The problem is that people then want to use a p-value to decide whether results "mean something". Here are three reasons why this isn't valid:

1. As grouse said, you get the probability of the results given the null hypothesis, but you want the probability of the null hypothesis given the results. Going from one to the other is nontrivial and requires more information than the p-value itself contains.

2. The more P-values you collect, the more probable it is that one of them will be "significant" due to random chance. This is true both within a given experiment -- you should always keep track of how many comparisons you run and do a Bonferroni correction or similar on their significances -- and across experiments. If lots and lots of people are investigating similar things, chances are some of them will find significant results, and if they are the only ones to publish, bad things will happen.

3. What does probability even mean? You may think that this is philosophical navel-gazing, but the way you handle p-values really does change depending on what you think a probability is. For instance, say that you ran this lab and found that the results the students got weren't what you expected. Would that change your beliefs about these invertebrates, or indeed about the "outside world" in any way? Or would you assume that something was wrong with the lab setup? Why?

One thing that might be illustrative is teaching an alternative to p-values at the same time. The idea in this paper is that you specify what the alternative is -- none of these "we therefore must reject the null hypothesis" nonsense -- and then use the statistics to tell you which hypothesis is more likely (the null or the alternative), and by how much. Why does this get around some of the problems of p-values? What are its strengths and weaknesses?

This is a lot for 15 minutes, but God knows *someone's* got to do this right at some point.

posted by goingonit at 2:31 PM on March 25, 2015 [1 favorite]
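[Ed.: goingonit's point 2 can be quantified in a few lines. The numbers (20 independent comparisons, a 0.05 cutoff) are illustrative.]

```python
# Chance of at least one spurious p < alpha among m independent tests
# of true null hypotheses, and the Bonferroni fix of testing each
# comparison at alpha/m instead.
alpha, m = 0.05, 20
p_any = 1 - (1 - alpha) ** m
p_any_bonferroni = 1 - (1 - alpha / m) ** m

print(f"P(>=1 false positive in {m} tests) = {p_any:.2f}")            # 0.64
print(f"same, after Bonferroni correction  = {p_any_bonferroni:.3f}")  # 0.049
```

With 20 comparisons, a "significant" result somewhere is more likely than not even when nothing is going on; the Bonferroni correction pulls the family-wide false-positive chance back down to roughly the nominal 0.05.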

Might be useful to (very simply) summarize a few of the big p-value pitfalls for your students so they have some perspective:

One common mistake (made by many professional scientists, believe it or not!) is that people will gather data until a "significant" p value (typically p<0.05) is reached, and then will stop gathering data or doing experiments. This is cheating!

Usually p<0.05 is considered to be statistically significant and p<0.01 highly significant, but these cutoffs are **arbitrary** and don't necessarily mean that the effect is "real."

Sample bias (e.g. the sample distribution isn't representative of the true population distribution) can give a p-value that is misleadingly too large or too small.

Very small effect sizes (difference between control and experimental groups) are suspect even if the p-value is small.

posted by phoenix_rising at 2:35 PM on March 25, 2015 [1 favorite]
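[Ed.: the first pitfall above, optional stopping, is easy to simulate. The sketch assumes a true null and a z-test with known variance (my simplifications), and peeks at the p-value after every batch of observations.]

```python
import math
import random

random.seed(1)

def p_value(xs):
    # Two-sided z-test of 'mean = 0' with sigma assumed known = 1.
    n = len(xs)
    z = (sum(xs) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))

# The 'cheating' described above: under a TRUE null, peek after every
# batch of 10 observations and stop as soon as p < 0.05.
trials, hits = 2000, 0
for _ in range(trials):
    xs = []
    for _batch in range(10):  # up to 100 observations, 10 peeks
        xs.extend(random.gauss(0.0, 1.0) for _ in range(10))
        if p_value(xs) < 0.05:
            hits += 1
            break

print(f"false-positive rate with peeking: {hits / trials:.2f} (nominal 0.05)")
```

Even though every individual peek uses the nominal 0.05 cutoff, stopping at the first "significant" look inflates the false-positive rate several-fold, which is why the sample size should be fixed in advance.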

Make sure they learn that a smaller p-value means that we are more certain that the effect is not due to chance, not that the effect ITSELF is stronger.

posted by wittgenstein at 2:37 PM on March 25, 2015 [1 favorite]
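[Ed.: wittgenstein's distinction can be shown numerically. The sketch uses a z-test with known variance to keep it dependency-free; the effect and sample sizes are illustrative.]

```python
import math

def z_test_p(mean_diff, n):
    # Two-sided p for a two-sample z-test, sigma assumed = 1 per group.
    z = mean_diff / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))

# A tiny effect measured on a huge sample can earn a smaller p-value
# than a large effect measured on a small one.
p_tiny_effect = z_test_p(mean_diff=0.05, n=10_000)
p_large_effect = z_test_p(mean_diff=0.8, n=10)
print(p_tiny_effect < p_large_effect)  # True
```

The p-value mixes effect size and sample size together, so it cannot by itself tell you how strong the effect is; that's what effect-size measures are for.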

*p-value tells you what the probability is that, if the null hypothesis were true, you would have observed the data that you did*

Close. It tells you the probability that you would have observed the data that you did, or more extreme data. So for a *t*-test, if you calculate based on your observed data *t* = 2, the p-value is the probability that, if the null hypothesis were true, you would have observed *t* ≥ 2.

posted by grouse at 3:23 PM on March 25, 2015

I suggest giving them this cartoon:

https://xkcd.com/882/

posted by HoraceH at 3:26 PM on March 25, 2015 [2 favorites]

Can I piggyback on this question to ask why there is a resistance to Bayesian hypothesis testing? I accept that, as grouse mentions, it can be difficult to find an appropriate prior. But in such a case, can't we just use an uninformative prior? Assuming that the cost function does not penalize one error more than the other, then the test just becomes a likelihood ratio test, right? And that's a frequentist thing?

posted by tickingclock at 3:55 PM on March 25, 2015
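[Ed.: tickingclock's likelihood-ratio observation holds in the simplest case of two point hypotheses, where no prior is needed at all. The data and hypotheses below are made up for illustration.]

```python
import math

# Two SIMPLE hypotheses about a normal mean (sigma = 1):
# H0: mu = 0 versus H1: mu = 1.
data = [0.9, 1.2, 0.4, 1.1, 0.7]

def log_likelihood(mu):
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)
               for x in data)

likelihood_ratio = math.exp(log_likelihood(1.0) - log_likelihood(0.0))
print(f"data are {likelihood_ratio:.1f}x more likely under H1 than H0")
```

With a flat prior over the two hypotheses, this ratio is exactly the Bayes factor; the disagreements start once the hypotheses are composite and a prior over parameters has to be chosen.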

It's difficult to get people to *agree* on what the prior should be. People will try a number of different priors and the one that gives a positive result will get published.

Joseph Simmons, Leif Nelson, and Uri Simonsohn wrote a paper, "False-Positive Psychology", that introduces the now-influential concept of researcher degrees of freedom: the number of different choices researchers can make when collecting and analyzing data. Adding Bayesian approaches as an alternative to frequentist approaches increases the researcher degrees of freedom:

*Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom. First, it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data. Second, Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by-case basis, providing yet more researcher degrees of freedom.*

I dislike p-values because they are so poorly understood and because they don't really answer the question people want them to. But to replace them effectively with Bayesian methods we'd have to agree on the nature of the replacement in advance, and not give individual scientists the choice of frequentist methods or whatever Bayesian method gives a good result.

posted by grouse at 4:53 PM on March 25, 2015 [2 favorites]

This thread is closed to new comments.

posted by radsqd at 10:58 AM on March 25, 2015 [3 favorites]