How do I explain sample size in layman's terms?
November 12, 2014 8:54 AM

How do I explain the principles of minimum detectable effect, statistical power, and statistical significance to a client?

My client's business has 100k monthly visitors and a 1% conversion rate on their home page. After using Optimizely to test a new home page, they're ready to call it quits after 2k visitors, saying they usually "spot problems in the funnel in less than 1k visits."

How do I explain the principles of minimum detectable effect, statistical power, sample size, and statistical significance (like Optimizely's A/B Test Sample Size Calculator) in layman's terms?
posted by Avenger50 to Technology (13 answers total) 8 users marked this as a favorite
I wouldn't get too technical. I'd use an analogy. Maybe doctors are trying to figure out whether people become energetic when they take vitamin D. So they split people randomly into groups who take the vitamin and those who don't and ask participants how energetic they feel.

But many other things also affect energy -- like sleep, stress, hydration, etc. If the pill made a difference, but only a small one, wouldn't it be hard to tell given all these other factors that might swamp it? But the doctors might care even about a small difference, because over a large population, even a minor increase in energy levels would be a worthwhile discovery.

How many people would the doctors need to test to see if the vitamin made a difference? Well, if you have a large enough group of people, the other factors tend to wash out, and the difference the pill makes becomes clearer. The larger the group of people, the more powerful the magnifying glass that enables you to see the effect of the particular thing you're investigating. Of course, if the thing makes a huge difference, it will be apparent even with a relatively small group of people. But if you want to make sure you catch even very small differences, you need quite a powerful magnifying glass.

At a certain point, though, if the difference is absolutely minuscule, you stop caring. So you have to figure out how small a difference still matters.

So statisticians have calculated the exact numbers needed, depending on how small a difference you want to be able to detect and how confident you want to be that you won't miss a difference that really is there.

And in this case, if you care about a difference this small in your conversion rate, 2k people just isn't enough; you might need, say, 10k.
posted by shivohum at 9:29 AM on November 12, 2014 [2 favorites]

"I can calculate the probability that the results we are seeing are due to CHANCE, not due to the effect of the new home page.

Since we're testing this, we want to be sure of the effect of the new home page - not just a random blip in the numbers. We need more visits to reliably do this. This isn't my opinion - this is the foundation of statistical analysis."

You can also show them Optimizely's calculator and walk them through the concepts.
posted by entropone at 9:30 AM on November 12, 2014 [4 favorites]

I'm always a fan of the bag of marbles/deck of cards analogies. How many marbles would you have to pull out of a bag of N marbles, W of them white, to get an idea of the proportion W/N?
posted by klangklangston at 9:33 AM on November 12, 2014 [1 favorite]

If you're talking about user experience issues leading to abandonment, it's true that user experience problems can be spotted with a smaller sample than would be statistically acceptable. That might be what they're referring to.
posted by bleep at 10:00 AM on November 12, 2014

I don't think you should feel compelled to explain those particular topics, even if your knowledge of them underlies your understanding.

I imagine your client already understands that you can test things and draw conclusions from a relatively small sample. If you think they are drawing conclusions too quickly, then explain that particular point. There's no need to go back to Statistics 101 to produce an entire conceptual edifice.

What's the chance that there is a meaningful improvement in the "true" conversion rate (say up to 1.5%) but the observed conversion rate shows a decrease? Show them that number. Or give them a confidence interval for the new conversion rate; if it's really wide, they'll understand that the data hasn't narrowed things down yet.
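Here is a minimal sketch of both numbers, assuming a normal approximation and an even split of the 2k visitors (1,000 per page). The function names and the "10 conversions out of 1,000" input are hypothetical placeholders; plug in the client's real counts.

```python
from math import sqrt
from statistics import NormalDist


def prob_observed_decrease(p_old, p_new, n_per_arm):
    """Approximate chance that the new page LOOKS worse than the old one
    even though its true rate is p_new > p_old (normal approximation)."""
    se = sqrt(p_old * (1 - p_old) / n_per_arm + p_new * (1 - p_new) / n_per_arm)
    return NormalDist().cdf(-(p_new - p_old) / se)


def wilson_interval(conversions, visitors, confidence=0.95):
    """Wilson score interval for a conversion rate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = conversions / visitors
    denom = 1 + z**2 / visitors
    centre = (p_hat + z**2 / (2 * visitors)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / visitors + z**2 / (4 * visitors**2)) / denom
    return centre - half, centre + half


# Assumption: true rates of 1% (old) and 1.5% (new), 1,000 visitors per page.
print(f"Chance the better page looks worse: {prob_observed_decrease(0.01, 0.015, 1000):.1%}")

# Hypothetical observation: 10 conversions from 1,000 visitors on the new page.
low, high = wilson_interval(10, 1000)
print(f"95% interval for the new page's rate: {low:.2%} to {high:.2%}")
```

If the interval comfortably covers both "worse than the old page" and "much better than the old page", that makes the point without any Statistics 101.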

And be wary of assuming that the client is dumb. The client is running a business, not doing a statistics problem set. If it takes until the heat death of the universe to figure out if something works or not, or if only unreasonably large improvements can be reasonably detected, then the client has to try things out and rely on a combination of noisy data and gut instinct to make decisions. There's nothing wrong with making a decision even if the p-value is bigger than 0.05.
posted by leopard at 10:22 AM on November 12, 2014 [7 favorites]

Rather than the two-group comparison sample size, I like to focus on the simpler single group.

Let's say a new drug kills one out of a thousand test patients. How would you know this? If you looked at one thousand people, maybe by chance it would be the 1001st person who died. Or maybe you would have by chance two people die in the first thousand and then, by chance, none in the second thousand. Here you have, again, one in one thousand dying (or 2/2000).

How many patients would you have to survey to make sure that 1 in 1000 was the correct number? Let's say you want to be 95% sure (p = 0.05). That's the question of sample size.
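A rough sketch of a simpler, related question: how many patients you need before you are 95% likely to see even one death, assuming each patient is an independent 1-in-1000 draw. This is not a full sample size calculation for pinning down the rate, and the function names are made up for illustration.

```python
from math import ceil, log


def prob_no_events(rate, n):
    """Chance of seeing zero events among n independent patients."""
    return (1 - rate) ** n


def n_for_at_least_one(rate, confidence=0.95):
    """Smallest n such that P(at least one event) >= confidence."""
    return ceil(log(1 - confidence) / log(1 - rate))


rate = 1 / 1000
print(f"Chance of no deaths at all in 1,000 patients: {prob_no_events(rate, 1000):.0%}")
print(f"Patients needed to be 95% sure of seeing one: {n_for_at_least_one(rate):,}")
```

Even just noticing a rare event reliably takes about three times as many patients as the "1 in 1000" figure suggests; measuring its rate precisely takes more.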
posted by dances_with_sneetches at 12:04 PM on November 12, 2014 [1 favorite]

Response by poster: bleep: I don't understand that. Aren't all user experience issues related to "abandonment"? If people aren't signing up?

leopard: I'm not assuming the client is dumb. I'm assuming they don't know the importance of statistical power and sample size. The calculator says 31k visitors per test for their scenario at an MDE of 20%. The smaller I make the MDE, the larger the visitor set is.
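To show the MDE/sample-size trade-off, here is a sketch using the textbook two-proportion formula. This is an assumption about method: Optimizely's calculator may use different settings or a different formula, so the output should land in the same range as the 31k figure rather than match it exactly. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(baseline, relative_mde, alpha=0.05, power=0.8, two_sided=True):
    """Textbook sample size per variation for comparing two conversion rates."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# 1% baseline conversion rate, shrinking the minimum detectable effect.
for mde in (0.50, 0.20, 0.10, 0.05):
    n = visitors_per_variation(0.01, mde)
    print(f"MDE {mde:.0%}: {n:,} visitors per variation")
```

Halving the MDE roughly quadruples the visitors needed, which is why small improvements on a 1% baseline are so expensive to confirm.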
posted by Avenger50 at 12:09 PM on November 12, 2014

How much money do they lose if the conversion rate the test is currently showing holds true through all 31k visitors you want to test? Plus whatever they'd have to pay you to continue the experiment, if that's an issue? That's what you're fighting against, so argue in those terms. How much money could they earn from a reasonable best-case scenario and what are the chances of that? They're not paying you to make analogies, they're paying you to apply your expertise to the particulars of their business, so do that.
posted by acidic at 12:51 PM on November 12, 2014 [2 favorites]

Have you considered the possibility that it may be perfectly reasonable for them to want to cut bait on something that doesn't show a statistically significant impact in 1-2K visits, whether or not they fully understand the implications of sample size on predictive power?

They have a sunk cost in the experiment so far, and ongoing costs for continuing it, if only due to the opportunity cost of not being able to use available resources for other experiments that have a better chance of a larger payoff.

Given the current non-result, what are the chances that continuing this experiment will produce evidence of a significant result with a return that will exceed their historical average return on such experiments? Do you have reason to believe that they have reached the point where they are unlikely to find optimizations that produce a significant result in 1-2K trials? Do you have reason to believe that the results observed in past experiments with 1K sample sizes were likely the result of chance?
posted by Good Brain at 5:20 PM on November 12, 2014 [1 favorite]

Perhaps the Cartoon Guide to Statistics can help.
posted by Captain Chesapeake at 8:11 PM on November 12, 2014

Sometimes you don't need a number. You just need to see that you're not immediately deluged with hate mail, and that's enough info combined with other considerations to make the decision.

But Khan Academy has some good short, understandable statistics videos you might steal something from.
posted by ctmf at 9:41 PM on November 12, 2014

I could be your client and agree with those who advise you to tread carefully. I totally see how the analogy approach could work with some people, but it would be easy to make me feel like you think I'm an idiot.

If you wanted to convince me to keep going, you should show me periods in the past where 2000 visitors also did the thing they do now, with the old design. So, if the problem is that the new design has fewer people sign up for a mailing list, and you tested on a Monday, find me another Monday where few people signed up for the mailing list (do check for holidays or other special events like a link from a high-profile website resulting in tons of visitors but no signups), or give me a chart with the variation in mailing list signup conversions for the past year or so on the old design, so that I can see that it is possible that this new number is due to chance. What I'm saying is: keep your explanation relevant to my site.
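If the historical data is hard to pull, a sketch like this shows how much pure chance moves the count in any batch of 2,000 visitors. It assumes a steady 1% rate and independent visitors (real traffic has weekday and referral effects on top of this), and the function names are illustrative only.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """P(exactly k conversions from n visitors), computed in log space."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


def central_range(n, p, coverage=0.95):
    """Range of conversion counts covering the middle `coverage` of outcomes."""
    tail = (1 - coverage) / 2
    cumulative = 0.0
    low = None
    for k in range(n + 1):
        cumulative += binomial_pmf(k, n, p)
        if low is None and cumulative >= tail:
            low = k
        if cumulative >= 1 - tail:
            return low, k
    return low, n


low, high = central_range(2000, 0.01)
print(f"With no change at all, 2,000 visitors give {low} to {high} conversions 95% of the time")
```

A new-design count that falls inside that range looks just like an ordinary batch on the old design.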
posted by blub at 12:25 AM on November 13, 2014

Something unmentioned by the original poster -- what effect size are they expecting to be powered for?

At 2k (1-tailed, power=.8, alpha=.05, p(occur)=.01), they are powered to detect an ~80% increase. So, even at 2k people, they know pretty well that the new page isn't twice as good. If they are expecting an improvement of 10% (from 1 -> 1.1% say), then they are underpowered.
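A sketch of how a detectable-lift figure like that can be found, assuming a one-tailed test, a normal approximation with unpooled variance, and 2k visitors per page. The answer shifts with the variance approximation and with whether 2k is per page or in total, so expect the same ballpark (a lift approaching a doubling) rather than an exact match. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_tailed(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power to detect p2 > p1 with n_per_arm visitors per page."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return NormalDist().cdf((p2 - p1) / se - z_alpha)


def detectable_lift(p1, n_per_arm, target_power=0.8, alpha=0.05):
    """Smallest relative lift detectable at the target power (bisection)."""
    lo, hi = 0.0, 10.0  # search between +0% and +1000%
    for _ in range(60):
        mid = (lo + hi) / 2
        if power_one_tailed(p1, p1 * (1 + mid), n_per_arm, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return hi


for n in (2000, 30000):
    print(f"{n:,} per page: detectable lift of about {detectable_lift(0.01, n):.0%}")
```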

For rare events like these (1% is rare!), I like to explain it in terms of heart attacks. Because heart attacks are rare, one has to watch a lot of people to make claims about what causes heart attacks. In this case, asking if the page increased conversion by 20% (for example) is like asking if 24/2000 is reliably different from 20/2000. OTOH, 360/30000 vs 300/30000 is reliably different! (That's the whole claim of the sample size calculator.)
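To make "reliably different" concrete, here is a sketch of a one-tailed two-proportion z-test. The counts are illustrative, not the client's data: a 20% relative lift on a 1% baseline at 2,000 and at 30,000 visitors per page. The function name is made up.

```python
from math import sqrt
from statistics import NormalDist


def one_tailed_p_value(conv_old, n_old, conv_new, n_new):
    """P-value for 'new page converts better', pooled two-proportion z-test."""
    p_old, p_new = conv_old / n_old, conv_new / n_new
    pooled = (conv_old + conv_new) / (n_old + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / se
    return 1 - NormalDist().cdf(z)


# Same 1% -> 1.2% lift, two very different sample sizes.
print(f"20 vs 24 out of 2,000:     p = {one_tailed_p_value(20, 2000, 24, 2000):.3f}")
print(f"300 vs 360 out of 30,000:  p = {one_tailed_p_value(300, 30000, 360, 30000):.3f}")
```

The lift is identical in both rows; only the larger sample pushes the p-value below the usual 0.05 threshold.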

As for them "spotting problems in the funnel", that suggests:

a. they have additional information (value per transaction) that should guide this, or
b. the human cognitive bias of seeing patterns where there aren't any, or overweighting evidence.
posted by gregglind at 12:41 PM on November 13, 2014
