Some elementary statistics
December 8, 2008 3:14 PM

Elementary statistics. Is it a significant delta?

Let's say you are the president of a national fast food chain. You ask all store managers to try a test.

50% of the time, customers will be greeted "Welcome" instead of "Hello." Each cashier will alternate back and forth, customer by customer, making no logical choice about when to use either greeting, just distributing them 50/50.

At the end of the week, the results are in. You need to report the findings, but you're not sure if they are statistically significant.

10 million customers were served.
5 million heard "Welcome."
5 million heard "Hello."

Average purchase of customers who heard "welcome:" \$5.02
Average purchase of customers who heard "hello:" \$5.10

Number of "Welcome" customers who ordered an Apple Pie: 198,400
Number of "Hello" customers who ordered an Apple Pie: 200,345

Does "Hello" really raise average purchase size?
Does "Welcome" actually discourage buying an Apple Pie?

I'm not sure if this question has to do with sample size or not, because for the duration of this test, the sample size was 100%. Every customer who came into the chain was included in the experiment.

I guess my questions are:

Is 8 cents out of \$5.10 a statistically significant change, among a sample of 10 million people?

Is another 1945 pies significant?

What is your test's margin of error / confidence?

Could you have run this only at a few stores and gotten just as solid results? How big a sample did you actually need?

Bottom line: Should your chain do anything based on these results?

I need some basic formulas for evaluating data like these.
I am not great with math, nor familiar with statistics. I don't know what a sigma, standard deviation, or T-test is (although I have heard the terms). I got Cs in Algebra and Calculus. I am bad at Poker.

Hope me!
posted by scarabic to Science & Nature (15 answers total) 1 user marked this as a favorite

You need the standard deviation of the purchase amounts.
posted by If only I had a penguin... at 3:42 PM on December 8, 2008

It's impossible to be sure about statistical significance without the sample standard deviation of money spent (or pies bought) for both subgroups. The appropriate test would be a two-sample t-test. However, to answer a couple of your questions:

Is 8 cents out of \$5.10 a statistically significant change, among a sample of 10 million people?

Almost certainly. Statistical significance is based on the standard error, a quantity which contains sample size in the denominator. With a sample size as large as 10 million even a trivial difference will tend to be statistically significant.

Bottom line: Should your chain do anything based on these results?

Do you have any a priori reason to think that "Hello" actually inspires people to part with their money more effectively than "Welcome" does? Statistical significance, in the absence of a clear causal theory, is pretty pointless. Data mining of this sort can produce statistically significant results, but this still does not mean that these patterns are anything but coincidental. Theory is what structures our empirical results and helps us to frame which ones matter and which ones do not. Without a good reason to think that the greeting matters in the first place, I would not change a policy based on results such as those you describe.
posted by shadow vector at 3:45 PM on December 8, 2008 [1 favorite]

Sorry, I just saw your last sentence. The standard deviation is (not quite, but you can think of it as) how far, on average, purchases are from the mean purchase. If your mean is \$5.10 and the purchases are generally \$5.00, \$5.05, \$5.02, \$5.15, \$5.17 etc. etc. then most purchases are pretty close to the mean and you have a small standard deviation. If your purchases are 75 cents, 5 cents, 45 cents, \$43, 65 cents, \$156, then you have a bigger standard deviation.

If you have a very small standard deviation they might be significant. If you have a large standard deviation, they're probably not. Without the standard deviation, you don't have the information you need to actually calculate significance level.
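
To make that concrete, here is a rough Python sketch using the two sets of example purchases above. The lists are illustrative only, not real data from the chain, and `statistics.stdev` gives the sample standard deviation.

```python
import statistics

# Hypothetical purchase amounts in dollars, taken from the examples above.
tight = [5.00, 5.05, 5.02, 5.15, 5.17]
spread = [0.75, 0.05, 0.45, 43.00, 0.65, 156.00]

for name, purchases in [("tight", tight), ("spread", spread)]:
    mean = statistics.mean(purchases)
    sd = statistics.stdev(purchases)  # sample standard deviation
    print(f"{name}: mean = {mean:.2f}, standard deviation = {sd:.2f}")
```

The first list has a standard deviation of a few cents; the second has one of tens of dollars, so an 8 cent gap between group means would be much harder to distinguish from noise.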
posted by If only I had a penguin... at 3:46 PM on December 8, 2008

Average purchase of customers who heard "welcome:" \$5.02
Average purchase of customers who heard "hello:" \$5.10

Start by framing your "null" and "alternative" hypotheses. These are the two questions you want to test:

H0 : The true mean ("average") purchase totals of these two groups are equal; any difference you observed is just chance
H1 : The true mean totals are different

You want to figure out what your purchase data suggest about the truth of either H0 or H1.

The most common test used for this is probably Welch's t-test, which can test whether two means are equal or not (given a couple of assumptions that seem reasonable to apply here).

You will need: the two groups' means (\$5.02, \$5.10), the two groups' standard deviations and variances (you'll calculate these), and the number of data points in both groups (5M, 5M).

All you do is plug these values into the t equation; that's your "statistic" or "test statistic", what you're calculating to do your test.

Then you look up the t value you get back against a t-table, lining it up against the "degrees of freedom" (see the Welch's test Wikipedia page). Or you have software do this for you.

The t-table or software will tell you what is significant, i.e. which hypothesis you can suggest is true, based on the data.
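
As a rough sketch of the "plug these values in" step, here is Welch's t computed from summary numbers in plain Python. The means and group sizes come from the question; the \$3.00 standard deviation for each group is a made-up assumption, since the real one was never given. The p-value uses a normal approximation, which should be reasonable only because the degrees of freedom are in the millions.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom from summary stats."""
    v1 = sd1**2 / n1  # squared standard error of group 1's mean
    v2 = sd2**2 / n2  # squared standard error of group 2's mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

def two_sided_p_normal(t):
    """Two-sided p-value from the normal approximation (for very large df)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Means and sizes are from the question; the 3.00 standard deviations are ASSUMED.
t, df = welch_t(5.10, 3.00, 5_000_000, 5.02, 3.00, 5_000_000)
p = two_sided_p_normal(t)
print(f"t = {t:.1f}, df = {df:.0f}, p = {p:.3g}")
```

With that assumed spread, t comes out above 40, far past any usual cutoff. A much larger real standard deviation would shrink t in proportion, which is why the spread of the data is the missing piece.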
posted by Blazecock Pileon at 4:09 PM on December 8, 2008

Could you have run this only at a few stores and gotten just as solid results? How big a sample did you actually need?

Good analogies are political polls or census taking.

Political parties can't afford to take everyone's opinion. So political opinions are measured from a small sample of "random" people in the larger population. Those "random" samples are used to infer the political leanings of the larger population.

"Random" is in quotes because how polling is done, what questions are asked, and the criteria used to randomly select people to ask can be argued to introduce certain biases in the results you get.

For example, if you poll people who own land-line telephones, on average they will be older than someone picked at random from the entire country. So do their political opinions represent those of everyone in their country, or just those in their same age bracket? Good design mitigates some of these issues.

Mathematically, you don't need massive samples. Getting a larger sample will reduce your error, but there are diminishing returns. So you balance the error rate you're comfortable with against your budget for doing all of the extra work required.
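
Here is a small Python sketch of those diminishing returns, using the usual approximate 95% margin of error for a proportion. The 0.04 rate is roughly the apple pie rate from the question; the sample sizes and the function name `margin_of_error` are just picked for illustration.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n observations."""
    return z * math.sqrt(p * (1 - p) / n)

# p = 0.04 is roughly the apple pie rate in the question; sample sizes are arbitrary.
for n in [1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    print(f"n = {n:>10,}: 0.04 +/- {margin_of_error(0.04, n):.5f}")
```

The error shrinks with the square root of the sample size, so each extra decimal place of precision costs 100 times as many customers.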
posted by Blazecock Pileon at 4:35 PM on December 8, 2008

shadow vector: Statistical significance, in the absence of a clear causal theory, is pretty pointless.

Agreed. You need to have some idea of WHY "hello" might be different from "welcome." For example, perhaps "Welcome" is seen as a too-formal greeting in that particular culture. Otherwise, the math is meaningless, because correlation is not causation.
posted by desjardins at 5:39 PM on December 8, 2008

Okay conceptually I understand that the amount of variation in the data bears on the overall delta in the end. That makes sense. I will follow up on the resources here, thanks.

What about the apple pie test, though, where you either bought a pie or didn't? There doesn't seem to be a standard deviation in the same way there. The discrete event sometimes took place. That is all.

How do you handle that one? I think it may be closer to my actual application. I'm analyzing what people do on websites. Example:

1) We tried button X on the left and the right of the page, 50/50.
2) Left got 500 clicks, Right got 480 clicks. There were 40,000 pageviews.

Is the left better for clicks? Perhaps this "discrete" example is the same as the average purchase amount but I'm just missing how.

I agree with the point about having a causal theory with which to interpret the data. Sometimes, changes happen that are not easily explained. It's hard to understand clearly exactly what's going on as millions of people click madly through the web. In those cases, you sometimes have the choice to trust the data blind, or do nothing because you can't explain the data. Being able to validate that the data is statistically significant would help me decide whether to ignore it.
posted by scarabic at 6:35 PM on December 8, 2008

Okay, that's a reasonably different application. And one in which I could imagine that the placement of a button has an impact on clickthrough.

Here, since you have a discrete variable, we'll use the two-sample test of a proportion. You have a 500/20000 (0.025) success rate on the left and a 480/20000 (0.024) success rate on the right. The difference in these proportions is not statistically significant. For these two proportions to be statistically different you would need about 10 times as much data.

Let me elaborate on what this is doing. We first calculate a standard error by taking our sample standard deviation (for a proportion, there is still a standard deviation, which equals sqrt(p*(1-p))) and dividing it by the square root of the sample size. This, essentially, is a measure of how uncertain we are about our estimate of the true population proportion. The difference across the two groups (0.001 here) is divided by the standard error of that difference to give something called a test statistic. We then plug the test statistic into the correct distribution and get a p-value. This describes the probability that we would have seen a difference at least as large as the one in our sample if the true population proportions were identical. If the p-value is sufficiently low then we can conclude that the underlying population proportions are different. This p-value is 0.5177, so we cannot conclude that. More data will increase the sample size and so reduce the standard error, meaning that finer distinctions can be teased apart. If you had 10 times as many observations, but the same proportions, then this difference would be statistically significant at the 0.05 level.
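
For anyone who wants the steps spelled out, here is a minimal Python sketch of a pooled two-proportion z-test on the click numbers. It assumes the 40,000 pageviews split evenly into 20,000 per side; the function name `two_proportion_z_test` is my own, not from any library.

```python
import math

def two_proportion_z_test(successes1, n1, successes2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    # Under the null hypothesis both groups share one rate, so pool them.
    pooled = (successes1 + successes2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal distribution
    return z, p_value

z, p = two_proportion_z_test(500, 20_000, 480, 20_000)
print(f"original data: z = {z:.3f}, p = {p:.4f}")

# Same click rates, ten times as many pageviews.
z10, p10 = two_proportion_z_test(5_000, 200_000, 4_800, 200_000)
print(f"10x the data:  z = {z10:.3f}, p = {p10:.4f}")
```

The first call reproduces a p-value of about 0.52; the second drops just under 0.05, which is where the "10 times as much data" figure comes from.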
posted by shadow vector at 7:09 PM on December 8, 2008

For apple pie, you would convert those raw numbers into proportions.

0.03968 of one side ordered apple pie.
0.04007 of the other group ordered apple pie.

This difference is discernible or significant at a 95% level.

But.

This doesn't mean that the difference matters. "Statistically significant" means only "can be distinguished from exactly zero." It has ABSOLUTELY NOTHING WHATSOEVER to do with whether the thing being observed is substantively important.

This is especially important with a ludicrous sample like 10000000. With samples that large, essentially any difference will be statistically significant. Including all sorts of things that don't make a damn bit of difference in the real world.
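
A quick Python sketch of that distinction, using the apple pie counts from the question: it estimates the size of the difference and a rough 95% interval around it, rather than just a yes/no on significance. The interval uses the ordinary normal approximation, which is an assumption on my part but should be fine at this sample size.

```python
import math

n = 5_000_000  # customers per greeting, from the question
welcome_pies, hello_pies = 198_400, 200_345

p_welcome = welcome_pies / n
p_hello = hello_pies / n
diff = p_hello - p_welcome

# Unpooled standard error of the difference, used for the confidence interval.
se = math.sqrt(p_welcome * (1 - p_welcome) / n + p_hello * (1 - p_hello) / n)
low, high = diff - 1.96 * se, diff + 1.96 * se

print(f"difference = {diff:.6f} ({diff * 100:.3f} percentage points)")
print(f"approx. 95% interval: {low:.6f} to {high:.6f}")
print(f"extra pies per 1,000 customers: {diff * 1000:.2f}")
```

The interval excludes zero, so the difference is "significant", but it amounts to well under one extra pie per thousand customers. Whether that matters is a business question, not a statistical one.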
posted by ROU_Xenophobe at 7:10 PM on December 8, 2008

I read the clarification you wrote and it sounds like you're interested in comparing two proportions (500 clicks / 20,000 page views and 480 clicks / 20,000 page views). You can compare two proportions using a two proportion Z-test like the one found on this page.
posted by eisenkr at 7:16 PM on December 8, 2008

ROU_Xenophobe destroys the efficacy of 50% of all social science research with one cogent comment.
posted by Crotalus at 10:40 PM on December 8, 2008

Heya shadow vector - it sounds like you are getting at the heart of my question but I can't follow your answer. I need a really dumbed-down and step by step version of the stuff following

>>"Let me elaborate on what this is doing."

eisenkr - I love that page. Just plug in the numbers and get a yes/no. What's it doing?
posted by scarabic at 9:37 AM on December 9, 2008

eisenkr - why do I report 500 clicks / 20,000 pageviews instead of 40,000?

It's not like there were two distinct groups of 20,000, and in one 500 clicked, and in the other 480 clicked. There's one big group. Is it valid to treat this like there are two groups?
posted by scarabic at 9:41 AM on December 9, 2008

Hi scarabic,

In your apple pie / click-or-not example, you can get some idea about the uncertainty from "counting statistics" or "Poisson statistics." If you count N "independent" events (the classic example is radioactive decays), the standard deviation (or "sigma", or "σ") is √N. So if you have 500±22 clicks on the left and 480±22 clicks on the right, that's a one-sigma difference; you expect a meaningless one-sigma difference about one time in three (that is, about 68% confidence), so it doesn't mean much. In fact, if your choice to display on the left or on the right was random and unbiased, you'd expect 20,000±140 views in each group, which dilutes your difference a little more. Somebody above came up with p = 0.5 (i.e. 50% confidence) which sounds about right.

The counting error on 200,000 apple pies is about ±450, so a difference of 1900 pies is pretty big. I guess if you add the independent errors "in quadrature" the error on the difference is ±√( 450^2 + 450^2 ) ≈ 630, so 1900 pies is different from zero at the 3σ = one-in-a-thousand level. Here you'd definitely have to include the counting error on the 5,000,000±2,000 "hello"s, since you don't trust your cashiers' self-control, but the difference is still big enough you might reasonably expect to see it again. If you hadn't just made the numbers up, that is.
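
The counting-statistics arithmetic can be sketched in a few lines of Python. This is only the back-of-envelope version: it treats each count as Poisson with error √N, adds the two errors in quadrature, and ignores the extra uncertainty on how many views or greetings each group got. The helper name is my own.

```python
import math

def count_difference_in_sigmas(count_a, count_b):
    """Difference of two counts in units of its Poisson (sqrt-N) uncertainty."""
    sigma_a = math.sqrt(count_a)
    sigma_b = math.sqrt(count_b)
    sigma_diff = math.sqrt(sigma_a**2 + sigma_b**2)  # add errors in quadrature
    return (count_a - count_b) / sigma_diff, sigma_diff

clicks_sigmas, clicks_err = count_difference_in_sigmas(500, 480)
pies_sigmas, pies_err = count_difference_in_sigmas(200_345, 198_400)

print(f"clicks: 20 +/- {clicks_err:.0f}, i.e. {clicks_sigmas:.2f} sigma")
print(f"pies: 1945 +/- {pies_err:.0f}, i.e. {pies_sigmas:.2f} sigma")
```

Once the two errors are combined, the click difference is well under one sigma, while the pie difference sits at about three.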

There's a subtle point about confidence intervals that eludes some real live scientists (but not ROU_Xenophobe). Usually in biological and medical studies, and in the social sciences, "statistically significant" is used interchangeably with "95% confidence" or "2σ difference." A five percent wrong-answer rate is one wrong result out of twenty. So if you take a big pile of data and look for 60 different nonsense correlations, like whether your clickthrough rate depends on the number of visible sunspots or on how long till the next bus, you'll come up with about three "significant" correlations. So if your method is to look until you find something interesting, you'll eventually find something "significant," but that doesn't mean it will happen again.
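
The arithmetic behind "about three" is short enough to sketch in Python. It assumes the 60 tests are independent and that none of the correlations is real, which is the nonsense-correlation scenario described above.

```python
alpha = 0.05   # chance a single test of a nonexistent effect comes up "significant"
n_tests = 60   # number of nonsense correlations you go looking for

expected_false_positives = alpha * n_tests
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one: {prob_at_least_one:.3f}")
```

So with 60 looks at pure noise you would expect about three false alarms, and you are very likely to get at least one.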
posted by fantabulous timewaster at 1:53 PM on December 10, 2008

I was thinking about counting statistics on AskMeFi earlier today, writing about how many hard drives you'd have to buy before you could say whether one manufacturer is better than another. That's 2.0±1.4 essays about counting statistics in a day, not significantly more than the 0±0 essays about counting statistics I wrote every other day this month.
posted by fantabulous timewaster at 2:04 PM on December 10, 2008
