Statistical relevance test for A/B testing?
April 8, 2009 1:20 PM
StatisticsFilter: I want to implement A/B testing on my website. What is the relevance test to determine at what point I'm getting valid results?
I'm familiar with confidence intervals & calculate them all the time, but I don't think confidence intervals are what I need. (Confidence Intervals are the one statistics trick I know and I know I way over-use it).
If I wanted to calculate average dollars spent with offer A vs offer B - all other things being equal (such as clickthroughs & purchases), I'd use confidence intervals.
But if I simply want to calculate clicks, what is the proper test?
I recently ran a few different headlines for an email blast, each going to about 1,700 people. I assumed 1700 people would be a good representative sample. (We have an opt-in mailing list of well over 10,000.)
Let's take three headlines, A, B & C.
Headline A: 1727 recipients, 484 opens, 82 clicks
Headline B: 1725 recipients, 565 opens, 121 clicks
Headline C: 1718 recipients, 558 opens, 100 clicks
My questions are:
* Next time, how can I calculate how big a test I need to run?
* How do I calculate the validity after the test has run - how do I know for sure that headline A is better than headline B?
* Brownie points if you can explain how to do this in Excel or Gnumeric.
I suspect I may need to perform an ANOVA, but I'm not sure.
PS - I have the excellent book "Statistics for people who think they hate Statistics: Excel Edition", so I can work my way through the proper chapter once I know which test of relevance I need. I just need to get un-confused about which test to run. The section on which test to choose is a bit vague, or at least I don't know which test to choose for this situation.
Best answer: Sample size and confidence interval calculators
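For the "how big a test" question, here is a minimal sketch of what such a calculator typically computes, assuming the standard normal-approximation formula for comparing two proportions. The function name, the 5% and 7% click rates, the 5% significance level and the 80% power are all illustrative assumptions, not figures from this thread.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate recipients needed per headline to tell rate p1 from p2.

    Uses the normal-approximation formula for a two-sided comparison of
    two proportions. All names and defaults here are illustrative.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the two binomial variances
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)


# Hypothetical planning numbers: a 5% click rate versus a hoped-for 7%.
n_needed = sample_size_per_group(0.05, 0.07)
print(n_needed)
```

The smaller the difference you want to detect, the larger the required group: halving the gap between the two rates roughly quadruples the sample size.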
posted by desjardins at 2:22 PM on April 8, 2009
Best answer: I would use the chi-square test since you have a nominal variable (A vs. B). Plugging your data into the example text from the wikipedia article:
For example, to test the hypothesis that the roughly 1,700 recipients of each headline were drawn from a population in which [clicks on A] and [clicks on B] are equal in frequency, the observed number of clicks on each would be compared to the theoretical frequency, which is the same for both.
In other words, add up your total of clicks on A and B and divide by two: with 82 and 121 clicks, the expected count is 101.5 for each.
Here's a Google search for chi square Excel. There are even videos on how to do it.
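A minimal sketch of that calculation for headlines A and B, using only the Python standard library. It assumes both headlines went to about the same number of people (1727 and 1725 here), so equal expected clicks is a reasonable null hypothesis. The function name is made up for illustration. With two categories there is one degree of freedom, for which the chi-square tail probability can be written with the complementary error function.

```python
import math


def chi_square_equal_split(count_a, count_b):
    """Chi-square test that two click counts come from equal rates.

    Assumes both groups received (nearly) the same number of emails.
    Returns the chi-square statistic and its p-value for 1 degree of freedom.
    """
    expected = (count_a + count_b) / 2  # "add up and divide by two"
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi_square_equal_split(82, 121)  # clicks for headlines A and B
print(round(chi2, 2), round(p_value, 4))
```

A p-value below 0.05 is the conventional threshold for calling the difference between A and B statistically significant.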
posted by desjardins at 2:31 PM on April 8, 2009
You're looking at a Poisson process, since every viewer is deciding whether or not to click your link independently. Poisson statistics are great, because the standard deviation is just the square root of the number of clicks (in this case). So between cases B and C, you have ~ 100 clicks, sigma ~ 10. The results for B and C differ by twice that, so I'd say you have a Statistically Significant Result. Not too complicated.
To be even more accurate, you really have a binomial distribution, but it's well-approximated by the Poisson for large numbers of trials with a low probability of "success" (clicks), i.e., exactly this situation.
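A rough sketch of the square-root rule, with one refinement that is an assumption added here rather than something stated above: when comparing two independent Poisson counts, the variance of their difference is the sum of the two counts, so the difference is divided by sqrt(n1 + n2) rather than by the sigma of a single count. The function name is illustrative.

```python
import math


def poisson_difference_z(count_1, count_2):
    """Compare two independent Poisson counts.

    Each count has standard deviation sqrt(count); the difference of two
    independent counts has standard deviation sqrt(count_1 + count_2).
    Returns the z-score and a two-sided p-value from the normal approximation.
    Assumes at least one count is non-zero.
    """
    sigma_diff = math.sqrt(count_1 + count_2)
    z = (count_1 - count_2) / sigma_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


z_bc, p_bc = poisson_difference_z(121, 100)  # headline B vs headline C clicks
z_ba, p_ba = poisson_difference_z(121, 82)   # headline B vs headline A clicks
print(round(z_bc, 2), round(p_bc, 3))
print(round(z_ba, 2), round(p_ba, 3))
```

Under this version of the calculation, B versus C comes out at about 1.4 standard deviations, which is short of the usual cutoff of roughly 2, while B versus A is about 2.7.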
posted by kiltedtaco at 6:48 PM on April 8, 2009
Response by poster: Chi Squared looks promising.
O     E     D    D^2    D^2/E
484   536   52   2704   5.0
565   536   29   841    1.6
558   536   22   484    0.9
Sum D^2/E               7.5
ChiDist (7.5,3)         0.057
O = Observed
E = Expected (average)
D = Difference (O-E)
D^2 = Difference squared
D^2/E = Difference Squared / Expected
Looking it up on the Chi Squared table (in the Statistics/Excel book) or using the ChiDist function using 3 degrees of freedom for the 3 rows, I get a level of confidence just below 95%.
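The arithmetic in the table can be reproduced with a short standard-library sketch. The helper `chi2_sf` is a hypothetical name for the upper-tail probability that Excel's ChiDist returns. One caveat offered as an assumption rather than a correction of the post: the textbook goodness-of-fit test for k categories uses k - 1 degrees of freedom, which would be 2 for three headlines, so the sketch computes the p-value both ways.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (like Excel's ChiDist).

    Uses the closed-form series that exists for integer degrees of freedom.
    """
    half = x / 2
    if df % 2 == 0:
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    terms = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


opens = [484, 565, 558]             # observed opens for headlines A, B, C
expected = sum(opens) / len(opens)  # unrounded average (the table rounds to 536)
chi2 = sum((o - expected) ** 2 / expected for o in opens)

p_df3 = chi2_sf(chi2, 3)               # 3 degrees of freedom, as in the post
p_df2 = chi2_sf(chi2, len(opens) - 1)  # k - 1 = 2 degrees of freedom
print(round(chi2, 2), round(p_df3, 3), round(p_df2, 3))
```

With 2 degrees of freedom the p-value falls below 0.05, so the choice of degrees of freedom changes which side of the 95% line this result lands on.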
That sounds like a good approach.
I'll try Poisson next.
posted by Muffy at 8:27 PM on April 8, 2009
Response by poster: I couldn't find anything in the book on Poisson Process, but it does give a handful of other possible relevance tests.
It does say, though, that Chi Squared is probably the only test you will need for this type of scenario, except under specific conditions - too many/few variables, bad sample size, etc.
posted by Muffy at 3:29 PM on April 9, 2009
posted by Muffy at 1:36 PM on April 8, 2009