Statistical relevance test for A/B testing?
April 8, 2009 1:20 PM

StatisticsFilter: I want to implement A/B testing on my website. What is the relevance test to determine at what point I'm getting valid results?

I'm familiar with confidence intervals & calculate them all the time, but I don't think confidence intervals are what I need. (Confidence intervals are the one statistics trick I know, and I know I way over-use them.)

If I wanted to calculate average dollars spent with offer A vs offer B - all other things being equal (such as clickthroughs & purchases), I'd use confidence intervals.

But if I simply want to calculate clicks, what is the proper test?

I recently ran a few different headlines for an email blast, each going to about 1,700 people. I assumed 1700 people would be a good representative sample. (We have an opt-in mailing list of well over 10,000.)

Let's take three headlines, A, B & C.

Headline A: 1727 recipients, 484 opens, 82 clicks
Headline B: 1725 recipients, 565 opens, 121 clicks
Headline C: 1718 recipients, 558 opens, 100 clicks

My question is:

* Next time, how can I calculate how big a test I need to run?

* How do I calculate the validity after the test has run - how do I know for sure that headline A is better than headline B?

* Brownie points if you can explain how to do this in Excel or Gnumeric.

I suspect I may need to perform an ANOVA, but I'm not sure.

PS - I have the excellent book "Statistics for people who think they hate Statistics: Excel Edition", so I can work my way through the proper chapter once I know which test of relevance I need. I just need to get un-confused about which test to run. The section on which test to choose is a bit vague, or at least I don't know which test to choose for this situation.
posted by Muffy to Work & Money (6 answers total) 3 users marked this as a favorite
 
Response by poster: Oh, I'm aware of Google Website Optimizer, but I took the tour & it looks like they give you only very general results, and I want to be able to use this across media, such as print ads, banner ads, and so forth, things Google Website Optimizer can't help me with.
posted by Muffy at 1:36 PM on April 8, 2009


Best answer: Sample size and confidence interval calculators
posted by desjardins at 2:22 PM on April 8, 2009
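
For the "how big a test" question, here is a minimal Python sketch of the usual normal-approximation sample size formula for comparing two proportions. The 5% and 7% click rates, the 0.05 significance level, and the 80% power are hypothetical planning values chosen for illustration; none of them come from the thread.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate recipients needed per variant to tell p1 from p2 (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the two binomial variances
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical planning values: a 5% baseline click rate and a hoped-for 7%.
n_needed = sample_size_per_group(0.05, 0.07)
print(n_needed)
```

The required size grows quickly as the difference you want to detect shrinks, so the smallest lift worth acting on is the number to settle first.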


Best answer: I would use the chi-square test since you have a nominal variable (A vs. B). Plugging your data into the example text from the Wikipedia article:

For example, to test the hypothesis that a random sample of 1700 people has been drawn from a population in which [clicks on A] and [clicks on B] are equal in frequency, the observed number of clicks would be compared to the theoretical frequencies of 101.5 [clicks on A] and 101.5 [clicks on B].

In other words, add up your total of clicks on A and B and divide by two: (82 + 121) / 2 = 101.5.

Here's a Google search for chi square Excel. There are even videos on how to do it.
posted by desjardins at 2:31 PM on April 8, 2009
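
One way to sketch the chi-square idea in Python, using only the standard library. This is a variation on the answer above rather than the same calculation: it treats each recipient as either clicking or not clicking, so the three headlines form a 3 x 2 contingency table and the differing list sizes are accounted for. The helper names `chi_square_sf` and `chi_square_contingency` are made up for this example.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square statistic x with integer df >= 1."""
    # Closed forms for df = 1 and df = 2, then a standard recurrence for higher df.
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, k = math.exp(-x / 2), 2
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

def chi_square_contingency(table):
    """Pearson chi-square test of independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi_square_sf(stat, df)

# Counts from the question: recipients and clicks per headline.
recipients = {"A": 1727, "B": 1725, "C": 1718}
clicks = {"A": 82, "B": 121, "C": 100}
table = [[clicks[h], recipients[h] - clicks[h]] for h in "ABC"]

stat, df, p_value = chi_square_contingency(table)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```

A small p-value here only says the three click rates are unlikely to all be equal; it does not say which headline is the best one.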


You're looking at a Poisson process, since every viewer is deciding whether or not to click your link independently. Poisson statistics are great, because the standard deviation is just the square root of the number of clicks (in this case). So between cases B and C, you have ~ 100 clicks, sigma ~ 10. The results for B and C differ by twice that, so I'd say you have a Statistically Significant Result. Not too complicated.

To be even more accurate, you really have a binomial distribution, but it's well-approximated by the Poisson for large numbers of trials with a low probability of "success" (clicks), i.e., exactly this situation.
posted by kiltedtaco at 6:48 PM on April 8, 2009
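
The rule of thumb above can be checked with a short Python sketch. One assumption worth stating: when two independent counts are compared, the standard deviation of their difference is the square root of their sum, not the square root of one count, so the comparison comes out stricter than eyeballing a single sigma of about 10. The function name `count_difference_z` is made up for this example.

```python
import math

def count_difference_z(n1, n2):
    """z-score and two-sided p-value for the difference of two independent Poisson counts."""
    z = (n1 - n2) / math.sqrt(n1 + n2)      # variance of the difference is n1 + n2
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided tail of the normal approximation
    return z, p

# Click counts from the question (the mailings were nearly the same size).
z_bc, p_bc = count_difference_z(121, 100)   # headline B vs headline C
z_ba, p_ba = count_difference_z(121, 82)    # headline B vs headline A

print(f"B vs C: z = {z_bc:.2f}, p = {p_bc:.3f}")
print(f"B vs A: z = {z_ba:.2f}, p = {p_ba:.4f}")
```

Comparing raw counts like this only makes sense because each headline went to almost the same number of recipients; with unequal list sizes the click rates would need to be compared instead.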


Response by poster: Chi Squared looks promising.
O     E     D     D^2    D^2/E
484   536   -52   2704   5.0
565   536   29    841    1.6
558   536   22    484    0.9
                  Sum D^2/E         7.5
                  ChiDist (7.5,3)   0.057
O = Observed
E = Expected (average)
D = Difference (O-E)
D^2 = Difference squared
D^2/E = Difference Squared / Expected

Looking it up on the Chi Squared table (in the Statistics/Excel book) or using the ChiDist function using 3 degrees of freedom for the 3 rows, I get a level of confidence just below 95%.

That sounds like a good approach.

I'll try Poisson next.
posted by Muffy at 8:27 PM on April 8, 2009
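
The table above can be reproduced with a few lines of standard-library Python. Two things to keep in mind: the observed counts in the table are the opens (484, 565, 558), not the clicks, and a goodness-of-fit test across k categories conventionally uses k - 1 degrees of freedom, which would be 2 here rather than 3. The sketch prints the p-value for both so the effect of that choice is visible.

```python
import math

opens = [484, 565, 558]               # observed opens for headlines A, B, C
expected = sum(opens) / len(opens)    # equal split under the null hypothesis
chi2 = sum((o - expected) ** 2 / expected for o in opens)

# Upper-tail probabilities from closed forms of the chi-square distribution.
p_df2 = math.exp(-chi2 / 2)
p_df3 = (math.erfc(math.sqrt(chi2 / 2))
         + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

print(f"chi-square = {chi2:.2f}")
print(f"p (df = 2) = {p_df2:.3f}")
print(f"p (df = 3) = {p_df3:.3f}")
```

The df = 3 value corresponds to what ChiDist(7.5, 3) returns in a spreadsheet; the same function with 2 as the second argument gives the df = 2 figure.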


Response by poster: I couldn't find anything in the book on Poisson Process, but it does give a handful of other possible relevance tests.

It does say, though, that Chi Squared is probably the only test you will need for this type of scenario, except under specific conditions - too many/few variables, bad sample size, etc.
posted by Muffy at 3:29 PM on April 9, 2009

