# How big should my random sample be?

August 10, 2006 8:25 PM

Appropriate sample size: How many records do I have to check to achieve a given level of confidence in a hypothesis?

Forgive the sloppy wording of this question - my understanding of statistical methods and terminology is poor at best.

I have two systems (New and Old) measuring the same thing, and I am checking the output of New against the output of Old. I am doing this to determine whether New is ready to replace Old.

Where New outputs the same value as Old, I don't test either value: I make the assumption that both systems measured correctly (unwarranted but unavoidable under current constraints).

Where the output of New is different from the output of Old, I run a third test by hand to determine which system is correct, New or Old. The hand test takes about 5 minutes of cross-checking which cannot be automated in the time I have available.

I don't wish to hand-test all 600-odd instances where New and Old fail to agree. What I would like to do is take a random sample of those instances and hand-test those to determine whether the New system measures more accurately than the Old system.

My question is, how big should my sample size be to test the hypothesis "New is no more accurate than Old" to, say, a 95% level of confidence?

I think in order for whatever method you're using to be accurate, from the 600 cases you describe, you'll need at least ten examples of New deciding incorrectly, and ten examples of Old deciding incorrectly. That means to be safe you'll probably need to pick at least 30(???) cases to examine.

I'm curious to know how exactly you're going to run this test -- like, what exactly, mathematically, will you do; or alternatively, what software are you going to use? Because I'm having trouble figuring out how your whole test is structured. I might be able to help more with more information; but I'm not a practicing statistician or anything, just an occasional statistics teacher.

posted by evinrude at 9:20 PM on August 10, 2006

So if they are no more accurate than one another in general, there should be a 50% chance of either one being more accurate on a given run. It will be easier to reject hypotheses, so if you want to go with the hypothesis you suggested (equal accuracy), actually test to see if New is more accurate than Old.

Measure, say, the number of times that New is more accurate than Old and call it M. The hypothesis that you are trying to reject is that New beats Old a fraction x of the time and that Old is better a fraction y of the time, where x + y = 1. You can sort of eyeball this part, and say that New is really better if it is better 60% of the time, say. The probability of getting exactly M cases where New is more accurate out of N trials is (N choose M) * x^M * y^(N-M). "N choose M" is just N! / (M! (N-M)!) and is also called the binomial coefficient. The number of trials you need depends on the results you get, since seeing New outperform Old every time is more indicative of a real difference than seeing New outperform Old two thirds of the time.

As a subtlety, when this is done in statistics (and this I only barely know), what you actually calculate for hypothesis testing is something called a "p-value". It's not just the probability of seeing exactly what you saw if your hypothesis were true, but the probability of seeing what you saw or anything more extreme. In this case, more extreme would actually involve seeing New beat Old less often, so you have to include all smaller values of M:

p-value = sum_(m = 0 to M) (N choose m) * x^m * y^(N-m)

If you get a p-value below 0.05, you can say that your results have less than a 5% chance of occurring if the rule you set by x and y were accurate. Since y = 1-x, you should be able to plot the p-value in a spreadsheet or something as a function of x as you take more data. Sorry that there's no good value of N, but this should let you know when to stop.

By the way, you do the summing because it is unlikely that any one outcome will occur. If you flip a coin ten times, it's unlikely that it will show exactly 5 heads and 5 tails (about a 25% chance), but more likely than not that it will show at least 5 heads (about 62%).
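The summing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily the exact calculation intended here: the function name `binomial_p_value_lower` and the example counts are made up, and the tail is taken as "M or fewer wins for New", with each hand-checked case treated as an independent trial.

```python
from math import comb

def binomial_p_value_lower(n_trials, m_wins, x):
    """Probability of seeing m_wins or fewer wins for New in n_trials,
    assuming New wins each trial independently with probability x."""
    y = 1.0 - x
    # Add up the probability of every outcome from 0 wins up to m_wins.
    return sum(
        comb(n_trials, m) * x**m * y**(n_trials - m)
        for m in range(m_wins + 1)
    )

# Hypothetical numbers: 20 hand-checked disagreements, New was right in 6,
# tested against the "no better than a coin flip" rule x = 0.5.
p = binomial_p_value_lower(20, 6, 0.5)
print(round(p, 4))  # about 0.0577, so not quite below 0.05
```

Re-running this with different values of x after each batch of hand checks gives the p-value-versus-x curve suggested above.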

posted by Schismatic at 9:31 PM on August 10, 2006

It depends on the standard deviation. In other words, if there is a lot of 'noise' where one of the tests is like 52% more likely to be right, you need to do a lot of tests. If New is 99% more likely to be right, then you need to do fewer tests.

posted by delmoi at 9:33 PM on August 10, 2006

Ignore the sentence about fixing values of x. As I said later, if you keep a plot of the p-value as a function of x while you're finding more values of New and Old, you'll be able to reach the point where the only part that isn't very small is a region around the most likely value of x. As a perk, if New and Old aren't equivalent, this gives you an idea of how much better one is than the other.

posted by Schismatic at 9:35 PM on August 10, 2006

Can you provide some more information about what you are measuring and how the measurements are taken and what they mean? It's hard to tell what type of statistical test would be appropriate here with the info you've provided. Confidence intervals are built using standard errors, which are built using standard deviation.

This aside, the historical answer to this question is 30. 30 has long been considered the number at which a sample is "sufficiently large" for accurate statistical testing, regardless of population size. Note that this is a historical convention and has no basis in actual statistics, but is generally accepted nonetheless.

Or, on preview, what evinrude and delmoi said.

posted by jtfowl0 at 9:39 PM on August 10, 2006

There's no way to say that they're the same; that's the null hypothesis. Also, you can't know in advance how many trials you'll need (unless you have an estimate of the accuracy).

Here's a quick way of seeing if you can prove that they're different.

The probability of failure, p = #wrong / #trials.

This is a binomial distribution, so the standard error (uncertainty in probability) is sigma = sqrt(p * (1-p) / #trials)

So you want to know, is the difference in estimated probabilities equal to zero (within the calculated uncertainties).

This difference is p1 - p2, and the uncertainty (via propagation of error) is sigma = sqrt(sigma1^2 + sigma2^2). You have to compare the difference to sigma to determine significance. Typically, the difference has to be greater than 2 times sigma to be considered significant. That corresponds to about 95% confidence.
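As a rough sketch of that comparison in Python: the function names and the example counts below are invented for illustration, and the propagation-of-error step treats the two failure rates as independent, which is an assumption that may not hold when both systems are scored on the same sample.

```python
from math import sqrt

def rate_and_sigma(wrong, trials):
    """Estimated failure probability and its binomial uncertainty."""
    p = wrong / trials
    return p, sqrt(p * (1 - p) / trials)

def difference_in_sigmas(wrong1, trials1, wrong2, trials2):
    """How many combined sigmas separate the two failure rates."""
    p1, s1 = rate_and_sigma(wrong1, trials1)
    p2, s2 = rate_and_sigma(wrong2, trials2)
    sigma = sqrt(s1**2 + s2**2)  # propagation of error
    if sigma == 0:
        raise ValueError("no spread in either rate; formula does not apply")
    return (p1 - p2) / sigma

# Hypothetical counts: 40 cases checked, Old wrong in 28, New wrong in 12.
z = difference_in_sigmas(28, 40, 12, 40)
print(round(z, 2), abs(z) > 2)  # 3.9 True
```

With these made-up counts the difference is nearly four sigma, comfortably past the "2 times sigma" rule of thumb.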

On preview, delmoi's link is probably doing this automatically.

posted by Humanzee at 9:48 PM on August 10, 2006

Humanzee might have it...assuming that a binomial distribution is appropriate. Does it matter *how* accurate or inaccurate the measurements are, or is a miss as good as a mile?

posted by jtfowl0 at 9:57 PM on August 10, 2006

Response by poster: Hmm, I think I phrased the question very poorly. What I should have asked (I think) was:

*Out of a total population of 600, how big does my random sample need to be in order that whatever I discover to be true of the sample I could infer to be true of the entire population (and be about 95% certain of it)?*

Setting the hypothesis aside for the moment. It's true I picked the null hypothesis deliberately, but also arbitrarily. I might have started with any hypothesis.

For those of you who asked me for specifics on what I'm measuring, these next five paragraphs are for you.

In inventory management, there is a measurement known as weeks cover. In the simplest terms this is a number that tells you how long (in weeks) you have before a particular product is reduced to nil stock on hand. It's used, amongst other things, to determine if you are overstocked in a particular product.

Weeks cover is usually computed by calculating an average sales rate per week and dividing that number into your current stock on hand.
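In its simplest form, that calculation might look like the sketch below. The function name `weeks_cover`, the choice to average over whatever weeks are passed in, and the handling of zero sales are all assumptions for illustration; none of the business-specific rules are modelled.

```python
def weeks_cover(stock_on_hand, weekly_sales):
    """Weeks of stock remaining at the average weekly sales rate.

    weekly_sales is a list of units sold per week over the averaging window.
    """
    if not weekly_sales:
        raise ValueError("need at least one week of sales data")
    average_rate = sum(weekly_sales) / len(weekly_sales)
    if average_rate == 0:
        # Nothing is selling, so the stock never runs out at this rate.
        return float("inf")
    return stock_on_hand / average_rate

# Hypothetical product: 120 units on hand, four weeks of sales history.
print(weeks_cover(120, [10, 14, 12, 12]))  # 10.0
```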

It sounds simple enough, but in practice there are complicating factors: over what period of weeks do you base the average? do you count weeks in which stock on hand was nil? do you take forecast sales into account? and a dozen other business-specific rules that interact in complex ways such that careful programming is required and the output of the program must be checked, which is what I'm doing.

So in effect, I ask the New and Old systems for the weeks cover of every product in our inventory, and compare them. Where they agree (which they do for about 13000 products), I assume that both got it right.

Where they differ (about 600 products), I examine the raw data used to compile weeks cover for that product to see if some obscure business rule is being ignored or brought into play where it shouldn't be, if the correct set of sales data is being averaged, if the underlying sales and stock data is correct, and so on. I try to figure out why the systems arrived at the answer they did and which one is correct.

posted by Ritchie at 10:46 PM on August 10, 2006

Since the two systems are different, one can infer that their failure rates will be different. They may be very different, or nearly identical. You can't know in advance how large your sample must be.

You can keep a running score though: keep calculating the difference between their estimated failure rates and the value of sigma, and keep comparing them. If at any point the difference in failure rates is statistically significant, you can declare that one is better than the other and stop: you've just found how big your sample has to be.
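A minimal sketch of such a running score in Python follows. The function name, the `(new_wrong, old_wrong)` result format, the 2-sigma threshold default, and the `min_checked` floor are all assumptions made for the example, and the sigma formula again treats the two rates as independent.

```python
from math import sqrt

def running_score(results, threshold=2.0, min_checked=10):
    """Return (n, p_new, p_old) at the first sample size n where the failure
    rates differ by more than threshold * sigma, or None if that never happens.

    results is a sequence of (new_wrong, old_wrong) booleans, one per
    hand-checked case, in the order the cases were checked.
    """
    new_wrong = old_wrong = 0
    for n, (new_bad, old_bad) in enumerate(results, start=1):
        new_wrong += new_bad
        old_wrong += old_bad
        p_new, p_old = new_wrong / n, old_wrong / n
        sigma = sqrt(p_new * (1 - p_new) / n + p_old * (1 - p_old) / n)
        enough = n >= min_checked  # avoid stopping on a handful of cases
        if enough and sigma > 0 and abs(p_new - p_old) > threshold * sigma:
            return n, p_new, p_old
    return None

# Hypothetical hand-check results: Old is the wrong one in 3 cases out of 4.
checks = [(False, True), (False, True), (True, False), (False, True)] * 10
print(running_score(checks))  # (10, 0.2, 0.8)
```

One caveat worth keeping in mind: testing after every case gives chance more opportunities to cross the threshold, so the real false-alarm rate of a stop-as-soon-as-significant rule is higher than the nominal 5%.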

One note: the calculation for uncertainty that I provided above assumes that you've had "many" failures and "many" successes for each test. In practice, "many" is a fuzzy thing. I tend to aspire to 10-20, but even around 5 isn't terrible (when you have fewer counts than "many", the uncertainty is somewhat higher than the simple square-root formula would give).

posted by Humanzee at 8:35 AM on August 11, 2006

Just to add a little to the discussion, remember that there are two ways of thinking about error: There's error that comes from repeated measurements differing from one another, and there's error that comes from repeated measurements, which could be nearly identical, but different from the "true" value.

Without considering the mechanisms behind generating the value, one method you're using could be more variable from week to week in computing the same item, whereas one method could give you numbers that have the same deviation from the true value consistently, yet be further off from the true value.

So remember that reproducibility isn't necessarily the thing to maximize. You have to decide whether you want to minimize the standard deviation of measurements (which would enable better long term forecasting) or if you want to minimize the difference from the true value (which would give you a better number for how much you have on hand at any given moment).

Now that I've thought about it, it would probably be easier to introduce a correction for a method that produced a smaller standard deviation.

posted by Mr. Gunn at 9:24 AM on August 11, 2006

This thread is closed to new comments.

(20 - 1) / 20 = .95 or about an hour and a half of testing.

YMMV IANAMathGeek

posted by crunchyk9 at 8:58 PM on August 10, 2006