# Help me sample accurately!

January 14, 2010 9:01 AM Subscribe

Please provide me with an *effective*, and statistically sound, table/chart listing suggested sample size guidance for associated population size

I remember learning in statistics class that there was a published/pre-defined set of tables suggesting the number of samples to take, depending on your total population size, that would give you the *best* chance of accurate results without having to manually test the entire population. ("T" or "Z" tables maybe? I might be confusing these with another set of tables from stats class :)

For example, if memory serves me correctly, the *best* sample size for estimating the ratio of males to females in America could be obtained by taking a random sample of only a few thousand citizens (under 5,000) across the country.

Thanks in advance, fellow MeFi math geniuses!

I think you're misremembering some things.

Z tables are used to find the probability that a normally-distributed random variable will be greater than some value you care about. They're called Z tables because they rely on the Z statistic ((X - mu)/sigma) to standardize the many possible normal distributions. In a sampling context, you could use a Z table and a known population to find the probability that a random sample from that population will have some quantitative characteristic you care about. If I sample 50 random adult Americans, what's the probability their average weight will exceed 250 pounds?
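That weight question can be sketched in a few lines of Python. The population numbers here are made up for illustration (assume a mean of 180 lb and a standard deviation of 40 lb); the point is the standardization step a Z table encodes:

```python
import math

def prob_sample_mean_exceeds(mu, sigma, n, threshold):
    """P(sample mean > threshold) for a random sample of n drawn from a
    population with mean mu and standard deviation sigma (normal approx.)."""
    se = sigma / math.sqrt(n)          # standard error of the sample mean
    z = (threshold - mu) / se          # standardize -- the Z-table step
    # survival function of the standard normal, via the error function
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

# Hypothetical population: mean 180 lb, sd 40 lb; sample of 50 people
p = prob_sample_mean_exceeds(180, 40, 50, 250)
```

With these assumed numbers the Z statistic is around 12, so the probability is effectively zero, which matches the intuition that a 50-person average of 250 lb would be wildly unlikely.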

You use a t table in the reverse situation -- if you know sample characteristics and want to reason about the population they came from. The t distribution is similar to the normal distribution but takes into account that you don't know the population sigma; you only have the best-guess sample standard deviation.
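A quick sketch of that reverse direction: a confidence interval for a population mean from a small hypothetical sample. The critical value 2.262 (9 degrees of freedom, 95%) is exactly the kind of number you'd read off a t table:

```python
import math
import statistics

def mean_confidence_interval(sample, t_crit):
    """Confidence interval for the population mean from a small sample,
    using the t distribution (t_crit looked up for n-1 degrees of freedom)."""
    n = len(sample)
    m = statistics.mean(sample)
    s = statistics.stdev(sample)       # sample sd: our best guess at sigma
    half_width = t_crit * s / math.sqrt(n)
    return (m - half_width, m + half_width)

# Hypothetical sample of 10 weights (lb); t_crit = 2.262 for df = 9, 95%
weights = [172, 195, 168, 180, 210, 165, 188, 177, 201, 183]
lo, hi = mean_confidence_interval(weights, 2.262)
```

The interval says: populations whose true mean lies outside (lo, hi) would rarely have produced a sample like this one.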

In general, for most statistical purposes you can assume the population is infinite. If your population is very small (e.g., 500), you can apply a finite population correction, which shrinks your standard errors. You'd want it if, say, you took a random sample of US states.

But.

First, you'd need to think about what your population really is. Is it the set of realized, actual US states as they stand on that day? Or is the object of theoretical interest really the set of all possible US states, which will arguably be infinite?

Second, finite-population corrections are pretty small in any case. A sample of 1,500 from a population of 10,000 is about as good as one from 100,000 or 100,000,000 or a million billion squillion brazillion.

Third, the answers to the questions you're asking go more or less like this:

(1) The sample size that will get you the best chance of accurate results is the largest one you can afford to gather. The tradeoff is only cost. A sample of 5,000 isn't better than one of 100,000 in any way at all except that it's cheaper.*

(2) For small populations, just take a census (count everything).

(3) For large populations, there's relatively little value added in samples beyond about 1500.

(4) UNLESS you also want to make inferences about subpopulations. If you want to make inferences about the opinions of adult Americans, you want 1500 adult Americans. If you want to make inferences about the opinions of adult American men and women, you want a sample big enough that you can expect about 1500 men and 1500 women. If you want to make good inferences about the opinions of black Americans and white Americans, you want a sample big enough that you can bet it will include at least 1500 black Americans, which means a sample of about 10000.

*This isn't quite true; very large samples also mean that you have to put more work into interpreting your results, because even very trifling effects can be statistically significant.
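To see why returns diminish around 1,500, here's the standard 95% margin-of-error formula for a proportion (worst case p = 0.5), evaluated at a few sample sizes. Note that the population size never appears, which is why 1,500 works as well for 100 million people as for 10 thousand:

```python
import math

# Margin of error (95% confidence, worst-case p = 0.5) for several
# sample sizes. Population size does not enter the formula at all.
for n in (100, 500, 1500, 5000, 100000):
    moe = 1.96 * math.sqrt(0.25 / n)
    print(f"n = {n:6d}: +/- {moe:.1%}")
```

Going from 1,500 to 5,000 respondents only trims the margin from about 2.5% to about 1.4%, at more than three times the cost.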

posted by ROU_Xenophobe at 9:34 AM on January 14, 2010

Response by poster: If it helps, my intended application right now is some formulas I'm testing in MS Excel. I have about 50 columns by 3,000 rows of data and would like to make sure my formulas (there are 3,000 cells of formulas) are calculating correctly. I need to select a sample of the 3,000 formulas to manually do the math on, as I'd rather not hand-calculate the entire lot myself :)

posted by thankyoumuchly at 10:05 AM on January 14, 2010

How many actually independent formulas are there?

posted by ROU_Xenophobe at 10:17 AM on January 14, 2010

Response by poster: 1 unique formula whose variables build on the previous formula's result; 3,000 formulas in effect. The data across the 3,000-row x 50-column matrix is extremely varied, thus my wanting a sampling solution to increase my reliance on the formulas as almost perfectly accurate :)

posted by thankyoumuchly at 10:39 AM on January 14, 2010

Response by poster: @desjardins - The "Sample size calculator" link you posted is exactly what I'm looking for! Thank you.

What's the justification of the math behind it?

posted by thankyoumuchly at 10:57 AM on January 14, 2010

@thankyoumuchly: the justification for the mathematics behind the sample size calculator can be found in the Wikipedia link I posted in the first comment.
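For what it's worth, calculators like that typically implement the standard sample-size formula for estimating a proportion, plus a finite population correction. A sketch (assuming 95% confidence and the worst-case p = 0.5; the function name and defaults are mine, not the calculator's):

```python
import math

def sample_size(margin_of_error, confidence_z=1.96, p=0.5, population=None):
    """Sample size needed to estimate a proportion to within margin_of_error.
    p = 0.5 is the worst case; passing a population size applies the
    finite population correction."""
    n0 = confidence_z**2 * p * (1 - p) / margin_of_error**2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)   # finite population correction
    return math.ceil(n0)

print(sample_size(0.03))                    # effectively infinite population
print(sample_size(0.03, population=3000))   # correction shrinks it somewhat
```

For a 3% margin you need about 1,100 from an unlimited population, and a bit under 800 from a population of 3,000, which is why the calculator asks for a population size.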

posted by handee at 10:59 AM on January 14, 2010

Maybe someone who does QA will know otherwise, but this doesn't sound at all like a sampling problem to me.

If the thing you're doing isn't vital, then just take a few convenient test input values, figure out their output values, and check that way. If the formula returns the correct outputs for those inputs, it's working.

If the thing you're doing is actually vital in some way and you want to feel certain about it, then check each cell manually. Note that this will mean that, really, the spreadsheet is checking your math more than the other way around.

Or, if you just want to feel more certain, break the (presumably big and complex) formula down into subformulas in several cells, and then aggregate in whatever the appropriate way is across those intermediate cells. The advantage here would be that you'd be better able to see "crazy" values in a column that's X/Y or A+B than you would a big complex equation. Then check each part separately along the lines of the first thing, and check the aggregation function separately, using convenient input values.
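As a sketch of that idea (the cascading formula here is hypothetical, a stand-in for the poster's spreadsheet chain): compute each cell a second, independent way and compare, so any divergence points at the first bad cell rather than requiring you to hand-check all 3,000.

```python
def running_formula(values):
    """The formula under test: each result = previous result + row value,
    mimicking a cell that builds on the cell above it."""
    results = []
    total = 0.0
    for v in values:
        total += v
        results.append(total)
    return results

def check(values, results, tol=1e-9):
    """Independent verification: recompute each cell from scratch and
    return the index of the first mismatch, or None if all agree."""
    for i, r in enumerate(results):
        expected = sum(values[: i + 1])
        if abs(r - expected) > tol:
            return i
    return None

data = [3.5, -1.2, 4.0, 0.7]
assert check(data, running_formula(data)) is None
```

Because each cell depends on the one before it, a single early error corrupts everything downstream -- which is exactly why a full independent check beats sampling here.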

posted by ROU_Xenophobe at 11:02 AM on January 14, 2010

thankyoumuchly, just because there exists a formula called "sample size calculator" does not mean that it is a useful thing for you to use, or that sampling from your data is a useful way to check the accuracy of your formula. But whatever you do almost certainly won't pick my pocket or break my leg, so whatever.

posted by ROU_Xenophobe at 11:06 AM on January 14, 2010

Reading your further comments, I'm with ROU_Xenophobe on the appropriateness of the sampling here. You're not interested in extrapolating from a population, you're interested in verifying the correctness of some code. Break it down into small pieces and test each bit. Work out what the "edge cases" are, and test those.

posted by handee at 11:14 AM on January 14, 2010

This thread is closed to new comments.
