# Let me not neglect that base rate....
July 11, 2018 2:08 PM

Estimating base rates from small sample sizes -- how large a sample do I need to reliably (confidently?) establish the base rate?

I'm looking for a quick guide to how to estimate a base rate. Specifically, in the case where there are few data from which to calculate a base rate, what are the error bars around the base rate estimate? I once found a table that reported 95% confidence error bars on base rate estimates by sample size. Of course I can't find it now.

Does the confidence of the base rate estimate depend only on the sample size? Or does the 'true' (population) base rate make a difference as well? That is, for low base rates (say, <10%), do you need larger samples to confidently measure base rate?

I'm having a hard time finding a relatively low level guide and explanation for this. Most of what I find talks about the consequences of base rate neglect... I don't want to neglect base rates, but our data sets are generally small ...
posted by bumpkin to Science & Nature (7 answers total) 2 users marked this as a favorite

I'm likely to get myself into trouble because this is one of those things like where psychologists have umpty zillion different kinds of t-tests but:

I assume you have a DV that is either directly binary (it happened or it didn't), or you can easily transform it into "it happened" versus "it didn't." Likewise, you have some primary variable that's your "treatment" variable.

The base rate is just the proportion of cases where "it happened," looking only at the cases where neither you nor nature imposed the "treatment." If you're in a frequentist world, you can put a confidence interval around it the same way as any other sample mean, xbar +/- t_025 SEs. --BUT-- you should avoid the special formula for proportions as it assumes a large N.

Does the confidence of the base rate estimate depend only on the sample size? Or does the 'true' (population) base rate make a difference as well? That is, for low base rates (say, under 10%), do you need larger samples to confidently measure base rate?

The proportion itself matters too, but it's the sample value that matters; you never get to know what the population value is. Your intuition here is backwards -- confidence intervals are widest for proportions at 0.5 and get narrower as you approach zero or one.

Different fields and subfields have different notions of what best practices are and you should inquire with your broader colleagues as to what *your* standard of practice is.
posted by GCU Sweet and Full of Grace at 3:00 PM on July 11, 2018 [1 favorite]
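
For anyone who wants to see how the error bars move with both the sample size and the observed rate, here is a minimal sketch. It uses the Wilson score interval, which is a swapped-in choice (not the t-based interval described above) that is commonly suggested for small samples. The function name `wilson_interval` and the sample sizes in the demo loop are illustrative assumptions, not anything from this thread.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple:
    """Wilson score interval for a binomial proportion (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Illustrative grid: a few sample sizes and observed rates.
    for n in (20, 50, 100, 500, 1000):
        for p_hat in (0.05, 0.10, 0.50):
            k = round(p_hat * n)
            lo, hi = wilson_interval(k, n)
            print(f"n={n:5d}  observed={k / n:.2f}  "
                  f"95% CI=({lo:.3f}, {hi:.3f})  width={hi - lo:.3f}")
```

Under these assumptions the absolute width is largest near 0.5, but the width relative to the rate itself is much larger for low rates, which may be what the original intuition was getting at.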

Does the confidence of the base rate estimate depend only on the sample size?
No; of course not. It must also depend on the population size, otherwise you don’t know how well you’re covering the population. In the limit, sample size equals population size, and your only errors are in measurement.

If you can tell us a bit about your general (sub)field or what your actual data and population looks like we may be able to help better. But as it stands, ‘base rate’ is just another statistic, so all we can do is point you to general discussions of confidence intervals.
As you can see, your methods should depend on the nature of the data, what assumptions you are willing to make about the nature of the true distribution, the sample size, the population size, the error in measurement, etc.
(I am a mathematician and probabilist by training, so I am exactly the sort who will look for and call out the type of "trouble" that GCU indicates above. There may be plenty of reasonable things to do here, but there is certainly not one reasonable thing to do in general :)
posted by SaltySalticid at 3:07 PM on July 11, 2018 [1 favorite]

In the meantime, I've kept digging.... and I can ask a better version of the question.

First, I've been sloppy with using the word 'confidence'. I want to calculate the margin of error, given a confidence (say 95%), and given a sample size.

Yes it's binary. Like a coin flip -- either heads or tails.

The population is arbitrarily large (what I've read suggests that this stops making a difference over, say N=100000). What I'm trying to determine is something like the odds of tails with an unfair coin, or the true underlying odds of a slot machine.

GCU_Sweet... yeah, I just discovered that and it makes sense now that I'm (beginning) to understand how to think about it.

SaltySalticid: I'm trying to estimate frequency.

I understand that the margin of error will depend on the underlying or true rate/frequency, the confidence level I require, and the sample size. I have found a calculator. My question is, I guess now: can you point me to clear, fairly low level (as you can see, I'm pretty uneducated on this) explanations or tutorials or guides or suchlike?
posted by bumpkin at 3:33 PM on July 11, 2018 [1 favorite]
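
A rough sketch of what such a calculator is probably doing, assuming the textbook normal approximation for a proportion (margin = z * sqrt(p(1-p)/n)). That approximation is an assumption here and gets unreliable for small samples or rates near 0 or 1. The function names are made up for illustration.

```python
import math
from statistics import NormalDist


def margin_of_error(p: float, n: int, confidence: float = 0.95) -> float:
    """Normal-approximation margin of error for a proportion p at sample size n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * math.sqrt(p * (1.0 - p) / n)


def required_sample_size(p: float, margin: float, confidence: float = 0.95) -> int:
    """Smallest n whose normal-approximation margin is at most `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


if __name__ == "__main__":
    # Same absolute margin (+/- 3 points) at two different rates.
    print(required_sample_size(0.50, 0.03))
    print(required_sample_size(0.05, 0.03))
    # Same relative margin (+/- 20% of the rate) at the same two rates.
    print(required_sample_size(0.50, 0.10))
    print(required_sample_size(0.05, 0.01))
```

The point of the last two calls: for a fixed absolute margin a low rate needs fewer samples, but for a fixed margin relative to the rate it needs far more.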

I want to calculate the margin of error, given a confidence (say 95%)

In that context a margin of error is a confidence interval; literally, absolutely, exactly and by definition the same thing.

Trying to be gentle: Why are you doing this? What's the goal that you'll serve by spending time and money doing this? I ask because, while I will happily admit that my field doesn't usually worry too much about this, it's hard to see why getting a particularly exact estimate of the base rate would be important. You have a base rate from your data and can use the bottom end of a 90/95/99 percent confidence interval around it as the worst credible case to determine how large your sample size would have to be to stand a reasonable chance of uncovering an effect of a certain size, or to determine how large a treatment effect you would have to see to be able to statistically discern it. I am a little concerned that time and money spent better estimating a base rate would be much better spent just increasing your sample size overall.

The only circumstance I can think of where a margin of error would be different than a confidence interval you put around xbar is that the margins of error that are reported for poll proportions are almost always the margins of error for 50%, not the actual margin of error for a specific sample proportion. This is fine as the 50% mark is the worst case.
posted by GCU Sweet and Full of Grace at 4:00 PM on July 11, 2018

What kind of rate? If you are talking about a rate of occurrences per time, the most common analysis is via the Poisson distribution.

If you are talking about a rate that is a percentage, e.g. what percent of patients die of a disease, then it's a binomial distribution.
posted by SemiSalt at 4:49 PM on July 11, 2018 [1 favorite]

It must also depend on the population size, otherwise you don’t know how well you’re covering the population.

Wait, really? I thought for reasonably large populations and reasonably large samples, measuring the rate of non-rare events doesn't really depend on coverage. E.g., most familiarly polling: Asking 1000 people about who they are voting for for president gives you about +/- 3%. To get the same margin of error for a congressional district--which is much smaller--you also need to ask 1000 people, right?

(Same with clinical trials, which is close to my professional life. You don't base sample size on patient population so much as on expected effect size.)
posted by mark k at 8:56 PM on July 11, 2018

Yes, population size only matters for extraordinarily small populations. A population-of-interest of about 10000 might as well be infinite.
posted by GCU Sweet and Full of Grace at 4:09 AM on July 12, 2018 [1 favorite]
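
To put rough numbers on the population-size point, here is a small sketch that applies the standard finite population correction, sqrt((N - n) / (N - 1)), to the normal-approximation margin of error. The function name and the population sizes in the demo are illustrative assumptions, and the sketch assumes simple random sampling without replacement.

```python
import math
from statistics import NormalDist


def margin_with_fpc(p: float, n: int, population: int, confidence: float = 0.95) -> float:
    """Normal-approximation margin of error with finite population correction."""
    if not 0 < n <= population:
        raise ValueError("need 0 < n <= population")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    se = math.sqrt(p * (1.0 - p) / n)
    # The correction shrinks toward 0 as the sample approaches the whole population.
    fpc = math.sqrt((population - n) / (population - 1)) if population > 1 else 0.0
    return z * se * fpc


if __name__ == "__main__":
    # A sample of 1000 at the worst-case rate of 50%, for populations of very different sizes.
    for population in (10_000, 700_000, 300_000_000):
        m = margin_with_fpc(0.5, 1000, population)
        print(f"N={population:>11,d}  margin=+/-{100 * m:.2f} points")
```

In this sketch the margin for 1000 respondents stays close to 3 points across all three population sizes, and it only drops to zero when the sample is the entire population.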
