Extremely basic statistical question
February 24, 2015 11:03 AM

My state has 709 zip codes. I want to draw reasonably well-grounded conclusions about the entire population of my state (having to do with insurance coverage and ACA subsidies, if you're interested). How many zip codes should I sample?

I can get data for each zip code, but doing so is time-consuming enough that I don't want to do it for all 709. I want to draw conclusions about the entire state from the pooled data for the zip codes I do collect. Question: how many zip codes do I need data from in order to have a reasonable degree of confidence?

Please let me know in the comments if you need additional information.

posted by Mr. Justice to Science & Nature (18 answers total) 3 users marked this as a favorite

It depends entirely on what exactly you want to do and how precise you want to be. The answer might range anywhere from "Looking at zip code data is just not a valid way to answer that question" to "20-odd if you're willing to make very broad and imprecise statements" to "Around 250 if you want to make reasonably precise statements."

There aren't canned data for counties?
posted by ROU_Xenophobe at 11:12 AM on February 24, 2015 [2 favorites]

(er, that was assuming zip codes are roughly equal in population, which they probably aren't)
posted by ROU_Xenophobe at 11:12 AM on February 24, 2015 [1 favorite]
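
One way to see how numbers in that range can arise is the textbook sample-size formula for a proportion with a finite population correction. This is only a sketch under loud assumptions that are not in the thread: the zip code is the sampling unit, zip codes are treated as interchangeable, the outcome is a yes/no proportion, and the confidence level is 95%. The function name and parameters are made up for illustration.

```python
import math

def zip_codes_needed(population_size, margin, z=1.96, p=0.5):
    """Sample size for estimating a proportion to within +/- margin.

    Assumes a simple random sample of equally weighted units and the
    worst-case proportion p = 0.5 (both are illustrative assumptions).
    """
    # Size needed if the population were effectively infinite.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: sampling a large share of only
    # 709 units reduces the number actually required.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

for margin in (0.20, 0.10, 0.05):
    print(f"margin +/-{margin:.0%}: {zip_codes_needed(709, margin)} zip codes")
```

Under these assumptions a margin of plus or minus 20 points needs 24 zip codes and a margin of plus or minus 5 points needs 250. Unequal zip code populations, as noted above, would change the picture.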

My stats professor would say: at least 30.
posted by ThePinkSuperhero at 11:17 AM on February 24, 2015

Yeah; there's no "one size fits all" answer here. Basically, you want to sample enough zip codes that your sample mirrors the population of the state as closely as possible. So, for example, 30 may be enough, but you'd better make sure that those 30 sample zip codes are carefully chosen.

There are tons and tons of factors that may affect insurance coverage from average household income to education level to rural vs. urban environments to political leanings (?) to etc. etc. etc. For that reason, you need to make sure that you have a sample that accurately reflects the diversity of your state. This diversity can take many forms and a good sample will reflect as many of those diversity categories as possible. The flip side of this is that by MAKING these choices, you are likely to also exclude some forms of diversity.

So, almost as important as choosing a diverse sample is understanding the limitations of your sample and how those limitations affect the conclusions you come to. This isn't necessarily a "flaw"; it's just something that you need to be aware of.
posted by Betelgeuse at 11:23 AM on February 24, 2015 [3 favorites]

It's hard to answer this without knowing what exactly you are asking. There are calculators that can help you answer this question, and I found this site has a pretty clear description, but it might be too simplified for your needs - your zip codes will likely have many differences, and you might be asking multiple questions.
posted by fermezporte at 11:27 AM on February 24, 2015

Some zip codes will be full of wealthy people, some zip codes will be full of poor people, some zip codes will have people who are more educated than people in the rest of the state. Some zip codes will have more families, others will have more elderly people living alone. Some zip codes will be densely populated, some will be sparsely populated.

Think of it this way: if you picked 20% of the zip codes in NY State, but all of those zip codes were upstate in the Adirondacks and you had none in New York City, would your sample be valid for all of New York State? No. What if you picked a bunch of zip codes from the Upper East Side, would your sample be valid for all of New York City? No.

It's not just a matter of picking the right number of zip codes. It's a matter of understanding the demographics of your state. There's no shortcut here. You can't just pick a certain number of zip codes and be all set.
posted by alms at 11:43 AM on February 24, 2015 [8 favorites]

What Betelgeuse and ROU_X said. But if the rate-limiting step is, for instance, copying data off of a slow website by hand, you would probably be better off spending your time automating those tedious bits (using something like BeautifulSoup in Python) as opposed to spending a lot of time trying to come up with the best way to sample by zip code.
posted by en forme de poire at 11:54 AM on February 24, 2015 [2 favorites]

It might help if you can tell us more about the question you're trying to answer. Are you looking for a specific correlation (e.g. between poverty and insurance coverage)? Or something more broad? I'm struggling to think of something related to health insurance that wouldn't be affected by demographics, population density, etc., so I agree with the cautions in the comments above.
posted by desjardins at 12:03 PM on February 24, 2015

You can do it in steps.

1. Determine the distribution of population by zip code.
2. Determine the breakdown of population of the state by income, age, other pertinent factors.
3. Then take a sample, depending on those distributions.

There are other factors, such as which providers offer services in each zip code and how that affects adoption.

For example, in Atlanta, GA where I live, I can choose from over 100 providers. In my MIL's county in rural KY, she can only choose from 3 providers.

That sort of thing.
posted by Ruthless Bunny at 1:09 PM on February 24, 2015

I'd like to respectfully disagree on this point:

It's not just a matter of picking the right number of zip codes. It's a matter of understanding the demographics of your state. There's no shortcut here. You can't just pick a certain number of zip codes and be all set.

Yes, you can if you do it randomly. You won't be "set" in the sense that it's impossible you got all rich or all Democratic zip codes, but you will be set in the sense that you know this would be unlikely. This is the entire point of classical statistics: as your sample size increases, the statistics of the sample are more likely to be close to the statistics of the population. That holds whether the samples are cookies or people or zip codes.

The subtle thing is that this lets you generalize to the zip codes in the state, not the people in the state. So as other people have mentioned, things get a bit more complex as you (probably) need to weight the measurement from each zip code by its population.

'Confidence interval' is what you need to read up on here.
posted by cogitron at 1:28 PM on February 24, 2015 [10 favorites]
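
As a rough illustration of the confidence-interval idea, here is a minimal sketch that draws simple random samples of zip codes and reports a 95% interval for the average zip-level rate. The uninsured rates are made up, the helper name is hypothetical, and the interval uses a normal approximation. As the comment above points out, this describes the average zip code, not the average resident.

```python
import math
import random
import statistics

def mean_with_ci(sample, population_size, z=1.96):
    """Mean of a simple random sample with an approximate 95% interval."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    # Finite population correction, since the state has only 709 zip codes.
    se *= math.sqrt((population_size - n) / (population_size - 1))
    return mean, mean - z * se, mean + z * se

rng = random.Random(0)
# Made-up uninsured rate for each of 709 zip codes (illustration only).
rates = [min(max(rng.gauss(0.12, 0.05), 0.0), 1.0) for _ in range(709)]

for n in (10, 40, 160):
    mean, low, high = mean_with_ci(rng.sample(rates, n), len(rates))
    print(f"n={n:3d}  mean={mean:.3f}  95% CI=({low:.3f}, {high:.3f})")
```

The standard error shrinks roughly with the square root of the sample size, so larger samples tend to give narrower intervals.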

You want to look at multistage sampling processes (in your case, 2-stage sampling). This link has a decent introductory explanation with some examples - the basic idea is that you would first do a simple random sample of the zip codes (for instance, randomly selecting 40 of your 709 zip codes), then another simple random sampling process within each chosen zip code. To know how many zip codes, it will be helpful to look at things like power analysis and studies that are similar to what you want to carry out.
posted by augustimagination at 1:43 PM on February 24, 2015 [1 favorite]
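
The two-stage idea described above can be sketched as follows, assuming you have (or can build) a list of residents or households for each zip code. The data layout, the function name, and the per-zip sample size of 10 are hypothetical; only the "40 of 709" figure comes from the comment.

```python
import random

def two_stage_sample(residents_by_zip, n_zips, n_per_zip, seed=None):
    """Stage 1: simple random sample of zip codes.
    Stage 2: simple random sample of residents within each chosen zip."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(residents_by_zip), n_zips)
    return {
        z: rng.sample(residents_by_zip[z], min(n_per_zip, len(residents_by_zip[z])))
        for z in chosen
    }

# Toy sampling frame: 709 zip codes with 50 placeholder residents each.
frame = {
    f"zip{i:03d}": [f"zip{i:03d}-resident{j}" for j in range(50)]
    for i in range(709)
}
sample = two_stage_sample(frame, n_zips=40, n_per_zip=10, seed=1)
print(len(sample), "zip codes,", sum(len(v) for v in sample.values()), "residents")
```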

Some examples from my own state of residence:

Open this link in a separate tab.

Check 21045. Look at the density of housing, and the amount of area covered.
Check 21787. Look at the density of housing, and the amount of area covered.

In other words, what alms said: Some zip codes will be full of wealthy people, some zip codes will be full of poor people, some zip codes will have people who are more educated than people in the rest of the state. Some zip codes will have more families, others will have more elderly people living alone. Some zip codes will be densely populated, some will be [sparsely] populated.

You can't just pick a handful of ZIP codes willy-nilly and expect data about them to be representative of your entire state. You can pick them randomly, but then you have to do several random picks: 50 ZIP codes from Humongous City. 50 ZIP codes from Rural County. 50 ZIP codes from the Snootyville area. 50 ZIP codes in the area of Welfareton. 50 ZIP codes from Suburbia County. And so on. You need to already know some things about different parts of your state before you can intelligently select a sample of representative ZIP codes.
posted by tckma at 1:47 PM on February 24, 2015
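
The "several random picks" approach is usually called stratified sampling. A minimal sketch, assuming you have already labeled each zip code with a stratum such as urban, suburban, or rural; the labels, zip codes, and function name below are invented for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(zip_to_stratum, n_per_stratum, seed=None):
    """Draw a separate simple random sample of zip codes from each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for zip_code, stratum in zip_to_stratum.items():
        by_stratum[stratum].append(zip_code)
    return {
        stratum: rng.sample(sorted(zips), min(n_per_stratum, len(zips)))
        for stratum, zips in by_stratum.items()
    }

# Hypothetical labels: every third zip code is urban, the rest rural.
labels = {f"{20000 + i}": ("urban" if i % 3 == 0 else "rural") for i in range(300)}
picks = stratified_sample(labels, n_per_stratum=50, seed=2)
for stratum, zips in picks.items():
    print(stratum, len(zips))
```

If the strata differ in total population, results from each stratum would still need to be weighted before being combined into a statewide figure.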

Also, I'm wondering if you're reinventing the wheel here. Kaiser Family Foundation and others probably already have this data, certainly at the state level. What does the exchange itself say?
posted by postel's law at 2:21 PM on February 24, 2015 [2 favorites]

Yes, you can if you do it randomly.

Not necessarily. If, say, rural zip codes are lower-population than urban zip codes, then there will be "too many" rural zip codes and a random sample of zip codes will be biased towards the rural parts of the state.
posted by ROU_Xenophobe at 2:51 PM on February 24, 2015 [3 favorites]
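
A small made-up example shows the size of the effect. The state below is entirely hypothetical: 600 small rural zip codes with a 20% uninsured rate and 109 large urban ones with a 10% rate.

```python
import random

def unweighted_rate(zips):
    """Plain average of zip-level rates: every zip code counts equally."""
    return sum(rate for _, rate in zips) / len(zips)

def weighted_rate(zips):
    """Population-weighted average: every resident counts equally."""
    total_pop = sum(pop for pop, _ in zips)
    return sum(pop * rate for pop, rate in zips) / total_pop

# Hypothetical state: (population, uninsured rate) for each zip code.
state = [(500, 0.20)] * 600 + [(30_000, 0.10)] * 109

rng = random.Random(0)
sample = rng.sample(state, 100)

print(f"true statewide rate:        {weighted_rate(state):.3f}")
print(f"sample, unweighted average: {unweighted_rate(sample):.3f}")
print(f"sample, population-weighted: {weighted_rate(sample):.3f}")
```

In this toy state the true rate is about 10.8%, while the plain average over all zip codes is about 18.5%, because most zip codes are rural even though most people are not.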

If you select the zip codes randomly, this is a form of cluster sampling.

If, say, rural zip codes are lower-population than urban zip codes, then there will be "too many" rural zip codes and a random sample of zip codes will be biased towards the rural parts of the state.

I believe the standard way to deal with this is "probability proportional to size" sampling within the clusters. If one zip code has 8000 residents and another has only 200, you would survey 40 times more people from the first zip code than the second.

It is still the case that sampling error will be higher with cluster sampling than with simple random sampling, for a given total sample size. Cluster sampling is better only if it allows you to get a sufficiently bigger sample overall.
posted by mbrubeck at 3:16 PM on February 24, 2015 [3 favorites]
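
The arithmetic in the 8000-versus-200 example, splitting a fixed number of interviews across chosen zip codes in proportion to population, can be sketched like this. The zip code labels, the total of 410 interviews, and the function name are hypothetical.

```python
def proportional_allocation(populations, total_sample):
    """Split total_sample across zip codes in proportion to population.

    Rounding each share means the parts may not sum exactly to
    total_sample in general.
    """
    total_pop = sum(populations.values())
    return {
        zip_code: round(total_sample * pop / total_pop)
        for zip_code, pop in populations.items()
    }

allocation = proportional_allocation({"zip_a": 8000, "zip_b": 200}, total_sample=410)
print(allocation)  # {'zip_a': 400, 'zip_b': 10}
```

This covers allocation within zip codes that have already been chosen. Choosing the zip codes themselves with probability proportional to population is a separate design decision.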

Your answer is probably probability proportionate to size (PPS) sampling.

That said, are you trying to talk about household data based on zip code data? That you cannot really do. You can't use aggregated zip code data as the value for every household in that zip code. If you want to draw conclusions about what individual households or families do, don't use zip code data. If you're using zip code data, then you can only talk about what happens in zip code areas, not what happens at the household level. Check out the ecological fallacy.
posted by quadrilaterals at 3:21 PM on February 24, 2015 [5 favorites]

You can also buy a lot of this information, if you have a budget for such a thing. One thing that sprang to mind was Prizm, which is very dense market segmentation information, by zip code.

Also, I love their Zip Code look up and the segment definitions.
posted by Ruthless Bunny at 4:12 PM on February 24, 2015

quadrilaterals makes an important point. Those of us suggesting multistage/cluster sampling are assuming that once you have a zip code picked, you are then able to get data on individuals within that zip code. If that is not the case, and instead you only have zip-code level data, it would be a problem for you to draw inferences about individuals no matter what kind of analysis you do. The general rule is that you can only draw inferences about "levels" which are present in your data. Zip code-level data means you can only make conclusions about zip codes, while individual-level data would allow you to draw meaningful conclusions about the individuals in your state themselves.
posted by augustimagination at 9:17 AM on February 25, 2015 [1 favorite]

Ask MetaFilter
