How to calculate uncertainty when redistributing a population
March 5, 2017 5:02 PM

I am trying to understand a statistical problem with resampling and variation, and I’m unsure of my terminology so it’s hard to search. I have a set of geographic areas with known populations, and I’d like to generate new areas with redistributions of those populations. I can assume that the population is evenly spread throughout the original areas, but this gives me numbers without any kind of confidence interval. What would I need to do to calculate variance or deviation for this problem?

I am using “resampling” in the digital audio sense; the statistical term doesn’t seem to match what I mean. My problem is two-dimensional, but I think I can illustrate it in one dimension. Here is a population of 16 grouped into four equally sized areas along a line:
1 1 1 1|2 2 2 2|3 3 3 3|4 4 4 4
Let’s say I want to redistribute that population into three unequally sized areas, like this:
A-A-A-A-A-A B-B-B-B-B C-C-C-C-C
With even distribution of the population, I would expect area A to have four 1s and two 2s, area B to have two 2s and three 3s, and area C to have one 3 and four 4s:
A-A-A-A-A-A     B-B-B-B-B     C-C-C-C-C
1 1 1 1|2 2     2 2|3 3 3     3|4 4 4 4
So that’s nice, but it’s probably incorrect and I’d like to know by how much. My input areas don’t necessarily have evenly-spread-out populations. They may be clumped, which would lead to some uncertainty about how the three output areas are populated:
A-A-A-A-A-A     B-B-B-B-B     C-C-C-C-C
1 11  1|222      2 |33 33      |44  4 4
How could I express this statistically? What would I need to know about the four input areas to create meaningful error bars for the three output areas?
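For reference, the even-spread expectation in the one-dimensional example can be sketched in a few lines of Python. This is only an illustration of overlap-weighted redistribution; the function name `redistribute` and the edge positions (areas of width 4, new areas of width 6, 5 and 5) are assumptions chosen to match the diagram above.

```python
def redistribute(src_edges, src_pops, dst_edges):
    """Split each source population among destination intervals
    in proportion to overlap length (assumes an even spread)."""
    totals = [0.0] * (len(dst_edges) - 1)
    for i, pop in enumerate(src_pops):
        s_lo, s_hi = src_edges[i], src_edges[i + 1]
        for j in range(len(totals)):
            d_lo, d_hi = dst_edges[j], dst_edges[j + 1]
            # Length shared by source interval i and destination interval j.
            overlap = max(0.0, min(s_hi, d_hi) - max(s_lo, d_lo))
            totals[j] += pop * overlap / (s_hi - s_lo)
    return totals


# Four input areas of 4 people each; output areas A, B, C of width 6, 5, 5.
src_edges = [0, 4, 8, 12, 16]
src_pops = [4, 4, 4, 4]
dst_edges = [0, 6, 11, 16]
print(redistribute(src_edges, src_pops, dst_edges))  # [6.0, 5.0, 5.0]
```

This reproduces the 6 / 5 / 5 split, but it says nothing about how far the true counts could be from it, which is the actual question.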
posted by migurski to Science & Nature (10 answers total)
 
Sometimes these things are easier if you step away from the abstractions and talk about what you're actually trying to do, if that's not secret or similar. There may well be an accepted norm for how to deal with the specific situation you're facing and a CI might not even be regarded as necessary for it.
posted by ROU_Xenophobe at 5:44 PM on March 5, 2017 [1 favorite]


Response by poster: Sure, yeah. I have demographic data arranged in tracts, each of which has a population count, e.g. 100 people in a sample tract. I’d like to create new districts with boundaries that might cut through the tracts, and I want an estimate of the resulting district population counts. A district that includes 50% of this tract by area might include 50 of the people in it. Or it might include 0 or 100 of the people in it, depending on how they’re spread out.
posted by migurski at 5:54 PM on March 5, 2017


If you have no information about the population clumping (aside from the fact that it might exist), then there's not much you can do re: meaningful statistics.

Do you have some measure of how clumped the population might be? Is there a maximum or minimum clumpiness? Any idea where the clumps might live within each tract? These will all factor into your sense of error.

I also agree with ROU_Xenophobe in that a CI might not be the best way to deal with the situation. If you really needed a CI, then 1) you'd need more baseline information, and 2) presumably someone would specifically be asking for this CI. But then what would the person do with the CI? What are they trying to get at?
posted by miniraptor at 6:00 PM on March 5, 2017


Response by poster: Thanks for the followup questions. There are a couple of ways I could determine clumpiness: U.S. Census has smaller subdivisions called blocks; they can be quite fine-grained and offer a higher-resolution view of population. There are also gridded population datasets that offer a population count per grid square. Either of these would allow me to have some idea of population distribution inside the tract or district.
posted by migurski at 6:13 PM on March 5, 2017


Okay. I'm not well-versed formally in statistics, but hopefully this won't be too inaccurate or too far from what you had in mind. There's almost always a fudge factor written into this sort of thing when you're dealing with real data, anyhow. But please correct me if anything seems awry.

I think that your best bet is to simply run your redistricting thing over a map of (smaller) blocks, instead of your original (larger) tracts. Then you can just add up populations in all the blocks within each new district, and assume a uniform distribution within each block. Which is probably what you would have done anyway...but you asked about errors.

While redistricting, your new borders will cut through some number of blocks. If the total population of a new district's border-blocks is not unreasonably large (e.g., no more than 5% of your new district population), I'd just set my error bars to be +/- half of the total border-block population. Simple.
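A rough sketch of that rule of thumb in Python, in case it helps. The function name, the input layout (a list of interior block populations plus a list of (population, fraction inside) pairs for border blocks) and the numbers are all made up for illustration.

```python
def district_estimate(interior_pops, border_blocks):
    """border_blocks: list of (population, fraction_inside) pairs
    for blocks that the new district border cuts through."""
    border_total = sum(pop for pop, _ in border_blocks)
    # Uniform-spread estimate: whole interior blocks plus a share of each border block.
    estimate = sum(interior_pops) + sum(pop * frac for pop, frac in border_blocks)
    # Rule of thumb from above: error bar of +/- half the border-block population.
    half_width = border_total / 2
    # Share of the district estimate that sits in border blocks (compare with ~5%).
    border_share = border_total / estimate
    return estimate, half_width, border_share


estimate, half_width, border_share = district_estimate(
    interior_pops=[1200, 950, 1100],
    border_blocks=[(80, 0.5), (60, 0.25)],
)
print(estimate, half_width, round(border_share, 3))
```

Whatever error bar you pick, the hard limits are the interior total (no border-block residents inside) and the interior total plus the full border-block population (all of them inside).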

Alternatively, if you have uncertainties for each block: just calculate everything by assuming a uniform population distribution (within each block), then add the block uncertainties as follows:

Say my new district contains blocks A, B, & C, and its borders cut through 50% of block D and 20% of block E. The uncertainties for each block are sA, sB, sC, sD, and sE. Then the uncertainty for my new district would be:

sqrt(sA^2 + sB^2 + sC^2 + (0.5)^2 * sD^2 + (0.2)^2 * sE^2)

This isn't precisely correct, but honestly you can make this arbitrarily complicated depending on what information you have available.
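Here is the same formula as a small Python sketch. The function name and the example uncertainties are hypothetical, and adding in quadrature like this assumes the block errors are independent of each other.

```python
import math


def district_uncertainty(whole_sigmas, partial):
    """whole_sigmas: uncertainties of blocks entirely inside the district.
    partial: list of (fraction_inside, sigma) for blocks cut by the border."""
    variance = sum(s ** 2 for s in whole_sigmas)
    # Each cut block contributes its uncertainty scaled by the fraction inside.
    variance += sum((f * s) ** 2 for f, s in partial)
    return math.sqrt(variance)


# Blocks A, B, C fully inside; 50% of D and 20% of E inside (made-up sigmas).
sigma = district_uncertainty([10, 10, 10], [(0.5, 20), (0.2, 50)])
print(round(sigma, 2))
```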
posted by miniraptor at 7:00 PM on March 5, 2017


The Chi-squared test seems to get close to what you're asking.
posted by oceano at 11:33 AM on March 6, 2017


I'm having a hard time envisioning this as a statistics problem. Let's say that you have tracts T1, T2, T3,... and also (for lack of another term) counties C1, C2, C3, C4. We can set up a matrix with columns for T1, T2, T3, etc., and rows for C1, C2, C3. We want to populate the matrix with the number of people that belong in the associated tract and county. The columns will sum to the population of the tract, and the rows will sum to the population of the county.

Most entries will be zero because the tract and county do not overlap.
If the tract is entirely within the county, the entry will be equal to the population of tract.

To get a distribution, you have to have a notion about how the remaining numbers are put in the matrix. In the abstract, just finding a single solution could be pretty tricky. In your case, since I gather you have a solution, you might be able to explore the solutions local to the one you have. For example, you might be able to find the min and max number for each cell.
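A minimal Python sketch of the min/max idea, assuming you know what fraction of each tract's area falls in each county. The names `cell_bounds` and `row_bounds` and the numbers are invented for the example, and the bounds are per cell only: they do not enforce that each column sums to the tract population.

```python
def cell_bounds(tract_pops, overlap):
    """overlap[c][t]: area fraction of tract t inside county c.
    Returns a matrix of (min, max) people per county/tract cell."""
    rows = []
    for fractions in overlap:
        row = []
        for pop, frac in zip(tract_pops, fractions):
            if frac == 0:
                row.append((0, 0))        # no overlap: cell is zero
            elif frac == 1:
                row.append((pop, pop))    # tract entirely inside the county
            else:
                row.append((0, pop))      # split tract: anywhere from none to all
        rows.append(row)
    return rows


def row_bounds(rows):
    """Sum the cell minimums and maximums across each county row."""
    return [(sum(lo for lo, _ in row), sum(hi for _, hi in row)) for row in rows]


tract_pops = [100, 80, 60]
overlap = [
    [1, 0.5, 0],   # county C1
    [0, 0.5, 1],   # county C2
]
print(row_bounds(cell_bounds(tract_pops, overlap)))  # [(100, 180), (60, 140)]
```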
posted by SemiSalt at 1:15 PM on March 6, 2017


Best answer: If you need to be really exact about this, go talk to the geographers.

Otherwise, if you can link Census blocks to your existing tracts and new districts, and if your demographics of interest aren't recorded at any level lower than the tract:

(1) Assign each block to whichever existing tract its centroid is in
(2) Break existing tracts down into blocks
(3) Build new districts out of blocks, but all you care about is the proportion of the tract in each district
(4) If 37% of existing tract 1 is in new district A, then assign 37% of its population to district A. Just multiply all its characteristics by .37, or if you feel like being fancy then Monte Carlo it.
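The steps above can be sketched in Python roughly like this. Two assumptions that are mine for the sake of the example: the tract's proportion is measured by block population rather than area, and the Monte Carlo version treats each person as landing in the district independently with that probability (which ignores clumping). Function names and numbers are hypothetical.

```python
import random


def tract_share(block_pops, block_in_district):
    """Fraction of a tract's block population whose blocks fall in the district."""
    total = sum(block_pops)
    inside = sum(p for p, flag in zip(block_pops, block_in_district) if flag)
    return inside / total


def monte_carlo_count(count, share, trials=1000, seed=0):
    """Simulate how many of `count` people land in the district;
    returns an approximate central 95% range of the simulated counts."""
    rng = random.Random(seed)
    draws = sorted(
        sum(rng.random() < share for _ in range(count)) for _ in range(trials)
    )
    return draws[int(0.025 * trials)], draws[int(0.975 * trials) - 1]


# One tract made of three blocks; only the first block is in district A.
share = tract_share([37, 20, 43], [True, False, False])
tract_characteristic = 200           # e.g. a tract-level count of some group
print(share * tract_characteristic)  # simple multiplication, about 74
print(monte_carlo_count(tract_characteristic, share))
```

The simulated range only reflects the independence assumption, so treat it as a lower limit on the real uncertainty rather than a proper confidence interval.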
posted by ROU_Xenophobe at 7:10 AM on March 7, 2017 [1 favorite]


Looked at the webpage your profile links to.

Is this so that you can create the demographics for hypothetical US House or state legislative districts in line with your redistricting work? Hypothetical district 1 has x% black people with family incomes over whatever, etc.

If so, and if this isn't going to be directly used for work with courts -- in which case you'll want to talk with statistically-aware election-law lawyers, or Mike MacDonald and/or Dan Smith at UF about who to talk to -- then doing the plan I outlined will be good enough.

You probably know this already, but different census variables are available down to different geographies. You might want to take a look at the ones only available down to the tract to see if you're likely to use those variables. If not, then just straight-up assigning each block group to whichever hypothetical district contains its centroid and building up from block groups will almost certainly be Good Enough.
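A bare-bones Python sketch of centroid assignment, in case it is useful. It assumes block group centroids are already computed and districts are simple polygons given as coordinate lists; the ray-casting test, the function names and the coordinates are illustrative only, and real data would normally go through a GIS library instead.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings to the right of (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_by_centroid(block_groups, districts):
    """block_groups: list of (centroid_x, centroid_y, population).
    districts: dict of name -> list of (x, y) polygon vertices."""
    totals = {name: 0 for name in districts}
    for cx, cy, pop in block_groups:
        for name, polygon in districts.items():
            if point_in_polygon(cx, cy, polygon):
                totals[name] += pop  # whole block group goes to one district
                break
    return totals


districts = {
    "A": [(0, 0), (5, 0), (5, 10), (0, 10)],
    "B": [(5, 0), (10, 0), (10, 10), (5, 10)],
}
block_groups = [(2, 3, 1200), (4.5, 8, 900), (7, 2, 1500), (9.5, 9, 700)]
print(assign_by_centroid(block_groups, districts))  # {'A': 2100, 'B': 2200}
```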
posted by ROU_Xenophobe at 7:20 AM on March 7, 2017


Response by poster: Thanks. The plan you outlined is exactly what I’ve been doing: straightforward redistribution based on percentage area covered. I’m trying to build some statistical awareness into the process early, since I know how easy it is to assume simple uniform coverage.
posted by migurski at 8:28 AM on March 7, 2017

