Program or method for selecting random stratified sample?
June 28, 2018 3:52 PM

I'm looking to take data sets (usually 400-1000 rows) with multiple variables (usually age group, gender, education level, political affiliation, ZIP code, etc.) and select from the full set a group (usually 18-42 people) that best matches pre-defined targets for those variables.

My organization, from time to time, convenes groups that aim to be "microcosms" of a given community (city or county, typically). We collect demographic information from potential participants (see above) and then select a group from that data that matches the demographics of that community. For example, if a county is 56% Republican-leaning, 24% 65+, 44% High School degree or equivalent, etc., we'd want the selected group to be 56% Republican, and so forth.

For smaller datasets, we've hacked together an Excel solver function that gets us pretty close to a "representative" group, but the function breaks with larger data sets and we have to resort to selecting by hand, which is less than ideal.

Are there programs or strategies we could use to randomly select a "representative" group from a larger set, stratified according to our desired demographic targets, that could be adapted from project to project as the data sets and targets change?
posted by MetalFingerz to Technology (5 answers total) 2 users marked this as a favorite
 
Best answer: I am a data scientist who has previously worked a little with government survey data like the Census. The gold standard here is not to throw out samples until you have a group that looks like your target subpopulation, but to use a weighting technique called "raking".

Here are the basics of how this could be done in Excel, though this example could fall apart quickly:
https://help.xlstat.com/customer/en/portal/articles/2062302-raking-a-survey-sample-tutorial-in-excel
(They are using categorical variables that have been quasi-normalized by turning them into ordinal integers. You should not rake the way they show across variables that are not normalized, like political affiliation represented as 0/1 and household income represented as a scale or bins. The math isn't complicated, but representing your variables appropriately might be trickier than you assume.)

Basically, what you are doing with this technique is saying that Common Sandy and Typical Sam count in your sample, but they count way less than Rare Sue, and you are representing their overall commonness or rareness across all variables of interest at once in a single weight. You can then do simple descriptive stats, like a weighted average income, and come out with a number between them that is most reflective of Rare Sue without using only her data.
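A minimal sketch of that last step in R, with made-up incomes and weights (all numbers here are hypothetical):

    # Hypothetical raking weights: rare profiles count more than common ones.
    income <- c(42000, 45000, 110000)   # Common Sandy, Typical Sam, Rare Sue
    weight <- c(0.8, 0.9, 2.6)          # one weight per person, across all variables at once
    weighted.mean(income, weight)       # pulled toward Rare Sue without using only her data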

A more academic article with more detail on the methods and math: http://faculty.nps.edu/rdfricke/docs/RakingArticleV2.2.pdf
posted by slow graffiti at 4:45 PM on June 28, 2018 [6 favorites]


Oh, and if your sample is too big to handle in Excel, it's time to learn to program in R! Steep learning curve if you've never programmed, but it's free, has a Metafilter-quality internet community behind it, and the package 'survey' has a fancier equivalent of your Excel solver function that could near-instantly handle millions of data points: http://r-survey.r-forge.r-project.org/survey/
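Here's a minimal sketch of raking with the 'survey' package; the respondent pool, column names, levels, and population counts are all made up for illustration:

    library(survey)

    # Hypothetical respondent pool with two categorical variables.
    df <- data.frame(
      party = factor(sample(c("R", "D"), 500, replace = TRUE)),
      age65 = factor(sample(c("yes", "no"), 500, replace = TRUE))
    )
    df$w <- 1                                          # start from equal weights
    des <- svydesign(ids = ~1, weights = ~w, data = df)

    # Known population margins, e.g. 56% Republican and 24% aged 65+ in a county of 100,000.
    party.pop <- data.frame(party = c("R", "D"),    Freq = c(56000, 44000))
    age.pop   <- data.frame(age65 = c("yes", "no"), Freq = c(24000, 76000))

    raked <- rake(des, sample.margins = list(~party, ~age65),
                  population.margins = list(party.pop, age.pop))
    svymean(~party, raked)                             # weighted proportions now match the 56/44 target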
posted by slow graffiti at 4:50 PM on June 28, 2018 [4 favorites]


Also came to say this sounds like you should be using R instead of Excel.
posted by forkisbetter at 7:59 PM on June 28, 2018


A couple of remarks about sample size. A thousand rows is nothing to Excel; it will filter or sort 50,000 rows in the blink of an eye. Computers have also made it possible to use the entire population for many things that had to be sampled in the past. My favorite example: when the Postal Service verifies an address, it's not just checking the format, it's checking the address against every address in the US.
posted by SemiSalt at 7:18 AM on June 29, 2018


Best answer: I'm not sure that raking is what you're really looking for -- it's a good technique for scaling a sample from a subset up to match an overall population, but I think you want to pick a subset for more detailed quantitative or qualitative study, such as a focus group or the like. In this case, picking the 20 people out of 1000 that best match the observed distribution along multiple axes is really a combinatorial optimization problem.

The bad news is that combinatorial optimization problems tend to be difficult (in the computer-science NP-hard sense, where no known algorithm finds the exact optimum in reasonable time); the good news is that you don't really need the single best answer, just a pretty good one, so you don't actually have to solve the whole problem.

The general class of algorithms I would use is stochastic optimization, and I would implement it in a general-purpose programming language rather than R or another stats tool, because really it's a lot of looping. (My company has implemented something quite similar that might work for you, although it's not super user-friendly; MeMail me if you want.) The approach I would try is to draw a random sample, score it against the desired demographic targets, then repeat a million times and keep the best ones, as in the sketch below.
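For concreteness, here's a minimal sketch of that draw-score-repeat loop (in R, since it was already suggested above; the pool, targets, scoring function, and group size are all hypothetical):

    set.seed(1)
    # Made-up pool of 800 potential participants with two binary attributes.
    pool <- data.frame(
      republican = runif(800) < 0.50,
      over65     = runif(800) < 0.30
    )
    targets <- c(republican = 0.56, over65 = 0.24)   # desired group proportions

    # Score a candidate group: total distance of its proportions from the targets.
    score <- function(rows) sum(abs(colMeans(pool[rows, names(targets)]) - targets))

    best <- NULL; best.score <- Inf
    for (i in 1:100000) {             # a million draws also works, just slower
      cand <- sample(nrow(pool), 30)  # draw a random candidate group of 30
      s <- score(cand)
      if (s < best.score) { best <- cand; best.score <- s }
    }
    pool[best, ]                      # the best-matching group found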

Once you have the sample, you could then use raking to adjust the weights to precisely match the population. In your example, if you're picking 10 people, the best you could do is 60% Republican and 20% 65+ (since with 10 people you can only match to the nearest 10%), and it's possible you'd only be able to get 30% HS just because of the unique combinatorics of the sample; raking lets you adjust the weights back up or down to the exact population proportions.
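As a one-margin illustration of that adjustment (made-up numbers): if the selected group of 10 is 60% Republican but the county is 56%, each person's weight gets scaled by target/achieved for their category:

    w <- rep(1, 10)                             # start everyone at weight 1
    rep_flag <- c(rep(TRUE, 6), rep(FALSE, 4))  # 6 Republicans, 4 others
    w[rep_flag]  <- 0.56 / 0.60                 # Republicans weighted down slightly
    w[!rep_flag] <- 0.44 / 0.40                 # others weighted up
    weighted.mean(rep_flag, w)                  # exactly 0.56; raking cycles this across all margins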

In theory, you could just draw a random sample and then use raking or another weighting procedure after the fact to adjust the weights, but with samples as small as yours you could very easily wind up with extreme weights -- during the 2016 election, the LA Times ran a panel survey that had really odd results compared to other surveys because a single respondent, a young African-American man from Illinois who happened to be a Republican, was very heavily weighted in the sample. In fact, given the small samples, you could easily wind up with nobody at all in a particular demographic category, and you can't scale up nobody. (I suspect this is moot and you are actually looking at the detailed behaviour of the group rather than just weighting their survey responses.)
posted by Homeboy Trouble at 10:45 AM on June 29, 2018 [2 favorites]

