Stats Filter -- Smallish Sample Size
October 17, 2018 11:53 PM

My statistics experience is rusty. I want to compare about 175 cities to each other. I'm not sure what is the best way to do that.

I have several independent variables and a few dependent variables.

Is regression analysis appropriate with this sample size? If not, what would be better?

posted by maurreen to Science & Nature (17 answers total)

That's a very vague question. What exactly are you comparing between the cities? What's the variability like? Do you think the regression is linear or non-linear?
Regression is probably workable as it's a highly flexible tool. It just might not be the best for your situation.
posted by solarion at 1:50 AM on October 18, 2018 [4 favorites]

OLS is fine for that sample size. It might not be fine for the particular DVs you're looking at, at least not if you're intending to do something serious with your results.
posted by GCU Sweet and Full of Grace at 4:21 AM on October 18, 2018 [1 favorite]

Yeah in my experience doing regression with cities (even counties) in the US all of the effects are driven by 5-10 observations. Regression allows you to control for population, density, size, etc. Be careful about relying on parametric significance testing, both for violating normality assumptions and effective degrees of freedom.
posted by supercres at 6:31 AM on October 18, 2018
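The caution above about parametric significance tests can be illustrated with a permutation test, which does not rely on a normality assumption. This is only a minimal sketch on simulated data: the 175 "cities", the skewed predictor, and the function names `pearson` and `permutation_p_value` are all assumptions for illustration, not anything from the poster's analysis. It also recomputes the correlation after dropping the five largest observations, as a rough check on how much a handful of cities drives the result.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y.

    Shuffling y breaks any link with x, so the shuffled correlations
    show what chance alone produces for this particular data set.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(pearson(x, y_shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Simulated data (assumption): a right-skewed predictor, as city size often is.
rng = random.Random(1)
x = [rng.lognormvariate(0, 1) for _ in range(175)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

print("r, all 175 cities:", round(pearson(x, y), 3))
print("permutation p-value:", permutation_p_value(x, y))

# Sensitivity check: drop the 5 cities with the largest x and recompute.
keep = sorted(range(len(x)), key=lambda i: x[i])[:-5]
r_trimmed = pearson([x[i] for i in keep], [y[i] for i in keep])
print("r, without the top 5:", round(r_trimmed, 3))
```

If the trimmed correlation differs a lot from the full one, that is a hint that a few large cities are doing most of the work.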

What’s the population that you are trying to infer something about? That changes the effective sample size too. E.g., the same 175 cities may be a large fraction of all modern cities in the USA that have over a certain population, but constitute only a tiny fraction of all human settlements across the world and all known history. The sample size is effectively larger in the first case.

It is not (effectively larger). Statistical inference (i.e. generalizing to a population) is based on the assumption of a population of infinite size. Essentially every sample is assumed to be an infinitely small proportion of the population. So it is the absolute size of the sample that matters (and that is used to calculate the standard error -- and from there things like margin of error, confidence intervals etc), not the proportion of the population sampled. 175 isn't small but it isn't large either. Like you said, smallish. I might even say medium-ish.

Anyway, if you have multiple independent variables (and controls) that you want to put in together then you want regression. What kind of regression will depend on what your dependent variables are. For continuous interval dependent variables use OLS. For binary use logistic. For counts use Poisson or negative binomial. For nominal use multinomial. For ordinal use ordinal. If your variables are lengths of time (how long was it before/how long between etc. etc.) then you want event history analysis (also called survival analysis). You will need one (or more) regression models for each DV.
posted by If only I had a penguin... at 6:53 AM on October 18, 2018 [3 favorites]
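The point above, that the absolute sample size is what enters the standard error, can be written out for the simplest case: the standard error of a sample mean under the usual infinite-population assumption. The symbols $\bar{x}$ (sample mean), $s$ (sample standard deviation), and $n$ (sample size) are notation introduced here for illustration.

```latex
\[
  \mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}},
  \qquad
  n = 175 \;\Rightarrow\; \mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{175}} \approx \frac{s}{13.2}
\]
```

The population size does not appear anywhere in this expression, only $n$.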

Oh, and in addition to what supercres said, be careful about having models that include both population and rates of anything as well as multiple rates of anything. Remember that these will *always* correlate by definition and the correlation will be about -.5* so you'll get statistical artifacts or you'll hide stuff that's actually going on if you have both in the model.

* Go ahead, try it in Excel. Open up a blank spreadsheet. Label one column "Population", one "Inches of rain last year", and one "Number of murders". Fill all three columns using the random number function. Use the population column to calculate the inches of rain per person and the number of murders per person. Calculate the correlation between population and inches of rain per person, and the correlation between the murder rate and population. Correlation is about -.5, right? Refresh to get new random numbers. Correlations are still around -.5. They always will be. Calculate the correlation between the murder rate and inches of rain per person. About .67, right? Refresh... Still .67ish.
posted by If only I had a penguin... at 7:06 AM on October 18, 2018 [1 favorite]
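The spreadsheet experiment in the footnote above can be sketched in Python as well. This is a hedged re-creation, not the poster's actual spreadsheet: it uses the RANDBETWEEN ranges the poster gives further down the thread, and the function names `pearson` and `simulate_rate_correlations` are made up for illustration. A large number of rows is used so the estimates are stable from run to run.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_rate_correlations(n=50_000, seed=0):
    """Draw independent columns, turn two into per-capita rates, correlate.

    Returns (corr(pop, crime_rate), corr(pop, birth_rate),
             corr(crime_rate, birth_rate)).
    """
    rng = random.Random(seed)
    # All three raw columns are independent of each other by construction.
    pop = [rng.randint(5_000_000, 250_000_000) for _ in range(n)]
    crimes = [rng.randint(500, 50_000) for _ in range(n)]
    births = [rng.randint(500, 50_000) for _ in range(n)]
    # Dividing by the same denominator is what induces the correlations.
    crime_rate = [c / p for c, p in zip(crimes, pop)]
    birth_rate = [b / p for b, p in zip(births, pop)]
    return (
        pearson(pop, crime_rate),
        pearson(pop, birth_rate),
        pearson(crime_rate, birth_rate),
    )


r_pop_crime, r_pop_birth, r_rates = simulate_rate_correlations()
print("population vs crime rate:", round(r_pop_crime, 2))
print("population vs birth rate:", round(r_pop_birth, 2))
print("crime rate vs birth rate:", round(r_rates, 2))
```

With these particular ranges the first two correlations should come out clearly negative and the third clearly positive, even though the raw columns are unrelated. The exact magnitudes depend on the ranges chosen for each column, so treat the specific numbers as a property of this setup rather than a universal constant.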

Interesting! What multiplier are you using on the murder and rain columns, If only I had a penguin? I.e., is it "=RAND()", or is it "=RAND()*Population"?
posted by Mogur at 7:21 AM on October 18, 2018

Finite population size is a real thing. We mostly teach and use stats that use assumptions of infinite populations, but that simplifying assumption is not always good and useful. If anyone needs further references on that I can provide them, here is a historical overview. The simple fact is that statistical hypothesis testing on finite populations can be performed, and in that case the size of the finite population changes the predictive power of the same finite sample, depending on what it is considered a subset of.
posted by SaltySalticid at 7:43 AM on October 18, 2018

Your population size is almost always infinite because your population of interest almost always includes an infinite number of counterfactual cases.

If you are researching educational outcomes and spending per student in American school systems, your population is not the around 15,000 school districts that actually exist. The population you're interested in almost certainly includes "PoorRuralCountyInMS but spending like RichyRichpantsDistrictInWestchester" as well as "RichyRichpants District but only spending what the average suburban district does," and a literal infinity of other counterfactuals.
posted by GCU Sweet and Full of Grace at 8:26 AM on October 18, 2018 [2 favorites]

Mogur: Not sure what you mean by multiplier. I double-checked just to make sure I wasn't mis-remembering from grad school before posting.

To get crime I did "=RANDBETWEEN(500,50000)". I used the same for "births" (which I did instead of rain, but obviously you can name the columns anything you want; it would just be a very rainy place). To get population I did "=RANDBETWEEN(5000000,250000000)". Then with those columns I calculated the birth rate (birth/pop) and crime rate (crime/pop) and checked correlations.

And yes, finite populations are a thing, but there's an epistemological difference if you want to adjust for finite populations. You would have to be assuming that you're interested in the specific set of cities in the population, rather than in the set of cities covered by the concept of the modern city. I'm not explaining it clearly, but it's kind of like the difference between using fixed effects (analogous to finite population here) and random effects (analogous to using infinite pop).
posted by If only I had a penguin... at 8:29 AM on October 18, 2018

Uh, yeah what GCU said re:infinite/finite populations. They explained it much better.
posted by If only I had a penguin... at 8:30 AM on October 18, 2018

If I’m studying the top 200 cities in the USA, and I have data on some random 175 of them, I’d most definitely be looking at techniques better suited to this case than standard statistical machinery, but this is not a place for discussion, so I’ll stop there.
posted by SaltySalticid at 9:17 AM on October 18, 2018

If really all you care about are those 200 actually-realized cities, there's very little reason for anyone to care about what you find. By making that claim, you're stating that your work has no possible application. Applications change things, and those changed things are explicitly not part of that population of 200 realized cities.

I occasionally have to smack submission-authors around about this with respect to US states. No, your having all 50 of them does not mean you have the full population.
posted by GCU Sweet and Full of Grace at 9:46 AM on October 18, 2018 [2 favorites]

Thanks for all your comments. Here are answers to some questions:

I do expect the regression to be linear.

The 175 cities represent all of one category, and all cities in the United States in a few other categories. I am including larger cities but not focused on them, partly because the country has more smaller cities, and partly because research often overlooks smaller cities.

Thanks for the warning about rates. I hadn’t thought about that, but it does make sense. The dependent variables are rates based on population. But I can avoid using rates based on population for the independent variables.

I do not understand GCU’s comment at 9:46.
posted by maurreen at 10:14 AM on October 18, 2018

You should not assume that you have the full population just because you have all observations in some set of categories.

You have data on Kalamazoo as it actually exists now and Terre Haute as it actually exists now and Burlington VT as it actually exists now. If you assume you have the full population of cities in those categories, or that the population is that set of cities as they exist now, you are explicitly saying that what might happen if Kalamazoo adopted policies closer to Burlington's is of no interest to you. You are explicitly stating that Kalamazoo five years from now is of absolutely no interest to you. You are explicitly stating that your work and findings cannot be used to talk about those cases because you've defined your population of interest to exclude them.

If you care about or are interested in either outcomes that could have been realized but weren't (Kalamazoo but different policies) or outcomes that haven't been realized yet (Kalamazoo in the future), then your actual population of interest is an infinite array of possibilities, not the few cities that actually exist.

As a practical matter this just means using the normal commands for regression without doing anything weird or fancy, because those already assume an infinite population.
posted by GCU Sweet and Full of Grace at 10:37 AM on October 18, 2018 [3 favorites]
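For readers who want to see what "the normal commands for regression" boil down to, here is a minimal, dependency-free sketch of OLS with several independent variables. Everything in it is an assumption for illustration: the 175 simulated cities, the predictor names `density` and `income`, the true coefficients, and the helper `ols`. In practice a statistics package would be used instead, since it also reports standard errors and diagnostics.

```python
import random


def ols(X, y):
    """Return OLS coefficients (intercept first) via the normal equations.

    X is a list of rows of predictor values; y is the list of responses.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Simulated data (assumption): 175 cities, two predictors, one continuous DV.
rng = random.Random(42)
density = [rng.uniform(0, 10) for _ in range(175)]
income = [rng.uniform(0, 10) for _ in range(175)]
outcome = [
    2.0 + 0.5 * d - 1.5 * inc + rng.gauss(0, 1)
    for d, inc in zip(density, income)
]

intercept, b_density, b_income = ols(list(zip(density, income)), outcome)
print("intercept:", round(intercept, 2))
print("density coefficient:", round(b_density, 2))
print("income coefficient:", round(b_income, 2))
```

The estimates should land near the true values used in the simulation (2.0, 0.5, and -1.5), with some noise. Nothing in the calculation refers to how many cities exist in total, which is the infinite-population assumption being described.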

Hey I’ve studied axiomatic theory of probability and statistics extensively, and I sometimes have to smack around authors who crank out analyses without considering their model or assumptions.
If you have a finite population with finitely many variables that take on finitely many values, then your entire state space is finite. This is just the simplest example to explain, there are many other reasonable cases.

My comments on finite state space and sample size may not be something that op wants to explore, but I’m not wrong.
posted by SaltySalticid at 12:07 PM on October 18, 2018

And finally, for anyone who was left confused by that interchange and is still interested: here are some Wikipedia links that cover some of the basic theory of sample size in the context of finite populations. This is all stuff we teach to undergrads; I probably should have spent more time and provided these at the start. Sorry about that, I didn’t expect such strong pushback against material that I thought was well known.

It is true that for many purposes, the assumption of infinite population is good and useful. For the question here, it is probably fine.

However, it is also true that when your sample size gets large in proportion to your population size, errors go down, though the types of inference you can draw are epistemologically and ontologically different.

My point above was that the latter framework may be worth exploring, depending on the goals of the analysis, for example if there is an interest in a relatively small set of cities and describing how they are.

https://en.wikipedia.org/wiki/Sampling_fraction

https://en.wikipedia.org/wiki/Standard_error#Correction_for_finite_population

https://en.wikipedia.org/wiki/Analytic_and_enumerative_statistical_studies
posted by SaltySalticid at 7:46 AM on October 19, 2018
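The finite population correction covered in the linked Wikipedia section can be sketched as follows. This assumes simple random sampling without replacement and uses the standard correction factor sqrt((N - n) / (N - 1)); the function names `fpc` and `standard_error` are made up for illustration, and the 175-of-200 numbers come from the example earlier in the thread.

```python
import math


def fpc(n, N):
    """Finite population correction for a sample of n drawn from N units."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    if N == 1:
        return 0.0  # the single unit was sampled, so no sampling error
    return math.sqrt((N - n) / (N - 1))


def standard_error(s, n, N=None):
    """Standard error of a sample mean; applies the correction if N is given."""
    se = s / math.sqrt(n)  # infinite-population formula
    return se if N is None else se * fpc(n, N)


s = 1.0  # hypothetical sample standard deviation
print("infinite population:", round(standard_error(s, 175), 4))
print("175 of 200 cities:  ", round(standard_error(s, 175, N=200), 4))
print("175 of 1,000,000:   ", round(standard_error(s, 175, N=1_000_000), 4))
```

With 175 of 200 units sampled the correction factor is roughly 0.35, so the standard error shrinks to about a third of the uncorrected value. When the population is very large relative to the sample, the factor is close to 1 and the two formulas agree, which is why the infinite-population assumption is usually harmless.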


Ask MetaFilter
