What Do I Do With All This Data?
October 22, 2015 8:21 AM Subscribe

I have a research project involving about 90 subjects in two (self-selected) groups. I have collected a number of variables about these subjects (from public sources), and now I would like to do some statistical tests to say whether these variables are significantly correlated with membership in one group or the other. How do I do this? I have a basic knowledge of R and access to Stata 14.

For example, suppose there are 45 people in each group and 30 people in Group A went to Harvard whereas only 5 people in Group B did. That seems like it might be significant, but I don't know the right statistical test to use or how to tell R or Stata to do it.

If it helps, almost all of my variables are categorical (e.g., has this kind of degree, donated to this political party).

posted by jedicus to Science & Nature (10 answers total) 2 users marked this as a favorite

Categorical variables will lead you to a chi-square analysis usually.
Correlation requires two continuous variables.
You're looking for differences.
posted by k8t at 8:29 AM on October 22, 2015

Chi-Square really if you're just wanting to see if the proportion of group membership is statistically different between two groups...
posted by Young Kullervo at 8:29 AM on October 22, 2015

Upon further reading, yeah, it's difficult to tell the direction of your hypotheses, but if you want to see whether Group A (College) or Group B (Other College) had a higher proportion of participants who donated to political party X, then definitely a 2x2 crosstabulation with a chi-square test of independence.
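
A minimal sketch of that 2x2 crosstab in R, using the Harvard counts from the question (30 of 45 in Group A, 5 of 45 in Group B). The object names here are made up for illustration. Fisher's exact test is included as an assumption on my part: it's a common companion when some expected cell counts are small.

```r
# Rows are groups, columns are "went to Harvard" yes/no.
harvard <- matrix(c(30, 15,
                     5, 40),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(group = c("A", "B"),
                                  harvard = c("yes", "no")))

# Chi-square test of independence; for a 2x2 table R applies the
# Yates continuity correction by default (correct = TRUE).
chi <- chisq.test(harvard)

# Fisher's exact test on the same table.
fisher <- fisher.test(harvard)

print(harvard)
print(chi$expected)   # expected counts if group and Harvard were independent
print(chi$p.value)
print(fisher$p.value)
```

With real data you'd normally build the table with table(mydata$group, mydata$harvard) instead of typing counts in. In Stata, the same table can (I believe) be entered directly with tabi 30 15 \ 5 40, chi2 exact.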

Membership in Group A or B is the dependent variable. Colleges attended, degrees received, parties donated to, etc. are all independent variables. But I don't think that would change the answer.
posted by jedicus at 8:43 AM on October 22, 2015 [1 favorite]

If it's mostly 2x2 tests that you're interested in, you can get the same results slightly more simply in R by using the binomial proportions test. In the example you gave, the syntax would be
prop.test(c(30, 5), c(45, 45)), and it gives you a p-value and confidence intervals.
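
To spell that out a little, here is a sketch that stores the result and pulls out the pieces. The variable names are just for illustration. It also checks the "same results" claim against chisq.test on the equivalent 2x2 table, with both functions left at their default continuity correction.

```r
# Successes (went to Harvard) and group sizes, in the order Group A, Group B.
pt <- prop.test(x = c(30, 5), n = c(45, 45))

print(pt$estimate)   # sample proportion in each group
print(pt$conf.int)   # confidence interval for the difference in proportions
print(pt$p.value)

# The same data as a 2x2 table of yes/no counts per group.
tab <- cbind(yes = c(30, 5), no = c(15, 40))
same <- isTRUE(all.equal(pt$p.value, chisq.test(tab)$p.value))
print(same)
```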
posted by muhonnin at 8:44 AM on October 22, 2015 [1 favorite]

Binomial proportions test is also rad.
posted by Young Kullervo at 8:53 AM on October 22, 2015

If you want to look at multiple variables at once, rather than 2*2, perhaps a logit or probit regression.
posted by yesbut at 3:14 PM on October 22, 2015

Do you want to find out the marginal effect of any of your categorical variables, conditional on the other variables? Then just do a regular linear regression model. Since you have a binary dependent variable, this would amount to a linear probability model, which is much, much easier to interpret than a logit or probit.
posted by MisantropicPainforest at 4:00 PM on October 22, 2015

In R this would look like:

lm(dependent_variable ~ independent_variable_1 + independent_variable_2, data = mydata)
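
As a concrete sketch, here is that call on a data frame rebuilt from the Harvard counts in the question. The column names group_a and harvard are made up, and there is only one predictor here; more would be added with +. With a single 0/1 predictor, the slope is just the difference in the share of Group A members between Harvard and non-Harvard subjects.

```r
# One row per subject: 45 in Group A (30 Harvard), 45 in Group B (5 Harvard).
mydata <- data.frame(
  group_a = rep(c(1, 0), each = 45),
  harvard = c(rep(c(1, 0), times = c(30, 15)),   # Group A
              rep(c(1, 0), times = c(5, 40)))    # Group B
)

# Linear probability model: a 0/1 outcome regressed on the predictor(s).
lpm <- lm(group_a ~ harvard, data = mydata)

# Intercept = share in Group A among non-Harvard subjects;
# slope = change in that share for Harvard subjects.
print(coef(lpm))
```

Two caveats worth knowing: with a binary outcome the residuals are heteroskedastic by construction, so the default standard errors from summary() should be read with caution, and with several predictors the fitted "probabilities" can fall outside 0 to 1.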
posted by MisantropicPainforest at 4:02 PM on October 22, 2015

If you're going ahead with a linear model or one of its generalizations, you probably want to consider the following traps for the novice:

1. Multicollinearity: are your predictors correlated with each other? This can lead to strange results (if you put two correlated variables in the model, the best fit might have the "wrong" sign for one of them, for instance) and unstable estimates. If you really wanted to solve this problem computationally, a technique called ridge regression (a type of what's called "regularization") would be one way to do that, but that's sort of advanced, so just being aware of which things are correlated and how this can affect the results is probably enough here.

2. Multiple testing: You should adjust the final set of p-values you get using the R function p.adjust to correct for multiple testing. If you have a whole lot of variables, I'd use the 'BH' or 'BY' methods, which give you control over the false discovery rate; if you don't have very many, something like 'holm' might be more appropriate (like Bonferroni but often a little more powerful).

3. Model fit: Do the assumptions of the model generally hold up? You can get significant p-values for your predictors and still have a model that performs in odd, sub-optimal ways. You can get a sense of this by looking at the residuals, i.e., the difference between the predictions from the fitted model and the actual values. If the residuals look strange (e.g., very asymmetric, or very different variances in different parts of the graph), it may indicate that you need to, e.g., transform one of your input variables (for example, instead of "age", use "log(age)" or "sqrt(age)"). See here for some examples of diagnostics.
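
The multiple-testing adjustment in point 2 can be sketched like this. The raw p-values and their names are hypothetical, invented purely to show the mechanics; in practice they'd be the p-values from your own tests or model.

```r
# Hypothetical raw p-values from four separate tests (illustration only).
raw_p <- c(harvard = 0.0004, law_degree = 0.012, donated = 0.03, phd = 0.20)

# Holm: controls the family-wise error rate.
holm <- p.adjust(raw_p, method = "holm")

# Benjamini-Hochberg: controls the false discovery rate.
bh <- p.adjust(raw_p, method = "BH")

# Side-by-side comparison; adjusted values are never smaller than the raw ones.
print(cbind(raw = raw_p, holm = holm, BH = bh))
```

Note that the method names are case-sensitive: "holm" is lowercase, while "BH" and "BY" are uppercase.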

Pace MisantropicPainforest, I'd personally recommend logistic regression, not linear regression. I think interpreting the logistic coefficients is going to end up being a lot more natural: they are just the natural logs of the odds ratios, so just make sure you understand that an odds ratio is not the same thing as a relative risk. Linear coefficients for a binary outcome can end up being sort of weird and non-intuitive. Plus, in R, doing logistic regression is really no harder than fitting a linear model. You would code the variable you're interested in predicting (say, "group.membership") as a binary variable, then throw everything into a big glm:

results <- glm(group.membership ~ age + college + gender + whatever + 1, family = binomial(logit))
summary(results)
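
For a version you can actually run, here is a sketch on a data frame rebuilt from the Harvard counts in the question. The names mydata, group_a and harvard are made up, and it uses one predictor only. It exponentiates the coefficient to get the odds ratio and puts the relative risk next to it, to show how far apart the two can be.

```r
# One row per subject: 45 in Group A (30 Harvard), 45 in Group B (5 Harvard).
mydata <- data.frame(
  group_a = rep(c(1, 0), each = 45),
  harvard = c(rep(c(1, 0), times = c(30, 15)),   # Group A
              rep(c(1, 0), times = c(5, 40)))    # Group B
)

fit <- glm(group_a ~ harvard, data = mydata, family = binomial(link = "logit"))

# Coefficients are log odds ratios; exponentiate to get odds ratios.
odds_ratio <- exp(coef(fit))["harvard"]

# Relative risk of being in Group A, Harvard vs. non-Harvard, for comparison.
relative_risk <- (30 / 35) / (15 / 55)

print(odds_ratio)
print(relative_risk)
```

Here the odds ratio matches the cross-product from the 2x2 table, (30 * 40) / (15 * 5), while the relative risk is much smaller, which is exactly the distinction to keep in mind when reading the output.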

Hope that helps.
posted by en forme de poire at 12:53 PM on October 23, 2015 [1 favorite]
