Where to start identifying relationships in a set of numerical, binary, menu data
May 7, 2008 5:16 PM

Ok so I have this huge table of survey data - much of it numerical, much of it binary, some of it from selections from menus of text items (e.g. blue, green, orange etc). Where do I start to find the most noticeable relationships between variables?

I have some familiarity with regression analysis and am equipped with R (free stats package, but not too familiar with all its functionality). But how do I
a) deal with the binary and menu-based data?
b) start to find the most significant dependencies? Just randomly? (I mean for example, maybe I will discover that all females between 25 and 30 who like the colour pink tend to eat lots more icecream on Thursdays.)

Even a text book or a tutorial telling me what stats I need to know would be useful.
posted by vizsla to Science & Nature (13 answers total) 2 users marked this as a favorite
 
Try Vector Quantization (two links). My ex did work on that for her dissertation in the 90's, so it should be in textbooks by now. Bob Gray, one of the authors in the second link, is her academic "grandfather".

The vector under discussion is in parameter space: you want to quantize your data to form a set of n-dimensional Voronoi cells. You can do lossy compression by approximating the full set by a representative element in each cell.
posted by Araucaria at 5:29 PM on May 7, 2008


That is huge overkill, especially for someone who says they have some passing familiarity with regression.
posted by ROU_Xenophobe at 5:37 PM on May 7, 2008


Best answer: a) deal with the binary and menu-based data?

Binary dependent variables: logit (aka logistic regression) or probit. Either one is fine. Basic interpretation -- this variable has a positive effect and is significant -- is easy; just a matter of looking at the coefficients and SEs. Substantive interpretation takes more work.

OLS will get you a decent idea to start with.

"Menu-based": it depends on what the variable is.

If the variable has an order to it, like strongly disagree / disagree / neither / agree / strongly agree, then you can do ordered logit or ordered probit. Again, OLS will get you a decent picture to start with. Again, basic interpretation is easy and substantive interpretation trickier.

If the variable has no order to it, like chocolate/vanilla/strawberry, then the answer is multinomial logit. There is such a thing as multinomial probit too, but you shouldn't need to worry about it. WARNING: YOU CANNOT DIRECTLY INTERPRET THE COEFFICIENTS IN A MULTINOMIAL LOGIT MODEL. It is possible for a variable with a positive coefficient to have a negative effect.
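To make that warning concrete, here is a minimal sketch in Python (chosen only to keep the example self-contained). The slopes are invented for illustration, the intercepts are set to zero, and the helper name `multinomial_probs` is hypothetical. Both non-baseline flavours have positive coefficients, yet raising x lowers the probability of chocolate, because strawberry's coefficient is larger still.

```python
import math

def multinomial_probs(x, coefs):
    """Category probabilities under a multinomial logit with one predictor x.

    coefs maps category -> slope on x; the baseline category has slope 0.
    Intercepts are fixed at 0 purely to keep the illustration small.
    """
    scores = {cat: math.exp(b * x) for cat, b in coefs.items()}
    total = sum(scores.values())
    return {cat: s / total for cat, s in scores.items()}

# Hypothetical slopes: vanilla is the baseline; the other two are both positive.
coefs = {"vanilla": 0.0, "chocolate": 0.5, "strawberry": 3.0}

p_low = multinomial_probs(0.0, coefs)
p_high = multinomial_probs(1.0, coefs)

for cat in coefs:
    print(f"{cat:10s}  P(x=0) = {p_low[cat]:.3f}   P(x=1) = {p_high[cat]:.3f}")
```

Chocolate's probability drops from about 0.33 to about 0.07 as x goes from 0 to 1, despite its positive coefficient. This is why predicted probabilities, not raw coefficients, are the thing to look at.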

b) start to find the most significant dependencies? Just randomly? (I mean for example, maybe I will discover that all females between 25 and 30 who like the colour pink tend to eat lots more icecream on Thursdays.)

Sweet zombie Jesus, no. The way you discover the most significant dependencies is to stop and think about what the most significant dependencies are and devise an appropriate theory, and consult the existing literature on the subject. Then you include the independent variables that your theory says should matter, and the ones that the existing literature has found to be important (to the extent that you can). AND THEN YOU STOP.

If you regress everything against everything else, you'll find a bunch of significant coefficients -- in fact, I just about guarantee you that at least 5% of them will be significant at a 0.05 level. How will you be able to tell the true causally important variables from random shit? YOU WILL NOT. Atheoretical empiricism of the type you describe makes baby Jesus cry and makes Santa himself vomit with rage.
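A rough simulation of that point, sketched in Python with only the standard library. Every variable here is pure noise, the sizes (100 observations, 40 variables) are arbitrary assumptions, and the cutoff is the approximate two-sided 5% critical value for a t statistic with 98 degrees of freedom. Roughly one pair in twenty should still come out "significant".

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
n_obs, n_vars = 100, 40          # illustrative sizes, not from the thread
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

T_CRIT = 1.984                   # approx. two-sided 5% cutoff, t with 98 df
n_pairs = 0
n_sig = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r = pearson_r(data[i], data[j])
        t = r * math.sqrt((n_obs - 2) / (1 - r * r))
        n_pairs += 1
        if abs(t) > T_CRIT:
            n_sig += 1

frac_sig = n_sig / n_pairs
print(f"{n_sig} of {n_pairs} noise-vs-noise pairs 'significant' ({frac_sig:.1%})")
```

None of these variables has anything to do with any other, so every hit is a false positive.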
posted by ROU_Xenophobe at 5:54 PM on May 7, 2008


Best answer: b) start to find the most significant dependencies? Just randomly? (I mean for example, maybe I will discover that all females between 25 and 30 who like the colour pink tend to eat lots more icecream on Thursdays.)

Multi-model inference using Generalised Linear Models - which encompass the various distributions ROU_Xenophobe has mentioned.

Essentially you come up with a suite of candidate models, with one variable you're focusing on:

Icecream ~ Day
(Icecream consumption is related to the day of the week)

Icecream ~ Sex
(Icecream consumption is related to the sex of the person)

Icecream ~ Sex + Age
(Icecream consumption is related to the sex and the age of the person)

Icecream ~ Sex * Age
(Icecream consumption is related to the sex and age of the person, but with an interaction - for example, young men might like ice cream more than old men, but old women might like it more than young women)

Icecream ~ Null
(The null model)

These various models should be determined beforehand, as ROU_Xenophobe suggests, based on past knowledge. Don't just throw everything in the mix. Have a meaning for each model in mind as you construct it, to represent a specific hypothesis.

You then run a GLM on these models (explore the "glm" command in R - tada!), and can use the AIC or BIC rankings to determine which model best explains the data. AIC and BIC start from the likelihood of the model and penalise it for the number of parameters. In other words, Sex alone might explain 50% of the variation in the data, but Age * Sex * Day * Clothes Colour * Income * Sexual Kinks might explain 55% of the variation. Is the last model better? No way. AIC and BIC help find the simplest model that works best. AIC tends to be better for prediction - it tends to include a few more parameters to get a good fit - while BIC, which penalises extra parameters more heavily, tends to be better for picking out the most important variables.
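As a hedged illustration of that ranking step, here is a small Python sketch rather than R's glm. It simulates data in which only Sex matters, fits three of the candidate models as plain Gaussian group-mean models, and computes AIC = 2k - 2 logL and BIC = k ln(n) - 2 logL by hand. The data, the effect size, the use of Day as the second factor, and the helper names are all assumptions made for the example.

```python
import math
import random

def gaussian_fit(y, groups):
    """Fit one mean per group by least squares; return (logL, k).

    k counts the group means plus one for the residual variance.
    """
    cells = {}
    for value, g in zip(y, groups):
        cells.setdefault(g, []).append(value)
    means = {g: sum(v) / len(v) for g, v in cells.items()}
    rss = sum((value - means[g]) ** 2 for value, g in zip(y, groups))
    n = len(y)
    # Gaussian log-likelihood evaluated at the maximum-likelihood estimates
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return log_lik, len(means) + 1

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik

random.seed(2)
n = 700
sex = [random.choice("FM") for _ in range(n)]
day = [random.randrange(7) for _ in range(n)]
# Simulated truth: only sex matters (one extra scoop on average for "F").
icecream = [3 + (1 if s == "F" else 0) + random.gauss(0, 1) for s in sex]

models = {
    "Icecream ~ Null": ["all"] * n,
    "Icecream ~ Sex": sex,
    "Icecream ~ Sex * Day": list(zip(sex, day)),
}
results = {}
for name, groups in models.items():
    ll, k = gaussian_fit(icecream, groups)
    results[name] = {"logL": ll, "k": k, "AIC": aic(ll, k), "BIC": bic(ll, k, n)}
    print(f"{name:22s} k={k:2d}  logL={ll:9.1f}  "
          f"AIC={results[name]['AIC']:8.1f}  BIC={results[name]['BIC']:8.1f}")
```

The Sex * Day model always has the highest raw likelihood, because it contains the Sex model as a special case. The penalty terms are what let the simpler Sex model come out ahead; lower AIC or BIC is better.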

Go look up "generalized linear models" "multi-model inference" "multi-model averaging" "maximum likelihood".
posted by Jimbob at 6:14 PM on May 7, 2008


Response by poster: Thanks very much for the replies. Already looking up probit, logit, GLM. Might get on to vector quantization after I beef up my skills a lot more.

Just one thing though. If you assume the nature of the likely dependences, isn't it possible you miss the really interesting relationships that nobody ever thought of before? Maybe traditional wisdom has it that snakes tend to strike on hot days in spring. But maybe it turns out that a very small percentage of snakes that are under 1 metre strike when there is a cold southerly in open fields. I remember going to a marketing seminar about an online wine site and they figured that the people who visited their scientific information pages were experts who checked them out before deciding to purchase. It turned out that many were amateurs who were just trying to educate themselves so they could impress others. While these were a minority, they were also the ones most likely to buy. Won't I miss such gems of discovery if I assume the key dependencies already?
posted by vizsla at 6:24 PM on May 7, 2008


Won't I miss such gems of discovery if I assume the key dependencies already?

Assume the lotus position and repeat after me. Correlation is not causation. Correlation is not causation.

What you're talking about is what experiments are for - you cannot be assured of the meaning of these sorts of relationships from data mining alone. If you do find some kind of complex, interesting, novel relationship, you would have to isolate the factors and do a controlled study, a manipulative experiment on it, to make sure it's meaningful.
posted by Jimbob at 6:53 PM on May 7, 2008


Won't I miss such gems of discovery if I assume the key dependencies already?

Yes. You will also miss a whole bunch of entirely spurious relationships that are nothing more than random noise talking to other random noise. And if you looked and found those relationships, you would be utterly incapable of distinguishing the gems of discovery from random noise.

There is a right way to link inductive and deductive logic in a useful feedback loop. This is to inductively look around, sort of as you're describing. Then, if you find something puzzling, construct a theory that explains that puzzling finding. That theory will have other empirical implications, and you can test the theory by looking at those other implications.

This is not remotely the same thing as the casual empiricism you describe. Theory is absolutely central to any sort of remotely social-scientific work, which is to say just about anything that offers an explanation for human behavior.

and can use the AIC or BIC rankings to determine which model best explains the data

Fitting the data is only very rarely the goal, or even an appropriate goal. More commonly, the goal is or should be to test a theory. When you're testing a theory, you include the variables relevant to that theory. It doesn't matter if you could improve the AIC by dropping a variable, because that variable is an intrinsic part of the theory you're testing.
posted by ROU_Xenophobe at 7:22 PM on May 7, 2008


Response by poster: Jimbob: "Won't I miss such gems of discovery if I assume the key dependencies already?

Assume the lotus position and repeat after me. Correlation is not causation. Correlation is not causation."

What if I don't care about causality? I just want to know that when it's less than 18C on a cloudless day and the humidity is greater than 80%, there are less likely to be snakes on my walk. I mean, who cares why? I just don't want a snake to bite me.

Regarding adding variables incrementally, it's kind of strange. Sometimes there is a positive effect of one variable on another but then when I add another variable, the first variable's effect becomes negative (presumably because there's some co-dependence between the first and second variable). Is there no way to plug the data all in and get some sort of canonical set of variables?
posted by vizsla at 10:48 PM on May 7, 2008


What if I don't care about causality? I just want to know that when it's less than 18C on a cloudless day and the humidity is greater than 80%, there are less likely to be snakes on my walk.

Causal theories and theory-testing are how you differentiate between bullshit random associations and real stuff.

There are lots of empirical relationships in the world. There's still a strong relationship between which league wins the World Series and which party wins the White House -- when the American League wins, the winner will probably be Republican. There's also a strong relationship between the Redskins winning the week of the election and the incumbent party winning. These are real empirical relationships. If they were in your data, you would find them.

But of course there's not really anything going on, and these are just coincidences. If you modified your behavior as a result of them, you would be an idiot. How do we know that these are just coincidence? Because there is no possible causal connection between the two. Because there is no way for World Series wins to affect presidential elections, and there is no other variable that both causes the American League to win and causes Republicans to win.
posted by ROU_Xenophobe at 7:06 AM on May 8, 2008


Regarding adding variables incrementally

This is called stepwise regression, and it also makes baby Jesus cry.

it's kind of strange. Sometimes there is a positive effect of one variable on another but then when I add another variable, the first variable's effect becomes negative (presumably because there's some co-dependence between the first and second variable).

It might be because the relationship changes when you control for the new variable. This is a good reason to just regress the model you have in mind and not commit the sin of stepwise regression.
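A small simulated sketch of that first case, in Python with only the standard library. The data-generating process is an assumption invented for the example: x1 has a true effect of -1 on y, x2 has a true effect of +3, and x1 closely tracks x2. Regressing y on x1 alone gives a positive slope, and the sign flips once x2 is controlled for.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

random.seed(3)
n = 500
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [v + random.gauss(0, 0.5) for v in x2]            # x1 tracks x2 closely
y = [-1.0 * a + 3.0 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

# y ~ x1: slope of the simple regression
slope_alone = cov(x1, y) / cov(x1, x1)

# y ~ x1 + x2: solve the 2x2 normal equations for the two slopes
s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
s1y, s2y = cov(x1, y), cov(x2, y)
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

print(f"x1 alone:    slope = {slope_alone:+.2f}")
print(f"x1 with x2:  b1 = {b1:+.2f}, b2 = {b2:+.2f}")
```

Neither number is "wrong"; they answer different questions. Which one you want depends on the model you had in mind before you ran anything.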

Or it might be because you're regressing shit against shit, so of course the signs are going to flip, even for that 5% of random shit that turns out to be .05 significant.

There's no way to plug the data all in and get some sort of canonical set of variables?

In the sense of "These are the variables that affect Y, and all the variables that affect Y, and none of the variables that don't affect Y?"

No, there is no way to do that, because affecting Y is a causal process. More broadly, there is not any way on God's green earth to just throw everything at some regressions and get back the actual data-generating process.
posted by ROU_Xenophobe at 7:14 AM on May 8, 2008


It doesn't matter if you could improve the AIC by dropping a variable, because that variable is an intrinsic part of the theory you're testing.

True, but if you have two theories, one without a variable and one with, and the theory without the extra variable "wins" in the multi-model comparison, then you have more support for that theory.
posted by Jimbob at 2:31 PM on May 8, 2008


Not really. You have yes theory, which says the variable matters, and no theory, which says it doesn't.

You see that the variable is significant and in the right direction for yes-theory.

That supports yes-theory and is evidence against no-theory. Even if the model without the variable has a better AIC or adjusted R² or whatever. Why? Because the goal is not to fit the data, it is to test the predictions of the two theories.
posted by ROU_Xenophobe at 9:57 PM on May 8, 2008


I agree whole-heartedly with the responses that came before. However, for the sake of looking at both sides, there is another way to look at blind approaches to regression. The actuarial approach is that anything that predicts matters, so you do throw everything in (I assume all at once) and use whatever variables fall out to predict. This is not model- or theory-testing though, it’s what insurance companies do to minimize risk. They don’t care WHY drivers under 25 have more accidents, they just want to know that they are riskier and so they charge drivers under 25 more for insurance.

If you really don’t want to be bitten by snakes, then why not avoid your sidewalk on Tuesdays when it’s below 80 degrees and the American League has recently won the World Series? But if you want to say that ice cream sales will go up in that same instance and you plan your business tactics around that, you’re an idiot. If you want to build a psychological theory of consumerism, your publication would (or at least should) get rejected by peer-reviewed journals.

What I’m saying is that there is a time and place for blind empiricism, mostly relating to risk-management (in a Chicken Little kind of way). If you want to explain anything or understand the world better, empiricism leads to fool’s gold.
posted by parkerjackson at 8:56 AM on May 10, 2008

