determining influence of factors
January 9, 2007 11:47 AM

StatisticsFilter: I have a set of factors and need to determine which have the most influence and which the least: not on a numerical output, but on each other.

So, for example, let's say I have four factors: A, B, C, and D. I want to know if B affects the probability of A, C, and D more than C affects the probability of A, B, and D. Except I have hundreds of factors, not just four. I think this is what linear regressions are for, but since there's no single output to score, I'm not sure if there's a way to do that here.

My meager understanding of statistics suggests starting with a test of the independence of each factor. So, starting with B, comparing p(A) against p(A|B), p(C) against p(C|B) and so on, and using these results to somehow get a score for the influence of B.
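
A rough sketch of that comparison in Python, assuming every factor is recorded as 0 or 1 for each case. The data, the function names, and the choice to score B by averaging the absolute shifts p(X|B) - p(X) are all illustrative assumptions, not an established method:

```python
# Hypothetical 0/1 data: one dict per case, one key per factor.
cases = [
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 1, "C": 1, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 0, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 0, "C": 0, "D": 1},
    {"A": 0, "B": 0, "C": 1, "D": 0},
    {"A": 1, "B": 0, "C": 0, "D": 1},
    {"A": 0, "B": 1, "C": 1, "D": 0},
]

def prob(rows, target):
    """Estimate p(target) as the share of rows where target == 1."""
    return sum(r[target] for r in rows) / len(rows)

def shifts_given(given, rows):
    """For every other factor X, return p(X | given) - p(X)."""
    subset = [r for r in rows if r[given] == 1]
    if not subset:
        return {}
    others = [k for k in rows[0] if k != given]
    return {x: prob(subset, x) - prob(rows, x) for x in others}

def influence_score(given, rows):
    """Assumed score: mean absolute shift across the other factors."""
    shifts = shifts_given(given, rows)
    return sum(abs(v) for v in shifts.values()) / len(shifts)

for name in ["A", "B", "C", "D"]:
    print(name, shifts_given(name, cases), round(influence_score(name, cases), 3))
```

A shift of zero means the two factors look independent in this sample. With hundreds of factors and a modest number of cases, many nonzero shifts will be noise, so the scores are only a starting point.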

If the answer is very involved, a link to the general method of solving this kind of problem would be appreciated. Thanks!
posted by rottytooth to Science & Nature (7 answers total)
 
Best answer: The simplest statistic would be the correlation coefficient. Excel and every statistics package have a canned correlation function.

In your example, the result would be a 4 x 4 table with a correlation coefficient between -1 and +1 in each cell. If you're using Excel, I would use conditional formatting to highlight cells that exceed a particular threshold, such as +/- 0.75. That way you can visually see which variables are most correlated.
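
For anyone not using Excel, the same idea can be sketched in plain Python. The four columns are made-up numbers, and the `pearson` helper is written out by hand only to keep the example self-contained; any statistics package has an equivalent built in:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: one list of observations per variable.
data = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 12],   # rises in step with A
    "C": [6, 5, 4, 3, 2, 1],     # falls as A rises
    "D": [1, -1, 0, 0, -1, 1],   # unrelated to A
}
names = list(data)

# Full 4 x 4 correlation matrix.
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# The "conditional formatting" step: list pairs at or beyond +/- 0.75.
threshold = 0.75
flagged = [
    (a, b, matrix[a][b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(matrix[a][b]) >= threshold
]

for a in names:
    print(a, " ".join(f"{matrix[a][b]:+.2f}" for b in names))
print("flagged pairs:", [(a, b) for a, b, _ in flagged])
```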

While correlation coefficients are useful, they tend not to produce very useful statistics for hypothesis testing purposes.

A seminal article on using correlation coefficients to determine relevant markets is George J. Stigler and Robert A. Sherwin, "The Extent of the Market," 28 Journal of Law and Economics (1985).
posted by GarageWine at 12:45 PM on January 9, 2007


Best answer: Do you have hundreds of factors or hundreds of variables?

If you have a whole bunch of variables, and want to see if some of them clump together in interesting ways, you'd be in need of some Factor Analysis.

Whatever stats/database program you're using should have the ability to do it, if you hunt around for it. I don't have any specific instructions on hand, but you could check out some intro-level stats books.
posted by CKmtl at 12:46 PM on January 9, 2007


Best answer: It seems to me that you're missing the first two steps that should underlie any statistical analysis.

First, what are you trying to find out? What are these hundreds of variables or factors, and why do you (or the people you're working for) give the slightest damn which is the most strongly associated with the others? If you don't have a clear idea of what you're trying to learn, you're not going to learn it.

Second, you should have a theory. Throwing everything against everything else is not a good way to learn valuable information -- it's a good way to have a few nuggets of useful information mixed up with a whole bunch of spurious relationships with no way to sort the useful information from the spurious crap. Having a theory with testable implications is how you find the grain in all the chaff.
posted by ROU_Xenophobe at 1:18 PM on January 9, 2007


Best answer: GarageWine has it: you need to construct a correlation matrix. Every stats program I've ever used will do this for you fairly easily--especially SPSS.

ROU is right in that this type of analysis alone is not very useful. It can be a good first step, though--if you identify relationships between variables that you wouldn't expect, trying to figure out the WHY behind the correlation can be fun and potentially rewarding research.
posted by jtfowl0 at 2:06 PM on January 9, 2007


Response by poster: Awesome, thanks everyone.

Yeah, I'm starting off just trying to determine the most influential variables (so that for each new case they can be determined first). But eventually I do want to work toward theory, so once I have the correlation matrix working, I'll start reading up on factor analysis.
posted by rottytooth at 2:58 PM on January 9, 2007


Okay, but:

A correlation matrix will tell you about statistical importance, but won't say a damn thing about substantive importance. A can be highly correlated with B, but still be utterly unimportant in determining B. Likewise, the most important thing in determining B might be correlated with it at only a low level.
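
A small made-up illustration of the difference, using the regression slope as a stand-in for "how much B moves per unit of A". Both datasets and the helper name are assumptions chosen to make the contrast obvious:

```python
import math

def pearson_and_slope(x, y):
    """Return (correlation, least-squares slope of y on x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Case 1: perfectly correlated, but y barely moves when x changes.
x1 = list(range(1, 11))
y1 = [100 + 0.001 * v for v in x1]
r1, slope1 = pearson_and_slope(x1, y1)

# Case 2: x shifts y by 50 on average, but other noise swamps it.
x2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
noise = [-100, 100, -100, 100, 0, 100, -100, 100, -100, 0]
y2 = [50 * v + e for v, e in zip(x2, noise)]
r2, slope2 = pearson_and_slope(x2, y2)

print(f"case 1: r = {r1:.3f}, slope = {slope1:.3f}")
print(f"case 2: r = {r2:.3f}, slope = {slope2:.3f}")
```

Case 1 has a correlation of essentially 1 with a negligible slope; case 2 has a slope of 50 with a correlation below 0.3. Ranking by correlation alone would put them in the wrong order of substantive importance.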
posted by ROU_Xenophobe at 3:49 PM on January 9, 2007


Listen to ROU_Xenophobe, for he speaks the truth. If you go hunting through a huge correlation matrix looking for big numbers, you're likely to find a bunch of interesting relations that are due to nothing more than the quirks of the sample you're looking at. Sure you'll be able to come up with a good story to explain the relations, but that's worthless if what you're explaining is in reality a fluke. If you go in with a few specific questions in mind, you'll greatly reduce your chances of finding spurious results.
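
To see how easily this happens, here is a quick simulation sketch. It generates variables that are independent by construction and counts how many pairs still look strongly correlated. The sizes (60 variables, 20 cases), the 0.5 cutoff, and the function names are arbitrary assumptions:

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def count_spurious(n_vars=60, n_cases=20, threshold=0.5, seed=0):
    """Count pairs of independent random variables with |r| >= threshold."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_cases)] for _ in range(n_vars)]
    hits = 0
    pairs = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            pairs += 1
            if abs(pearson(columns[i], columns[j])) >= threshold:
                hits += 1
    return hits, pairs

hits, pairs = count_spurious()
print(f"{hits} of {pairs} pairs exceed the threshold by chance alone")
```

None of those flagged pairs reflects a real relationship, since every column was drawn independently. Raising `n_cases` makes the count shrink; raising `n_vars` makes it grow.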

By the way, it sounds like your variables might be dichotomous -- that is, either A occurred or A did not occur (1 or 0). If that's the case, it introduces some complications to any analyses you might do and you ought to do some reading or googling on binary/logistic/logit models before you proceed.
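
As a taste of what a logit model looks like, here is a minimal sketch that fits p(A=1) as a logistic function of a single 0/1 predictor B. The counts are invented, and the hand-rolled gradient ascent is only there to keep the example dependency-free; a real analysis would use a statistics package's logistic regression routine:

```python
import math

def fit_logistic(b, a, steps=3000, rate=1.0):
    """Fit p(A=1) = 1 / (1 + exp(-(intercept + slope * B))) by gradient ascent."""
    intercept = 0.0
    slope = 0.0
    n = len(a)
    for _ in range(steps):
        g_intercept = 0.0
        g_slope = 0.0
        for bi, ai in zip(b, a):
            p = 1.0 / (1.0 + math.exp(-(intercept + slope * bi)))
            g_intercept += ai - p          # gradient of the log-likelihood
            g_slope += (ai - p) * bi
        intercept += rate * g_intercept / n
        slope += rate * g_slope / n
    return intercept, slope

# Hypothetical counts: when B=1, A occurs in 30 of 40 cases;
# when B=0, A occurs in 10 of 40 cases.
b = [1] * 40 + [0] * 40
a = [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30

intercept, slope = fit_logistic(b, a)
odds_ratio = math.exp(slope)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, odds ratio = {odds_ratio:.2f}")
```

With one binary predictor, the fitted slope is just the log of the odds ratio from the 2 x 2 table: (30/10) / (10/30) = 9. The model becomes more useful once several predictors enter at the same time.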
posted by nixxon at 6:39 PM on January 9, 2007

