Trying to find relationships in numbers
March 14, 2006 2:06 PM

How can I have my computer analyze a small dataset to find patterns?

Here is a very crude, semi-entertaining analogy of what I'm trying to do (completely unrelated to my actual project, trust me).

Say I'm sitting in a cubicle and I keep smelling an odor. I decide to graph a bunch of factors that might be related. The data would look like this, commas separating what was noted each hour:
(Time): 10am, 11am, 12pm, 1pm, 2pm, 3pm, 4pm, 5pm
Toilet flushes heard: 2, 1, 1, 6, 2, 1, 4, 3
Taco Bell meals brought in: 1, 1, 3, 4, 2, 1, 1
Boss's dog in room: 1, 0, 0, 1, 1, 0, 1, 1 (1=in room, 0=not)
Coworkers in room: 5, 2, 1, 6, 3, 2, 2, 4
Febreze squirts heard: 2, 4, 1, 1, 3, 2, 3, 5
Odor strength: 5, 2, 6, 3, 0, 3, 6, 4

I would want to have the computer match the different sets with one another, trying different normalizations and so forth, to determine what factor results in the smallest standard deviation with "odor strength"... i.e. what factors might be most at play (especially in combination). Maybe it's the Nacho Supremes and no Febreze, not the activity in the restroom.

Obviously in Excel I can look at all this data and graph it, but it's tough to compare sets. Is there a cheap shareware program that can do stuff like this? I'm not looking for Mathematica or anything pricey... this is just for an experiment I'm doing.

(and yes, the correct solution for the analogy is "quit your job!")
posted by rolypolyman to Science & Nature (27 answers total)
hmm... This is a pretty standard AI problem. It looks like you might be able to generate a decision tree with C4.5, although you might need to reformat your data.

Weka is a suite of AI tools you can use, and it's open source. I've only really heard of people using it as a platform for other AI research, although it does include a lot of basic functionality.
posted by delmoi at 2:20 PM on March 14, 2006

actually, this looks like a great application for a Bayesian network. Keep in mind that some of your data items might depend on each other (like toilet flushes and squirts of Febreze)
posted by delmoi at 2:23 PM on March 14, 2006

it's not strongly correlated with any of the variables, although Febreze has the relatively strongest correlation (mercifully, negative). In other words, there is no statistically reliable correlation with any variable. Interestingly, the presence of co-workers seems to improve the air quality!
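A minimal sketch of that kind of check, in plain Python with no stats package. The `pearson` helper and the dictionary layout are illustrative choices, and Taco Bell is left out because the question lists only seven values for it:

```python
from math import sqrt

# Hourly observations from the question. Taco Bell is omitted because
# the question gives only seven values for it.
data = {
    "toilet":    [2, 1, 1, 6, 2, 1, 4, 3],
    "dog":       [1, 0, 0, 1, 1, 0, 1, 1],
    "coworkers": [5, 2, 1, 6, 3, 2, 2, 4],
    "febreze":   [2, 4, 1, 1, 3, 2, 3, 5],
}
odor = [5, 2, 6, 3, 0, 3, 6, 4]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Correlation of each factor with odor strength, strongest first.
correlations = {name: pearson(values, odor) for name, values in data.items()}
for name, r in sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:10s} r = {r:+.2f}")
```

With these numbers Febreze should come out on top at roughly -0.25, which is weak for eight observations.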
posted by adamrobinson at 2:24 PM on March 14, 2006

Just FYI, what you're asking the computer to do is really hard. This is an area where the human brain excels.
posted by knave at 2:26 PM on March 14, 2006

PS: it's too small a data set to begin to draw conclusions about various pairings of variables. Occam's razor + basic statistical inferences = no reliable conclusions. Beware of spurious correlations between variables (i.e., in any random group of variables, correlations will appear between various variables, but that does not mean there's any causal connection).
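To get a feel for how easily spurious correlations show up at this sample size, here is a rough simulation sketch. The function name, the 0.5 threshold, the trial count and the seed are all arbitrary choices for illustration:

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def share_of_strong_correlations(n_obs=8, trials=5000, threshold=0.5, seed=0):
    """Fraction of unrelated random pairs whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Two columns of pure noise with no relationship at all.
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(x, y)) > threshold:
            hits += 1
    return hits / trials


small = share_of_strong_correlations(n_obs=8)
large = share_of_strong_correlations(n_obs=80)
print(f"n=8:  {small:.1%} of unrelated pairs have |r| > 0.5")
print(f"n=80: {large:.1%} of unrelated pairs have |r| > 0.5")
```

With eight observations a sizable share of completely unrelated pairs should clear the threshold, while with eighty almost none should.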
posted by adamrobinson at 2:27 PM on March 14, 2006

Just FYI, what you're asking the computer to do is really hard. This is an area where the human brain excels.

You are completely wrong. This is very easy for a computer to do, much easier than for a person. It's just not a very common problem in everyday life, so there are no general, easy to use programs to do it. These kinds of things are used all the time in industry and research.

For example: looking at gene expressions in certain conditions. Very routine.
posted by delmoi at 2:40 PM on March 14, 2006

You could try inputting the different sets into a neural network with odor strength as the result in an attempt to "train" it to predict odor based on the input factors.
posted by vacapinta at 2:40 PM on March 14, 2006

So it sounds like you want to run a bunch of ANOVA-ish analyses. SPSS does
this well and is quite user friendly. You can get a free month download from
their website (at least you could last summer). Otherwise Scilab is a good (and free)
clone of Matlab. It doesn't have much built in, but if you know a little linear algebra
and programming you can do it in Scilab.
posted by thrako at 2:47 PM on March 14, 2006

You want to explain odor strength, measured at different levels. You have a bunch of variables you think might have something to do with odor strength. Right?

Assuming your small dataset isn't super-small, this seems like a job for plain-old multiple regression. Though if you don't have many more observations than you do independent variables, your standard errors will be big.

I think you can beat Excel into doing a regression, but there's free stuff out there that will do it more nicely. GRETL, if you want something point and click.

A neural network or Bayesian network seems like swatting a fly with Mjollnir, unless you need to do something like that to deal with a low-N problem.
posted by ROU_Xenophobe at 3:35 PM on March 14, 2006

Or just start with a correlation matrix. GRETL might do that; I dunno.

R will do it, one way or another. R will do anything you want it to, but it's big and scary. But free!
posted by ROU_Xenophobe at 3:40 PM on March 14, 2006

For a dataset that small, just graph each & look at them. from looking at it though, I seriously, seriously doubt you'll find any correlation that won't be completely off the wall.
posted by devilsbrigade at 3:53 PM on March 14, 2006

Holy ass, people; AI, neural networks, Bayesian analysis? Have none of you people ever heard of a little thing called statistics? This is what statistics was designed to do: What is the correlation between observed independent variable x and measured resultant dependent variable y? I mean, were it just Taco Bell and stankiness you could do this on a TI-84 in about 2 seconds.

As it is, I think ROU has it, this is a multiple regression. The idea of regression is to try to predict a certain measured value given a whole bunch of external factors. By far the easiest and most common of these is the linear regression, which assumes that the variables influence the dependent value in linear combination:

stankiness = a * taco bell + b * toilet + c * dog proximity ...

By inserting a whole mess of observations into the above equation and then solving for a, b, c, ... you can get a general idea of how each variable influences stankiness. Again, as ROU mentioned, this is why you ideally need many more observations than independent variables. If you do it right, you'll get something in the form of:

stankiness = 12 * taco bell + .5 * toilet + (-2) * dog ...

In which case you can see that taco bell is extremely positively linearly associated with stankiness, while the toilet barely so, while the dog actually makes things smell nicer, because the owner washes him regularly.
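As a rough sketch of what "inserting observations and solving for a, b, c" looks like, the plain-Python snippet below builds made-up stankiness values from the example coefficients (12, 0.5, -2) and then recovers them by least squares. The observations are hypothetical and the helper names `solve` and `least_squares` are just illustrative; a stats package would do all of this for you:

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def least_squares(rows, y):
    """Coefficients minimizing the squared error, via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical observations: (taco bell, toilet, dog) for six hours.
observations = [(1, 2, 1), (3, 1, 0), (0, 4, 1), (2, 2, 0), (4, 0, 1), (1, 5, 0)]
# Stankiness built from the example coefficients with no noise added.
stankiness = [12 * t + 0.5 * f - 2 * d for t, f, d in observations]

a, b, c = least_squares(observations, stankiness)
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")
```

Because the made-up data has no noise, the fit should hand back the coefficients it was built from. Real observations will only approximate them, and less reliably the fewer observations you have.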

To further optimize this, you can conduct observations over a couple days (i.e. take a number of 'samples') and then normalize across them to get an even more accurate picture.

ANY basic statistics program out there will be able to do linear regression analysis. MINITAB is one such package, and is free for educational purposes, so if you or anyone you know is at all connected to a university, you can probably get it for free. It's possible even Excel can do it. Hell, I'd say it's even likely that Excel can do it. Just F1 and search the help index for 'linear regression'.

Should you decide to get fancy, there are many more flavors of regression analysis out there than just linear.
posted by ChasFile at 5:24 PM on March 14, 2006

GRETL is fine for basic stuff, and free. I've used it successfully and it's what I recommend to people who aren't going to be doing enough statistics to justify learning R. The only real annoyance I've found with it is an intolerance of missing data.
posted by ROU_Xenophobe at 5:38 PM on March 14, 2006

unless you're going to use non-parametric stats, you need to sit down and think clearly about the models you're using - what the numbers "mean". it's critical you do that before you start messing around with random formulae. both because the random formulae will give you meaningless results, and because thinking about things like that will lead you to a solution.

here's the way i'd do it:

since there's no real "physical" model here i'd choose something to make the maths easy. i'll say that to get from any set of observations of "causes" to the set of observations of "smells" you can apply an arbitrary scaling (to correct for different scales used to measure the two effects). i'll also say that we expect the prediction (after scaling) to have an error that's constant over time (ie independent of the predicted value), the same for each cause, and distributed normally.

so for cause i we have a scaling k_i and the model for time j, m_ij is k_i c_ij (where c_ij is cause i at time j, etc).

now you wanted to find the "best" combination of causes to explain the smell. by best, we mean likelihood (and because i picked such a simple model this is just least squares (and for the same reason it's also bayesian if you object to likelihoods)).

in other words, if the smell at time j is s_j, we want to minimize:

L = sum_j [(s_j - sum_i k_i c_ij)^2]

by choosing the k's.

so let's differentiate L wrt k_i:

dL/dk_i = -2 sum_j [s_j c_ij - c_ij sum_l c_lj k_l]
= 0 at minimum

you can rewrite that as a matrix equation and solve it. i don't trust the <pre> formatting here, so i won't try to show the working, but if S is a "horizontal" vector of n_j observations (8 in your case), C is basically your matrix of observations above, each different cause (dog, etc) is a different column, and n_i (5 in your case) columns, K is a vertical vector with n_i (5) values, then SC = CKC.

if you solve that for K (it is 5 simultaneous linear equations for k_i) then you're done.

looking at the RHS...

CK(j) is c_ij k_i and then CKC(l) is c_ij k_i c_lj (as above, duh) so that's equivalent to c_ij c_lj k_i or C^TC K. there must be a reason for that, but i can't remember why.
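for what it's worth, the reason looks like plain index bookkeeping. a sketch, assuming C is laid out as the n_j by n_i matrix with entries C_{ji} = c_{ij} and S is the row vector of smells (these layout choices are assumptions, picked to be consistent with the matrices listed here):

```latex
\begin{aligned}
L &= \sum_j \Bigl( s_j - \sum_l k_l\, c_{lj} \Bigr)^2 \\
\frac{\partial L}{\partial k_i}
  &= -2 \sum_j c_{ij} \Bigl( s_j - \sum_l c_{lj}\, k_l \Bigr) = 0 \\
\Longrightarrow\quad
\sum_l \Bigl( \sum_j c_{ij}\, c_{lj} \Bigr) k_l &= \sum_j c_{ij}\, s_j \\
\Longrightarrow\quad
(C^{T} C)\, K &= C^{T} S^{T} = (S C)^{T}
\end{aligned}
```

the inner sum over j is exactly entry (i, l) of C^T C, which is why the transpose turns up on the left.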

C^T is (i added a zero at the end of the second row):
2 1 1 6 2 1 4 3
1 1 3 4 2 1 1 0
1 0 0 1 1 0 1 1
5 2 1 6 3 2 2 4
2 4 1 1 3 2 3 5

C is:
2 1 1 5 2
1 1 0 2 4
1 3 0 1 1
6 4 1 6 1
2 2 1 3 3
1 1 0 2 2
4 1 1 2 3
3 0 1 4 5

Their product is
72 39 17 77 50
39 33 8 44 24
17 8 5 20 14
77 44 20 99 64
50 24 14 64 69

S is:
5 2 6 3 0 3 6 4

SC is:
75 46 18 87 71

so we can solve for K:
k1 = 84598/213001 = 0.4 (toilet)
k2 = 126632/213001 = 0.6 (taco)
k3 = -67730/213001 = -0.3 (dog)
k4 = -9197/213001 = -0.0 (coworker)
k5 = 136099/213001 = 0.6 (febreze)

we can check that by calculating CK, which is the model:
2.1 3.5 2.8 4.8 3.5 2.2 3.7 3.9
5 2 6 3 0 3 6 4

and that matches so poorly i think i've screwed up somewhere. the idea, in case it's not obvious, is that your smell is 0.4 toilet, 0.6 taco etc. and you can solve for this exactly (as i tried to do above) if you use a "nice enough" model. note that you need as many (or more) time intervals as causes, of course (otherwise you have more unknowns than measurements).
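one way to double-check the arithmetic is to redo the solve with exact fractions. this is only a sketch in plain python: the `solve_exact` helper is an illustrative name, and the taco column uses the same padded zero as the matrices in this post.

```python
from fractions import Fraction

# Columns of C: toilet, taco, dog, coworkers, febreze (taco padded with a zero).
C = [
    [2, 1, 1, 5, 2],
    [1, 1, 0, 2, 4],
    [1, 3, 0, 1, 1],
    [6, 4, 1, 6, 1],
    [2, 2, 1, 3, 3],
    [1, 1, 0, 2, 2],
    [4, 1, 1, 2, 3],
    [3, 0, 1, 4, 5],
]
S = [5, 2, 6, 3, 0, 3, 6, 4]


def solve_exact(a, b):
    """Gauss-Jordan elimination on the system a x = b using exact fractions."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick any row at or below the diagonal with a nonzero entry.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


n_causes = len(C[0])
CtC = [[sum(row[i] * row[j] for row in C) for j in range(n_causes)] for i in range(n_causes)]
CtS = [sum(row[i] * s for row, s in zip(C, S)) for i in range(n_causes)]

K = solve_exact(CtC, CtS)
model = [sum(k * c for k, c in zip(K, row)) for row in C]

for name, k in zip(["toilet", "taco", "dog", "coworker", "febreze"], K):
    print(f"{name:9s} {k} = {float(k):+.3f}")
print("model:", [round(float(v), 1) for v in model])
print("smell:", S)
```

run as written, this should reproduce the fractions listed for K; as a decimal the dog term is about -0.32. the model row still tracks the smell row badly, which is what you'd expect from made-up numbers rather than a sign that the method is broken.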

anyway, all i used was: to solve the final equation and to do the multiplication.

[on preview - what the person above said]
posted by andrew cooke at 5:39 PM on March 14, 2006

(I should, in fairness, add that AI, neural networks, and Bayesian analysis all do have their place in regression analysis, especially as they relate to non-linear regression model optimization in the first two cases and probabilistic regression model estimation for the last. But immediately suggesting these solutions is like giving instructions on how to operate a nuclear-powered submarine to someone who's asking how to row a boat.)
posted by ChasFile at 5:46 PM on March 14, 2006

What andrew cooke very cogently demonstrates above is a linear regression analysis. All the confusing-ass calculus followed by even-more-confusing-ass linear algebra that he does is what SPSS, SAS, MINITAB, Excel, or any other statistics package does behind the scenes for you.
posted by ChasFile at 6:03 PM on March 14, 2006

while it is a linear analysis (or tries to be - i wish i knew what i did wrong), i suspect that the first thing people are going to see when they look for that phrase in a book or manual is something that fits a straight line (y = ax + b). you might be better with multiple regression (in that case b0 = 0 and P, Q, R are taco, dog, Febreze etc) (it's still linear, it's just linear in several variables).
posted by andrew cooke at 6:25 PM on March 14, 2006

I imagine Andrew remembers this, but in case he doesn't, the matrix form of OLS is b = [x'x]^-1[x'y] -- that is, [(x prime x)inverse (x prime y)]. I can't be arsed to check if that's what's already up.

rolypolyman: unless you have a very small dataset that you can't easily expand, you probably want to do a regression.

Running a basic regression is easy and doesn't require you to know any matrix math, though that is what the computer is doing behind the scenes. Likewise, interpreting the results you get is not particularly difficult though it can appear a bit daunting at first. That said, if there are consequences to getting it wrong, you might need to go beyond basic multiple regression.
posted by ROU_Xenophobe at 7:35 PM on March 14, 2006

thanks - i could never remember results like that, but it seems to agree with what i have, so i must have messed up the numbers somewhere.
posted by andrew cooke at 8:24 PM on March 14, 2006

What you want is a multiple regression analysis. I don't know of any free stuff available but any decent stats program (Statistica is what I use) will do this for you. All a bunch of simple correlations will tell you is the relative relationship between whichever of the two variables you are correlating. Correlations *never* tell you about causation and neither will the multiple regression. The multiple regression analysis at best will tell you which of your variables is contributing the most to any systematic relationships among the variables (systematic does not imply causation, only relation).
posted by bluesky43 at 5:28 AM on March 15, 2006

I don't know of any free stuff

GRETL. R (but that's overkill).
posted by ROU_Xenophobe at 8:19 AM on March 15, 2006

This kind of multiple regression is actually very straightforward in R.
If you know anything of object oriented programming languages then R should be reasonably intuitive to learn.

Using the made-up data you have provided there is no association between any of these factors and the smelliness.
Bear in mind that the more factors you want to put in your model, the more data you need. For multiple regressions you need LOTS of data, especially if you want to include the possibility that there may be interactions between your variables. This is because each estimate you make (intercept/slope) uses up a degree of freedom.

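#The code
toilet <- c(2, 1, 1, 6, 2, 1, 4, 3)
tacobell <- c(1, 1, 3, 4, 2, 1, 1, 1) # the question lists seven values; an eighth is added here
dog <- as.factor(c(1, 0, 0, 1, 1, 0, 1, 1))
coworkers <- c(5, 2, 1, 6, 3, 2, 2, 4)
febreze <- c(2, 4, 1, 1, 3, 2, 3, 5)
odor <- c(5, 2, 6, 3, 0, 3, 6, 4) # this is your response

# the line that fits the full model is missing from the post as shown;
# a main-effects model is assumed here
model <- lm(odor ~ toilet + tacobell + dog + coworkers + febreze)
model2 <- step(model) #stepwise elimination of terms (step() compares AIC by default)
summary(model2)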

#the output

Call:
lm(formula = odor ~ 1)

Residuals:
   Min     1Q Median     3Q    Max
-3.625 -0.875 -0.125  1.625  2.375

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.6250     0.7304   4.963  0.00163 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.066 on 7 degrees of freedom
posted by jonesor at 9:24 AM on March 15, 2006

Um, you guys the example data is an example of the kind of data he needs to analyze, it's all made up, so I wouldn't expect it to necessarily have a reasonable answer.
posted by delmoi at 10:18 AM on March 15, 2006

Of course delmoi. I was just demonstrating how one might go about doing this analysis in R to show how easy it is...:)
posted by jonesor at 11:06 AM on March 15, 2006

um, you don't know maths stops working when numbers are made up? what's a squiblion plus 2? eh?

incidentally, you might find a more significant answer if you dropped the zero point shift (the "intercept"). i have no idea why it's in your model (i suspect just because it's the default, which is why i have my doubts about recommending such packages unless you know people understand what they do....)
posted by andrew cooke at 11:13 AM on March 15, 2006

a) obviously this particular multiple regression is linear on theta - the relationship between the variables - not linear on x - the relationship between the variables and y. I understand that people might get that confused when first encountering it. I was simply trying to point out that one easy way to do this is to try to solve a linear combination of the smells.

b) I don't understand why we must eliminate the intercept, or as you put it b0 = 0, andrew. Could it not be that there is some baseline smell, just some sort of stanky gestalt that the other factors increase or lessen?
posted by ChasFile at 11:28 AM on March 15, 2006

With regards to the point concerning removing the intercept from a regression model (i.e. forcing the regression through the origin):
When doing no-intercept regression modelling, you make the substantive assumption that if the values of all the explanatory variables in your model are zero, then the value of the response variable must also be zero.
While this is valid in some cases (e.g. maybe the relationship between height and body mass) it is not appropriate in MOST cases - including this fictional case where there may be background smelliness.

I would argue that no-intercept regression should only be used in those cases where one could argue from a very strong theoretical position as to why the intercept SHOULD be removed from the model. To quote Draper and Smith (1998 - Applied Regression Analysis p.27) "...The omission of Bo (the intercept) from a model implies that the response is zero when all the predictors are zero. This is a very strong assumption, which is usually unjustified...".
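As a small illustration of how much that assumption can matter, the sketch below fits odor against Febreze alone, once through the origin and once with an intercept, using the made-up numbers from the question. It is plain Python with textbook one-variable least-squares formulas, and the variable names are just illustrative:

```python
febreze = [2, 4, 1, 1, 3, 2, 3, 5]
odor = [5, 2, 6, 3, 0, 3, 6, 4]
n = len(odor)

# Regression through the origin: odor = slope_origin * febreze
slope_origin = sum(x * y for x, y in zip(febreze, odor)) / sum(x * x for x in febreze)
rss_origin = sum((y - slope_origin * x) ** 2 for x, y in zip(febreze, odor))

# Regression with an intercept: odor = intercept + slope * febreze
mean_x, mean_y = sum(febreze) / n, sum(odor) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(febreze, odor))
         / sum((x - mean_x) ** 2 for x in febreze))
intercept = mean_y - slope * mean_x
rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(febreze, odor))

print(f"no intercept:   slope = {slope_origin:+.2f}, residual SS = {rss_origin:.1f}")
print(f"with intercept: slope = {slope:+.2f}, intercept = {intercept:.2f}, residual SS = {rss:.1f}")
```

Forced through the origin, the Febreze slope comes out positive (71/69, about +1.03), because the line has to climb from zero to reach the average smell. With an intercept the slope is about -0.37 and the residual sum of squares is less than half as large. The data is fictional, so neither slope means anything, but the sign flip shows how the no-intercept assumption can drive the answer.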
posted by jonesor at 3:02 AM on March 16, 2006
