Statistics Filter: Help me determine whether I can predict B from A.
May 3, 2010 9:00 PM   Subscribe

Teach me - like you would a second-grader - enough about statistics, correlation, and/or regression analysis to solve a specific problem.

Assume I have two parameters for approximately 5.5 million entities: (A) a score (from 0 to 1000) and (B) a boolean that indicates something else about that entity that may or may not be related.

How do I determine whether or not there's any statistical relationship between (A) and (B)? That is, I'd like to know whether it's feasible to infer (B) from (A), and if so, how strongly (ideally via a single metric) I might be able to do so.

What (public-domain) tools out there might be able to help me determine the answer to this question? (If it helps, (A) and (B) are fields in a mysql database.)
posted by aberrant to Science & Nature (6 answers total) 8 users marked this as a favorite
Here is a chart to help you pick a test. The easiest (that I think is appropriate) would probably be a Mann-Whitney U test. (There are a couple of implementations listed, notably ALGLIB.)
posted by anaelith at 9:44 PM on May 3, 2010

In this case, your dependent variable (B) is a dummy variable (i.e., it only has two values) and you have one independent variable (A) which is continuous. If you have a dummy dependent variable, you can use logistic regression. Here’s a java applet that demonstrates how the actual calculation.
posted by Jasper Friendly Bear at 10:43 PM on May 3, 2010

Are these entities sampled independently from the same population?

If so, and your goal is to create a prediction model for the boolean outcome (infer (B) from (A)), your best bet (not necessarily the very best model, but the one most feasible for you) is to create a logistic regression model. This is a model in which the log-odds of the boolean outcome are modeled as a linear function of terms including the dependent variable (the score) or any suitable mathematical transformations of the score. Each term is weighted with a coefficient, which is determined based on a best fit of your data set (the way that fit is figured out is often using a method called maximum likelihood estimation). Most statistics packages can do all the heavy lifting for you and pop out a model with coefficients and all sorts of goodness-of-fit statistics, but I have to admit that this sort of stuff is really not second-grader appropriate, and one should tread a bit lightly into it without requisite education/training or guidance.

Also knowing more about the specifics of the score and the boolean would be helpful in creating a model. For example, do you have reason to believe that if an association exists, it is monotonic?

With a sample size as massive as yours, you can then use any of a number of cross-validation approaches that will provide metrics for how good the model is.

I don't know if you're looking for something more or less serious/robust then this, but feel free to MeMail me if you think I can help.
posted by drpynchon at 10:50 PM on May 3, 2010

This is a classic machine learning problem. As others have suggested, logistic regression might be an answer. You're lucky in that your scores are one-dimensional.

First though you want to explore your data. What's the mean score for all of your 'A' entries?
If you divide the 'A' scores into those with B==1 and B==0; what's the mean of A0 and of A1? What are the standard deviations? What does it look like if you plot it?

You get some kinds of data where transforming it first helps (for example, if your A1 values are all around 1 and -1, but your A0 values are all around 0, it is hard to separate A1 and A0 in the original space. But it's trivial if you square them first). For this reason starting with a data exploration and visualisation step is a good idea - if you look and see that all the A0s are over 10 and all the A1s are under 5, you can separate them with a simple threshold.

Weka is a very useful and powerful open source machine learning package, which is probably a sledgehammer to crack a nut, but will certainly have the tools for the job. It'll also do cross validation for you automatically if you ask for it.
posted by handee at 12:28 AM on May 4, 2010

On the one hand, you're lucky to have 5.5 million observations.

But on the other hand, they're going to cause you problems -- with 5.5M observations, I can guarantee you that you're going to find a statistical relationship between A and B. With very large datasets, the usual advice is to not care very much about statistical significance, as almost any relationship will be statistically discernible, and to go straight to substantive significance.

Public domain tools that will handle 5.5M datasets probably limits you to R. If you can spend other people's money, interpreting this sort of thing is easier in Stata.

First, clean your data to get rid of observations that don't have a B or don't have an A.

Then do whatever to get the data out of mysql and into R. This will probably mean dumping the data to a csv and using R's "read.csv" command:

data <>
(You may need to add options to the read.csv command depending on your dataset)

Then type:

outputobject <>
This will run the model, but it won't show you anything. So poke the output object:


And the bit you want to look at is "Coefficients:" Look for the row labeled "a". The estimate is the estimate of the effect of a on b, but interpreting it takes more work. Look over at the Pr > |Z| column or something similar to that; you will probably see a very small number (or just 0) there. If you see a number there bigger than 0.05 or 0.10, stop; there is no clear relationship between A and B, and you could bet your car on that.

If the Pr>|Z| is smaller than 0.05, then you have a statistically significant relationship but not necessarily one of any substantive importance. So you really want to look at the substantive effect of A on B. The catch is that this is a little bit tricky in a logit model; for technical reasons I or someone else can describe if you want, there's an intermediate step between the raw coefficient and the probability of B.

Anyway, do this:

(1) Get and install a copy of gnuplot.
(2) Start it and do this, where b0 is the estimate for the intercept and b1 is the estimate for the coefficient of a:

set xrange[0:1000]
plot logit(x)

And that will pop up a plot of how the probability of B increases or decreases as you increase A. You can then interpret that plot directly to assess whether the effect of A is substantively important or not, and what the critical range of A that really drives the probability of B is (if there is one).
posted by ROU_Xenophobe at 6:51 AM on May 4, 2010

(the R command stuff got munged by the anti-html thingie. memail me if you want)
posted by ROU_Xenophobe at 9:44 AM on May 4, 2010

« Older Help me mold our no tv life   |   How do I become a better writing tutor/coach for... Newer »
This thread is closed to new comments.