Statistical Analysis Help
May 15, 2007 6:33 PM

Simple Stat Filter: I'm trying to do a statistical analysis of correlation between two value fields, but all the simple tests that I know are inapplicable. Help me MeFis!

Here's the basic setup: I have survey data from individuals. I am trying to draw a connection between two values that each individual entered (in one field they enter a number from 0-10, in the other a number from 0-25). I hypothesize that the higher the number in field 1 (e.g. 10), the higher the number in field 2 (e.g. 25), and conversely, the lower the one, the lower the other (perhaps not linear, but directional?). So I have 35 pairs of the two values, and I've tried but failed with t-tests and chi-squared.

How should I do this? Should I post the data here? Is it possible to do this test without an "expected" value structure, which I think would be artificial?

Thanks a bunch
posted by stratastar to Education (11 answers total) 1 user marked this as a favorite
 
It sounds like you should be able to learn something just by estimating the correlation coefficient, which is defined as the covariance over the product of the standard deviations. This has the effect of imposing a linear model on your data, but you have to make assumptions to reduce a cloud of points to a single number.

What software are you using? In Excel, the CORREL function will calculate the correlation coefficient for columns you give it.
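
For anyone without Excel handy, here is a minimal Python sketch of that definition (covariance over the product of the standard deviations). The function name `pearson_r` and the six data pairs are made up for illustration; they are not the poster's survey data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient = covariance / (sd_x * sd_y)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and sample standard deviations (n - 1 denominators).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical responses: field 1 is on a 0-10 scale, field 2 on 0-25.
field1 = [1, 3, 4, 6, 8, 9]
field2 = [4, 9, 10, 15, 21, 22]
r = pearson_r(field1, field2)
print(round(r, 3))
```

A value near +1 or -1 means the points lie close to a straight line; a value near 0 means no linear trend.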
posted by grobstein at 6:49 PM on May 15, 2007


Response by poster: I could use Excel, but I'm more familiar with my TI-83. So by estimating the correlation coefficient, I would be imposing a linear function (2.5x) and then testing for variance from that linear function?
posted by stratastar at 7:07 PM on May 15, 2007


Like grobstein said, it's probably a good idea to do a linear regression and find the correlation of the two data sets. Then you can do an inference test to decide if the slope of the regression line is 1 or not (free response question 6c).

Year-end AP Stats project? My students are starting theirs on Monday :)
posted by msittig at 7:07 PM on May 15, 2007


I'm sorry, test whether the slope of the LSRL is different from or greater than 0.
posted by msittig at 7:08 PM on May 15, 2007


I'd recommend linear regression. Say you regress field 2 on field 1. The coefficient on field 1 tells you the effect of a one-unit change in field 1 on field 2. The p-value associated with that coefficient tells you whether it is statistically different from zero.

Practically speaking, I would also banish from your mind any worries about non-linearity. Regression is fantastic at giving you the right answer even in the presence of minor hiccups such as that.
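
To make the regression suggestion concrete, here is a rough sketch of a least-squares fit of field 2 on field 1, along with the t-statistic for the slope. The function name `simple_regression` and the data are assumptions for illustration only; getting a p-value from the t-statistic still needs a t-table or a calculator.

```python
from math import sqrt

def simple_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, se_slope, t_stat), where t_stat tests
    whether the slope differs from zero (n - 2 degrees of freedom).
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equal-length lists with at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the standard error of the slope.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = sqrt(sse / (n - 2) / sxx)
    return slope, intercept, se_slope, slope / se_slope

# Hypothetical responses, not the poster's data.
field1 = [1, 3, 4, 6, 8, 9]
field2 = [4, 9, 10, 15, 21, 22]
slope, intercept, se_slope, t_stat = simple_regression(field1, field2)
print(round(slope, 3), round(intercept, 3), round(t_stat, 2))
```

Note that the fit estimates its own slope and intercept, so nothing forces the line to be 2.5x just because the scales are 0-10 and 0-25.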
posted by shadow vector at 7:32 PM on May 15, 2007


Finding the correlation coefficient has the effect of telling you the fit of a simple linear model; you don't have to run any tests on it (it's just a scalar value, what could you do?).

But msittig (who teaches this stuff! yay!) is totally right that you can estimate a linear regression model, and test the fit (t-stat of the single coefficient seems like a good way to do it, I think).

On preview: yeah, and sv is right about the non-linearity
posted by grobstein at 7:39 PM on May 15, 2007


The significance of the correlation coefficient depends heavily on sample size, so treat it with caution. Are the data approximately normally distributed? If so, and the variable is measured on an interval scale (i.e. 2-1 = 3-2 = 4-3 etc.), then you want the Pearson product-moment correlation; otherwise use the Spearman rank correlation. If the latter, you can't (easily) show the regression line.
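
A sketch of the Spearman alternative, for what it's worth: rank each field (tied values share the average of their ranks), then compute the ordinary Pearson correlation on the ranks. The helper names `average_ranks` and `spearman_rho` and the data are made up for illustration.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values all get the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)

# Hypothetical survey answers with some ties, as 0-10 scales tend to have.
field1 = [2, 5, 5, 7, 9, 10]
field2 = [3, 8, 12, 12, 20, 25]
print(round(spearman_rho(field1, field2), 3))
```

Because it only uses ranks, this measures whether the relationship is directional (monotonic) without assuming it is linear.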
posted by singingfish at 7:53 PM on May 15, 2007


Nthing linear regression. It will do what you want. Test if the coefficient associated with the independent variable is statistically different from zero. Also, the sign on the coefficient will tell you if the two variables are positively or negatively related.

Why are you worried about non-linearity? I don't think you need to sweat it, but if you are worried (and want to have an extra-good stats project), just take your independent variable, square it, and include that in your regression. Magically, simple linear regression has become multiple regression! If the t-stat associated with the squared term is statistically significant, your data are not linear. You can put in as many higher order terms as you like (each higher order term lets the regression line change direction one extra time), but if the squared term isn't significant, higher order terms are unlikely to be either. Good luck!
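
Here is a rough sketch of the squared-term idea: fit y = b0 + b1*x + b2*x^2 by ordinary least squares and look at the t-statistic on b2. The helper names `invert` and `quadratic_fit` and the deliberately curved data are assumptions for illustration; a stats package would normally do this for you.

```python
from math import sqrt

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for a tiny matrix)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def quadratic_fit(xs, ys):
    """OLS fit of y = b0 + b1*x + b2*x^2; returns (coefs, t_stats)."""
    rows = [[1.0, x, x * x] for x in xs]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    inv = invert(xtx)
    coefs = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sigma2 = sse / (len(xs) - k)  # residual variance, n - 3 degrees of freedom
    t_stats = [c / sqrt(sigma2 * inv[i][i]) for i, c in enumerate(coefs)]
    return coefs, t_stats

# Hypothetical, deliberately curved data (roughly y = 0.25 * x^2 plus noise).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 1, 1, 2, 4, 7, 9, 12, 16, 20, 25]
coefs, t_stats = quadratic_fit(xs, ys)
print([round(c, 3) for c in coefs], round(t_stats[2], 2))
```

A large t-statistic on the squared term (the last one) suggests curvature; a small one suggests a straight line is adequate for these data.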
posted by jtfowl0 at 7:57 PM on May 15, 2007


Oh, for goodness' sake, just make a scattergram of the two fields and look at the results. (Fortunately, your TI-83 has graphing capability.) If the points are randomly distributed over the page, why bother with statistical tests? If there's a visible trend, linear or nonlinear, then it's worth pursuing.
posted by exphysicist345 at 8:22 PM on May 15, 2007


As you pointed out, you shouldn't necessarily hypothesize, even if one scale goes from 0-10 and the other from 0-25, that the second will on average be 2.5 times the first. There might be a different linear relationship, one with a nonzero intercept and a different slope. (Additionally, the relationship could be nonlinear, but with this little data, as shadow vector said, that isn't practical to worry about.)

So you want to do a linear regression, then find r, the correlation coefficient. To find the p-value for a given correlation coefficient, you need to go from the coefficient to a t-statistic:

t = r / sqrt((1 - r^2) / (N - 2)), which follows a t-distribution with N-2 degrees of freedom.

Now you can do whatever significance testing you want.
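
As a quick sketch of that conversion, using the N = 35 from the question: the function name `t_from_r` and the three r values tried are made up for illustration. With 33 degrees of freedom, a standard t-table puts the two-sided 5% cutoff at roughly 2.03, but check your own table or calculator.

```python
from math import sqrt

def t_from_r(r, n):
    """t-statistic for a correlation r from n pairs (n - 2 degrees of freedom)."""
    if n < 3:
        raise ValueError("need at least 3 pairs")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    return r * sqrt(n - 2) / sqrt(1 - r * r)

n = 35  # number of survey pairs mentioned in the question
for r in (0.2, 0.34, 0.5):  # hypothetical correlation values
    print(r, round(t_from_r(r, n), 2))
```

On a TI-83, the same p-value should be obtainable by feeding the t-statistic and the degrees of freedom to the t-distribution CDF.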
posted by goingonit at 9:27 PM on May 15, 2007


Hmmm.....

Before you start employing inferential statistics (e.g. T-tests, Chi-squared, significance levels, etc.), you need to explain how you gathered your data, and ask yourself why such statistics would be appropriate.

Inferential statistics assume a model that somehow generates randomness. The usual example is a random sample taken from a population, from which you are trying to draw inferences about the population.

But if the individuals in your study were not selected using a random sample, then exactly what is the model you're using? And where does the randomness come in?

If you can't answer these questions, then all this talk of T-tests, significance, and so on, is utterly meaningless.
posted by mikeand1 at 11:08 PM on May 15, 2007

