linearity without the scatterplot
February 28, 2009 2:48 PM   Subscribe

How do I check for linearity?

Is it possible, without ever drawing a scatterplot, or trying to interpret one, to know whether there's a linear relationship between two variables?

I have 2 columns in excel with numbers for the dependent and independent variable and I would like with 100% certainty to determine whether they have a linear relationship for purposes of regression.
posted by ttyn to Computers & Internet (9 answers total) 1 user marked this as a favorite
Best answer: Yes, calculate the correlation coefficient between the two columns. I don't know the Excel function for it off the top of my head, but I'm sure there is one; look in the help files for "correlation coefficient".
posted by Flunkie at 3:01 PM on February 28, 2009

Response by poster: +/- 1 being totally linear?
posted by ttyn at 3:02 PM on February 28, 2009

posted by Flunkie at 3:03 PM on February 28, 2009

Well, and conceivably 0 is linear as well, but only if at least one of the columns only has one possible value.
posted by Flunkie at 3:05 PM on February 28, 2009

Best answer: The correlation coefficient will tell you how strong a linear correlation is, and is commonly used for the purpose you mention. To be more correct, F-test can tell you how well a linear regression model fits your data.

There is no 100% certainty in statistics, but an F-test might reveal that the probability that the model does not fit well is, say, less than 10–16.

Excel's statistics functions are notoriously buggy (web page down right now, see the Google cache).
posted by grouse at 3:09 PM on February 28, 2009 [1 favorite]

Best answer: I don't know how set on using Excel you are, but here's a quick example in R. 'seq' just generates sequences 1,2,3,... for the first, and 0, -10, -20, ... for the second. 'lm' is linear model, y ~ x is 'y varies with x' (y is your data, x is your model), summary just gets a bunch of useful stuff about it.

R reads in .csv files, which can be exported from Excel very easily, with read.csv(). Just another option.

> x = seq(from=0,to=100,by=1)
> y = seq(from=0,to=-1000,by=-10)
> summary(lm(y~x))

lm(formula = y ~ x)

Min 1Q Median 3Q Max
-1.475e-13 -7.112e-14 -3.273e-14 1.264e-14 2.289e-12

Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.335e-13 4.890e-14 1.296e+01 <2> x -1.000e+01 8.449e-16 -1.184e+16 <2> ---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.475e-13 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.401e+32 on 1 and 99 DF, p-value: < 2.2e-16
posted by devilsbrigade at 3:27 PM on February 28, 2009

Response by poster: I love you all!
posted by ttyn at 3:40 PM on February 28, 2009

In a correlation, +/- 1 would mean totally linear ***AND*** without error -- all the points fall on a line. If the relationship itself is linear but the data are noisy, then your correlation might be .8 or .7, depending on how big the error term is. And data from the real world are almost always pretty noisy, especially data that relate to human behavior.

Is there some pressing reason why you don't want to toss up a scatterplot? If you're not in excel, doing so is trivial. "scatter y x" in stata, or "plot(x,y)" in R. Up it pops, big as life and twice as ugly.

Likewise, is there a pressing reason why you need to check for linearity before you run the regression instead of by running the regression or after running it by scatterplotting the residuals?
posted by ROU_Xenophobe at 6:00 PM on February 28, 2009

Is it possible, without ever drawing a scatterplot, or trying to interpret one, to know whether there's a linear relationship between two variables?

Nope. You GOTTA look at the data. The numbers are meaningless. Check this out:
posted by tiburon at 7:54 PM on March 1, 2009 [1 favorite]

« Older Eggs, Toast, and MasterCard   |   Griffin Air Click - any way to make it work or... Newer »
This thread is closed to new comments.