StatisticsFilter: Correlation over time?
December 11, 2007 12:40 AM   Subscribe

If I'm looking at existing statistical data, how do I try to find a correlation between two variables over time? Scatterplots have confused me, because they don't seem to reflect the passage of time, but just where x and y meet. Pearson's r doesn't seem to do it either w/r/t time.

So, what statistical method should I use? I'm not sure if the variables are independent or not, so that's another stew. At a series of given points in time I have an x and a y, but I need to understand how to know whether or not any correlation exists.

So data might look like this:

date x y
1/1 5 32
1/8 7 15
3/9 2 26

Am I underthinking this, or just ignorant? Does Excel already know how to do this and I just don't know it? No correlation is fine too, I'd just like to know I'm doing it properly.
posted by trondant to Science & Nature (13 answers total) 4 users marked this as a favorite
I'm not at all time-series analysis expert, but I think you want the cross-correlation between the two variables. Unfortunately, I don't know if excel has the ability to do cross-correlations. You may want to look at the statistical programming language R and its ccf() function to start, though I think that might not be the best choice if you aren't measuring your data at regular intervals.

A graph of the data is thankfully easier to create. If you convert the dates to integers (excel will do this automatically if you say the cells are numbers), you can see the relationship between the data points by plotting (date, X) and (date, Y) on the same axes. Plotting the data in this way should give you an idea of what's going on even if you never actually calculate the cross-correlation.
posted by eisenkr at 1:11 AM on December 11, 2007

If you want to manually look in the data for your correlations, your scatterplot shouldn't be y versus x. Instead it should be y versus t and x versus t on the same plot.

In the long run you will probably regret trying to do any serious statistics in Excel. It's been a long time since I dealt with its anemic and clunky statistical abilities, but I doubt it can deal with multivariate time series data.

There's a whole chapter on dealing with time series data (which is what this is called, for your Googling needs) in R in the book Modern Applied Statistics with S. Frankly, it looks really complicated.

It would help a lot to know a little more about the data.
posted by grouse at 1:25 AM on December 11, 2007

IIRC, a lot of how you handle these time series is dependent on whether or not you have data for evenly spaced time intervals. If the intervals are uneven, the analysis will be messy regardless of what statistical package you use.
posted by thisjax at 1:36 AM on December 11, 2007

Off the top of my head a crude way to deal with this is to look at y versus t and x versus t on the same plot. You could then use a chi squared statistic to see if the y/t series differs from the x/t series. You can also use the cell chi squared to see where the biggest differences in the points on the line lies.

I know very little about time series analysis, and so what I have to say is just off the cuff. I suspect that the problem with even/uneven intervals is more severe when the x and y variables are not measured at the same time.

Oh, you could also convert the time into number of days from 0 (t') and calculate the correlation between x/t' and y/t' separately then use the Williams test to test the significance of the difference of two independent correlations, again pretty crude, but maybe that's what you want. I guess the williams test then tells you if x and y are really independent or not.

Oh yeah, I really really really strongly discourage you from using Excel to do this stuff. R (mentioned elsewhere) is excellent and free, if hard to use, or get a 30 free trial of JMP which is excellent stats software. OTOH you're going to have to calculate the williams test by hand anyway ;-)
posted by singingfish at 2:12 AM on December 11, 2007

You could use different sized dots on a scatterplot for one of your current axis.
Here's an example, using Google Charts
posted by seanyboy at 2:26 AM on December 11, 2007

Your question is ill-specified. What are you trying to determine? Whether a correlation between x and y eventually develops over time? Or whether x and y and simply correlated?
posted by zaebiz at 3:54 AM on December 11, 2007

How accurately did you create the example data you gave? If you have only two data points per time period (one x value and one y), like in your example, then correlating those two variables will tell you how closely those two variables stick together over time. A scatterplot of this data should fall on a diagonal line if they are perfectly correlated. If you want to know if the data are steadily increasing or decreasing, you could use seanyboy's suggestion of different sized dots. Then you'd want to see a diagonal line of dots that increase consistently.

If your data actually has multiple data points per time period per variable, then this won't work as well. If you just want a rough idea of the correlation, you could use cross-panel correlations (i.e., correlate X at time 1 with Y at time 1,2,3, et.c; correlate X2 with Y2,3,etc.), but those correlations don't tell you anything about growth over time (only relative rank order). If you want to look at growth, you might use some of the exploratory methods Willet and Singer outline in Applied Longitudinal Data Analysis in Chapter 2. You can regress scores both variables on time (by person) and look at the growth curves of both variables.

From what I see in your question, it's not quite time to dive into R. Maybe you can give us some more specifics and we can tailor our suggestions.
posted by parkerjackson at 5:43 AM on December 11, 2007

You would do this with any number of time-series techniques.

Unfortunately, time series work is dark magic. There are lots of things you'd want to think about.

(1) Do you think X causes Y, or do you think changes in X cause changes in Y? You model these differently.

(2) Do you think X *now* causes Y *now*, or that the last X causes this Y, or that Y now is determined by X two or three or four periods ago?

(3) Is Y now caused by Y last time? Is the change in Y from last time to this time caused by the change in Y from t-2 to t-1?

And no doubt more; I am far from an expert on or even competent at time series.

The practical upshot is that you should give this problem to someone who knows time series, especially if the analysis is at all serious.
posted by ROU_Xenophobe at 5:54 AM on December 11, 2007

Others are correct that a more sophisticated stats package would be preferred. I will assume you only have access to Excel.

If you are looking only for correlations, use the Correlation function found in Tools | Data Analysis (you may need to install the Data Analysis add-in). Run the correlation on the date column as well as your two variables of interest. In Excel, date is a serial number that increments by 1 for each day. Thus, you can correlate the passage of time with another variable. With the data you provided the result would be:
        Date     X       Y
Date    1.00
X      -0.88    1.00
Y      -0.54    0.06    1.00
The interpretation is that Date and X have a relatively high negative correlation, but that X and Y have a relatively weak relationship. Note that, unlike regression, correlation does not "hold everything else constant."

If you have some priors about causation (e.g. Y is caused by the passage of time and X), then you would want to use the Regression function.

Note: Excel scatterplots do not understand dates. If you put time on the X-axis in an Excel scatterplot, the format-axis options you get with line or bar graphs do not appear. This really stinks.
posted by GarageWine at 9:48 AM on December 11, 2007

I'm really bad at totally abstract thinking, so I need a more concrete example.

Is X the same variable over time? Say, measuring a child's height over a period of months?

Or are X and Y related? Say, the number of fish in river = X, amount of pollution in river = Y?

I'm assuming it's the latter, in which case you do need a scatterplot to check for autocorrelation error. See this page for more. You can also download a trial version of SPSS if you're on Windows (registration req'd). You can definitely do time series analysis in SPSS.
posted by desjardins at 10:21 AM on December 11, 2007

(You'll want to scroll down near the bottom of the page I linked and click SPSS 16.0 for the 14-day trial download.)
posted by desjardins at 10:22 AM on December 11, 2007

What is the research question? Are you sure it is just a correlation that you want?
posted by tiburon at 11:09 AM on December 11, 2007

in which case you do need a scatterplot to check for autocorrelation error

You'd want a durbin-watson statistic.
posted by ROU_Xenophobe at 11:52 AM on December 11, 2007

« Older Finding a class B property rental.   |   I want to cross 3 more countries off my list. Newer »
This thread is closed to new comments.