What statistical test or model do I use?
May 21, 2013 8:03 AM

So I'm trying to find how two variables are related, and I have a mountain of data.

I want to see if there is any relationship between two categories of events that occur over a long range of time. This data is coded by date.

So say Event Type A occurs on January 1st. I want to see whether that makes it more or less likely that Event Type B will occur on January 3rd (or within a range of dates following January 1st) than if Event Type A had not occurred.

Basically I need the name of this type of analysis so that I can look up how to do it and find more information about it, but without a name, I'm lost.

Thanks.
posted by MisantropicPainforest to Education (12 answers total) 3 users marked this as a favorite
 
It sounds like you're trying to do lagged regression.
posted by PMdixon at 8:18 AM on May 21, 2013


A t-test of the difference in proportions between (days when B occurs / all days) and (days when B occurs within the range of A / all days within the range of A)?
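A minimal sketch of this kind of comparison, under a few assumptions that are not in the thread: the "range of A" is taken to be the 1 to `window` days after each A event, the comparison group is the days outside that range (so the two groups do not overlap), and the large-sample z-test for two proportions stands in for the t-test. The function names and the counts are hypothetical.

```python
import math
from datetime import date, timedelta

def days_in_window(a_dates, window):
    """Return the set of dates that fall 1..window days after any A event."""
    return {d + timedelta(days=k) for d in a_dates for k in range(1, window + 1)}

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test.

    x1/n1: days with B / total days inside the window after A.
    x2/n2: days with B / total days outside the window.
    Returns (z, two-sided p-value) using the normal approximation.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("B occurs on every day or on no day; test is undefined")
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: B on 40 of 100 in-window days, 20 of 100 other days.
z, p = two_proportion_z(40, 100, 20, 100)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

The test treats days as independent observations, which may not hold if B events cluster in time.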
posted by thirteenkiller at 8:20 AM on May 21, 2013


Logistic regression could be used to determine the probability of Event Type B occurring on January 3rd (yes/no) if Event Type A occurred on January 1st (yes/no).
posted by Young Kullervo at 8:29 AM on May 21, 2013


Logit shouldn't work since the Painforest isn't specifically interested in whether event A at time t affects the probability of event B at time t+2. Instead (s)he's interested in whether event A at time t affects the probability of event B soonish.

Far from my expertise but my experience is that blundering ahead will push the SOMEONE IS WRONG ON THE INTERNET button and cause people who know better to chime in...

One way to think about this is that you have two time series, one for event A and the other for event B. You want to know if the time series for A affects the time series for B. This would be a vector autoregression.

Another way to think about it is how long it takes for event B to happen. Does event A make event B happen sooner or later than it would otherwise? These are duration or survival models.
posted by ROU_Xenophobe at 8:44 AM on May 21, 2013


Regular old logistic regression will let you determine the odds of B with recent exposure to A compared to odds of B with no exposure to A.
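As a rough illustration: with a single yes/no predictor ("recent exposure to A"), the fitted logistic-regression coefficient is the log of the odds ratio from the 2x2 table of days, so the comparison can be sketched without a modelling library. The function name and the counts below are hypothetical, and the interval is the usual large-sample (Woolf) approximation.

```python
import math

def odds_ratio(b_exposed, no_b_exposed, b_unexposed, no_b_unexposed):
    """Odds ratio of B for exposed vs. unexposed days, with a 95% interval.

    All four arguments are counts of days and must be non-zero.
    """
    ratio = (b_exposed * no_b_unexposed) / (no_b_exposed * b_unexposed)
    # Standard error of log(odds ratio), Woolf's method.
    se_log = math.sqrt(
        1 / b_exposed + 1 / no_b_exposed + 1 / b_unexposed + 1 / no_b_unexposed
    )
    low = math.exp(math.log(ratio) - 1.96 * se_log)
    high = math.exp(math.log(ratio) + 1.96 * se_log)
    return ratio, (low, high)

# Hypothetical counts: B on 20 of 100 exposed days and 10 of 100 unexposed days.
ratio, (low, high) = odds_ratio(20, 80, 10, 90)
print(f"odds ratio = {ratio:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

A full logistic regression becomes worthwhile once other predictors of B need to be adjusted for, which this 2x2 shortcut cannot do.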

If you want to look specifically at time-to-event B after exposure to A, you could use the same data to conduct a survival analysis. Your comparison group could come from non-case controls (people unexposed to A) or from the cases themselves at some earlier time point. Szklo and Nieto do a good job of walking you through how to choose a relevant comparison group in their book Epidemiology: Beyond the Basics.
posted by The White Hat at 9:05 AM on May 21, 2013


If you have that much data, you should really start by drawing a graph. Let the x-axis be time since the last A-event, and the y-axis be the number of B-events (you can choose the bin size as you wish). If B is a Poisson process (your null hypothesis, meaning B occurs at random times with no correlation between events), I think you should get an exponentially decaying graph, i.e. P(t) ~ e^(-λt), where λ is the average rate of B. You can draw that curve on your graph. If A causes B to happen earlier, you should see more events (above the predicted trend line) at early times, and fewer events at later times.
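A minimal sketch of the numbers behind such a graph, under one reading of it: for each A-event, measure the wait until the next B-event, bin those waits, and compare the bin counts with what an exponential distribution at B's average rate would give. Times are assumed to be plain numbers (for example, days since the start of the record), and both function names are hypothetical.

```python
import bisect
import math

def waits_to_next_b(a_times, b_times):
    """For each A event, the time until the first B event strictly after it."""
    b_sorted = sorted(b_times)
    waits = []
    for a in a_times:
        i = bisect.bisect_right(b_sorted, a)  # index of first B later than a
        if i < len(b_sorted):                 # skip A events with no later B
            waits.append(b_sorted[i] - a)
    return waits

def observed_vs_expected(waits, rate_b, bin_width, n_bins):
    """Bin the waits and compute the counts expected under an exponential
    waiting time with rate rate_b (the null of B as a Poisson process)."""
    observed = [0] * n_bins
    for w in waits:
        k = int(w // bin_width)
        if k < n_bins:
            observed[k] += 1
    expected = [
        len(waits)
        * (math.exp(-rate_b * k * bin_width) - math.exp(-rate_b * (k + 1) * bin_width))
        for k in range(n_bins)
    ]
    return observed, expected

# Hypothetical event times, in days since the start of the record.
a_times = [0, 10, 25]
b_times = [3, 4, 12, 40]
waits = waits_to_next_b(a_times, b_times)
rate_b = len(b_times) / 40  # B events per day over a 40-day record
observed, expected = observed_vs_expected(waits, rate_b, bin_width=5, n_bins=4)
print(observed, [round(e, 2) for e in expected])
```

With data coded by whole days the exponential curve is only an approximation, so small differences in the first bin should not be over-read.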
posted by spaghettification at 10:33 AM on May 21, 2013


You might look into Granger causality, which is designed to test whether Series A is useful in forecasting Series B. It does have its limitations, but would be a better stab at the problem than regression, I think. Here is a toolbox link in MATLAB.

In the case of discrete events, I would also recommend looking at spike train synchronization measures (these come from neuroscience), which could very easily be adapted for trains of event data.

ps. I realize these might be more complex than what you need, but they employ the same math as general statistics or other fields.
posted by ssri at 10:42 AM on May 21, 2013


The temporal aspect of this question is distracting people and resulting in far more complicated answers. It isn't really relevant other than as a how-to-code-the-data issue.
posted by srboisvert at 11:53 AM on May 21, 2013


"The temporal aspect of this question is distracting people and resulting in far more complicated answers. It isn't really relevant other than as a how-to-code-the-data issue."

If the time lag were constant and known, then I would agree. But neither appears to be the case, and so whatever statistical procedures are used must be able to cope with that.
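One simple way to explore an unknown lag, offered only as a sketch: for each candidate lag k, compute how often B occurs exactly k days after an A, and compare that with B's overall daily rate. Days are assumed to be integer indexes from 0 to n_days - 1, and the function name is hypothetical.

```python
def lag_profile(a_days, b_days, n_days, max_lag):
    """Return (baseline, profile), where baseline is the share of all days
    with a B event and profile[k] is the share of A days followed by a B
    event exactly k days later."""
    a_set, b_set = set(a_days), set(b_days)
    baseline = len(b_set) / n_days
    profile = {}
    for lag in range(1, max_lag + 1):
        # Only A days whose lagged day still falls inside the record.
        eligible = [d for d in a_set if d + lag < n_days]
        if eligible:
            hits = sum(1 for d in eligible if d + lag in b_set)
            profile[lag] = hits / len(eligible)
    return baseline, profile

# Hypothetical record of 10 days: A on days 0 and 5, B on days 2 and 7.
baseline, profile = lag_profile([0, 5], [2, 7], n_days=10, max_lag=3)
print(baseline, profile)
```

Scanning many lags invites spurious peaks, so any lag that stands out here would still need a proper test on separate data or with a correction for multiple comparisons.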
posted by a snickering nuthatch at 12:31 PM on May 21, 2013


Do the events necessarily alternate? As in, can B happen twice before A happens again?
posted by tinymegalo at 7:13 PM on May 21, 2013


Response by poster: Yes, and there are a number of other factors that cause B.
posted by MisantropicPainforest at 6:43 AM on May 22, 2013


How closely spaced are the events? Would an A potentially cause more B's beyond the next one? Are the B-B intervals plausibly independent / a Poisson process? My instinct is to call this a variable-intensity or non-homogeneous Poisson-process model, or maybe a Cox process. Potential starting place. This is not something that I do routinely.
posted by a robot made out of meat at 3:46 PM on May 22, 2013 [1 favorite]

