Statistics: correlations for a subset of a time series of points?
October 5, 2015 9:43 PM

I have a time series of data - number of "detections" per hour, over time. Each of these points also has two attributes associated with it, x and y. I also have a second time series of data which I expect to correlate with that number of detections per hour, but only with detections within certain combinations of those x and y values. I want to find what range of x and y values has the best correlation.

I don't really have the words to describe what I need to do, so I'll refer to the following graph.

Here it is.

At the bottom I have a time series of counts-per-hour of these events (which are, by the way, counts of particles measured by a new particle detector). I've also got another set of data, measured by a different method (a traditional manual pollen counter), that will also provide a measure of counts-per-hour (although probably at a different scale).

Each of the counts also has an x value and a y value associated with it. At the top, I've taken all the counts in the bottom chart, and plotted the x and y bins they fall into.

Now I want to establish a correlation between this hourly time series and my alternative validation time series - but I know that the correlation will be strongest for some subset of those counts - for example, it may be that the highest degree of correlation between the two time series is found just for the subset where x is between 50 and 75 and y is between 0.2 and 0.5. This "zone" would represent particles most likely to be pollen. Counts outside this zone might be other things, e.g. dust, which aren't picked up by my traditional pollen counter. Ultimately I want to be able to find these optimum ranges for the x and y values.

Intuitively, I can see that I could take various subsets of the data for various ranges of x and y values, perform a time series regression against my validation data, and then look for the subset that performed best. Repeating this hundreds of times for various subsets is certainly possible, but is there an established statistical technique I could use rather than resorting to that sort of brute-force approach?
posted by Jimbob to Science & Nature (6 answers total) 2 users marked this as a favorite
 
I had something longer written out, but then I reread your question. Optimizing for two ranges, yeah, I'd totally just brute-force it (assuming you can do it programmatically). I'd try ranges at various scales and then plot them with the "pixels" colored by goodness-of-fit. Correct region should light right up.
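
Very roughly, something like this in R (assuming, purely hypothetically, a data frame det with one row per detected particle and columns hour, x and y, plus a validation data frame valid with columns hour and count; the break points are made-up scales, not recommendations):

```r
# Brute-force scan over candidate x/y windows, scoring each window by how well
# its hourly detector counts track the manual pollen counts.
library(dplyr)

score_range <- function(det, valid, x_lo, x_hi, y_lo, y_hi) {
  hourly <- det %>%
    filter(x >= x_lo, x < x_hi, y >= y_lo, y < y_hi) %>%
    count(hour, name = "n")                        # detector counts per hour
  joined <- valid %>%
    left_join(hourly, by = "hour") %>%
    mutate(n = coalesce(n, 0L))                    # hours with no detections count as zero
  if (sum(joined$n) == 0) return(NA_real_)         # nothing detected in this window
  cor(joined$count, joined$n)
}

x_breaks <- seq(0, 150, by = 25)                   # illustrative break points only
y_breaks <- seq(0, 1, by = 0.1)
grid <- expand.grid(x_lo = x_breaks, x_hi = x_breaks,
                    y_lo = y_breaks, y_hi = y_breaks)
grid <- subset(grid, x_lo < x_hi & y_lo < y_hi)    # keep only valid windows
grid$r <- mapply(score_range,
                 x_lo = grid$x_lo, x_hi = grid$x_hi,
                 y_lo = grid$y_lo, y_hi = grid$y_hi,
                 MoreArgs = list(det = det, valid = valid))
grid[which.max(grid$r), ]                          # the best-correlated x/y window
```

The left join treats hours where nothing landed in the window as zero detections, so sparse windows aren't scored only on their busy hours.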
posted by supercres at 5:11 AM on October 6, 2015


you have two sets of observations (A, B). observations in one set (B) are labelled in a certain way. you want to use the labels to define a subset of B, B', to maximise the probability that observations in A and B' are drawn from the same population.

i think.

but is that well defined? what would stop you from reducing B' to the observation closest to the mean value in A? it seems like you also want something that says that you include as many points in B' as possible.

i am pretty sure there would be a "right way" to do this, but i have no idea what it is. are you at a university? i once visited the maths/stats dept where i worked and got help with statistics (in retrospect i should have offered joint authorship and i feel bad that i didn't, so you might consider that).
posted by andrewcooke at 5:32 AM on October 6, 2015


Thought about this more. Take each X-Y pixel and correlate the hourly counts falling in that small X-Y square with the validation set. Then plot the correlations in an X-Y heatmap. Tune the size of the "pixels" to balance reliability of the correlations against X-Y specificity. Or do some kind of 2-D smoothing at a higher resolution.
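
Sketching the per-pixel version in R, with the same hypothetical det and valid data frames as in the earlier snippet and purely illustrative pixel sizes:

```r
# Per-pixel correlation heatmap; pixel sizes are just a starting guess.
library(dplyr)
library(ggplot2)

x_size <- 10
y_size <- 0.05

pix_cor <- function(hours, counts, valid) {
  hourly <- data.frame(hour = hours, n = counts)
  joined <- valid %>%
    left_join(hourly, by = "hour") %>%
    mutate(n = coalesce(n, 0L))                    # hours with nothing in this pixel
  if (sum(joined$n) == 0) return(NA_real_)
  cor(joined$count, joined$n)
}

pix <- det %>%
  mutate(px = (floor(x / x_size) + 0.5) * x_size,  # pixel centres
         py = (floor(y / y_size) + 0.5) * y_size) %>%
  count(px, py, hour, name = "n") %>%
  group_by(px, py) %>%
  summarise(n_total = sum(n),                      # observations in this pixel
            r       = pix_cor(hour, n, valid),     # correlation with manual counts
            .groups = "drop")

ggplot(pix, aes(px, py, fill = r)) +
  geom_tile(width = x_size, height = y_size) +
  scale_fill_viridis_c(name = "correlation")
```

The n_total column records how many observations each pixel contains, which matters when deciding whether a high correlation in a sparse pixel is worth believing.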

I have a feeling that Gaussian process modeling might be the "best" thing to do here, but it also might be overkill. It can turn discrete observations into a model of continuous time series, and can also turn discrete spatial observations into a modeled continuous spatial map.
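
As a rough sketch of the 2-D smoothing idea, a thin-plate spline via mgcv is a lightweight stand-in for full Gaussian-process modelling; this reuses the hypothetical pix table and pixel sizes from the snippet above:

```r
# Smooth the per-pixel correlation surface, weighting pixels by how many
# observations they hold; a thin-plate spline stands in for a full GP here.
library(mgcv)

fit <- gam(r ~ s(px, py), data = pix, weights = n_total)   # pixels with NA r are dropped;
                                                            # adjust k in s() for small grids
pix$r_smooth <- as.numeric(predict(fit, newdata = pix))

ggplot(pix, aes(px, py, fill = r_smooth)) +
  geom_tile(width = x_size, height = y_size) +
  scale_fill_viridis_c(name = "smoothed correlation")
```

If the smoothed surface shows one clear bump, its extent is a reasonable first guess at the pollen zone.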
posted by supercres at 8:24 AM on October 6, 2015 [2 favorites]


My thoughts were along the same lines as the heat map suggested by supercres.

I hope you do your own programming. With a subroutine to calculate the correlations, the program isn't more than about 20 lines (exclusive of reading the data in and writing the results in a useful manner).

You want to know both the correlation and the number of observations in each box. This approach will tell you both where your data is and where the sweet spot is.
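
For example, with the hypothetical per-pixel table from supercres's heatmap sketch above (pix, with columns px, py, n_total and r), you can look at both side by side:

```r
# Inspect the sweet spot and the data density together.
library(dplyr)

pix %>% arrange(desc(r))       %>% head(10)   # pixels with the strongest correlation
pix %>% arrange(desc(n_total)) %>% head(10)   # pixels holding the most observations
```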

You need to be careful of methods that search for the good results and drop the bad.
posted by SemiSalt at 10:15 AM on October 6, 2015 [1 favorite]


This seems basically like a prediction problem with feature selection. The brute-force approach is essentially what people do to build predictors all the time, with some heuristics added to avoid having to test literally every combination of features. (There may be some clever techniques for quickly finding spatially-adjacent blocks of predictors, maybe from image analysis and/or the fMRI literature? Not my field, though, so I couldn't tell you much off the top of my head.)
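
One off-the-shelf version of that framing, purely as a sketch and subject to the cross-validation caveat below, is to treat each x/y pixel's hourly counts as a candidate predictor and let the lasso choose which pixels help predict the manual counts. This assumes the same hypothetical det and valid data frames as the snippets above:

```r
# Lasso feature selection over pixels: one column per x/y pixel, one row per hour.
library(dplyr)
library(tidyr)
library(glmnet)

wide <- det %>%
  mutate(pixel = paste(floor(x / 10), floor(y / 0.05), sep = "_")) %>%  # illustrative pixel size
  count(hour, pixel, name = "n") %>%
  pivot_wider(names_from = pixel, values_from = n, values_fill = 0)

dat <- valid %>%
  left_join(wide, by = "hour") %>%
  mutate(across(-c(hour, count), ~ coalesce(.x, 0)))   # hours with no detections at all

X   <- as.matrix(select(dat, -hour, -count))
cvf <- cv.glmnet(X, dat$count, alpha = 1)              # lasso with (random-fold) CV
coef(cvf, s = "lambda.1se")                            # nonzero rows = "selected" pixels
```

Pixels whose coefficients survive the penalty are the ones the model considers informative about the pollen counts; note that cv.glmnet's folds are random by default, which runs straight into the time-dependence issue below.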

The potential pitfall in this situation, though, is overfitting to your validation data: you want your predictions to be generalizable, and you want to report your performance accurately. The complication is that you have time series data, so the individual observations are correlated with each other. That means ordinary cross-validation (a standard way of evaluating predictor/classifier performance: divide your validation set into k bins, train the model on, say, k-1 of the bins, evaluate performance on the remaining one, then repeat so you get a prediction for each bin) could fail, because the "training" and "test" sets wouldn't be independent. I've never had to do cross-validation on time series data personally, but searching those terms does pull up some recent papers, so you might get a few ideas there.
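
For what it's worth, a simple variant that respects the time ordering is blocked cross-validation, where each fold is a contiguous stretch of hours. A rough sketch, assuming a hypothetical hourly data frame joined (ordered by time) with the manual counts in a column called count and the new detector's counts for one candidate x/y subset in a column called n:

```r
# Blocked cross-validation: each fold is a contiguous stretch of the series,
# so test hours are not interleaved with the hours used for training.
k      <- 5
blocks <- cut(seq_len(nrow(joined)), breaks = k, labels = FALSE)

cv_r <- sapply(seq_len(k), function(b) {
  train <- joined[blocks != b, ]
  test  <- joined[blocks == b, ]
  fit   <- lm(count ~ n, data = train)           # simple linear calibration
  cor(test$count, predict(fit, newdata = test))  # out-of-block performance
})
mean(cv_r)
```

A stricter version leaves a gap of a few hours between each training and test block, since hours just either side of a boundary are still correlated.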

(I'm assuming that the experiment is enough of a pain that you couldn't easily gather another, independent validation dataset and use that one to test your performance, but that would of course be the ideal solution.)
posted by en forme de poire at 11:16 AM on October 6, 2015


Response by poster: I do indeed do my own programming, I'm quite adept in R, and I think the heatmap idea is probably the most reasonable - although I do worry there might not be enough observations in a given bin to establish a good time series correlation. On the other hand, this is an ongoing dataset, so I can let it run as long as I want to gather enough data.
posted by Jimbob at 4:42 PM on October 6, 2015


This thread is closed to new comments.