Help me model some curves that have a certain time profile.
September 9, 2014 7:39 AM

I've got a bunch of curves that (I hope) show some common profile over time, although they may be scaled versions of each other: at any index along the curve after the first, the next point is conditional on all (or some) of the previous points. See this greatly simplified example.

The units on the X axis are just time increments after the first observation, like n+1,n+2. So we can't use the X axis values to train. That is, the first point in the curve could come at X=1 or X=100. But its units are consistent between curves. Anyway, each curve is just a 1-d list of single real numbers.

Anyway, I need to train some model by feeding it these training curves and telling it which point is "Go!". Then I want to feed in new, unseen lines, one value at a time, and have the model say "Ready?" when "Go!" is going to be the next point.

It sounds like a posterior predictive distribution does something like this, but I don't have time to research it now. I have 24 hours. Really I need a library (like scikit-learn) to just fit to the training values, take some test values and say if "Go!" will be next or not. Can someone advise on what I need?
posted by ed\26h to Science & Nature (8 answers total)
This isn't a full answer but it might help. If the different series are scaled linearly, then you can simply multiply each series' y values by the inverse of that series' maximum y value. Assuming each series was linearly scaled, they should all be identical after the scaling operation (IIRC). Then the trick would be to find the correct scaling factor of the new function, which should occur at the same x value as the training series, if I've understood the problem correctly.
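A minimal sketch of that scaling step, assuming each training curve is a plain Python list and its maximum is non-zero. The function name and the two toy curves are made up for illustration:

```python
def scale_by_max(series):
    """Divide every value by the series maximum so the peak becomes 1.0."""
    peak = max(series)
    if peak == 0:
        raise ValueError("cannot scale a series whose maximum is zero")
    return [value / peak for value in series]


# Two hypothetical curves; the second is the first multiplied by 4.
curve_a = [1.0, 2.0, 4.0, 3.0]
curve_b = [4.0, 8.0, 16.0, 12.0]

scaled_a = scale_by_max(curve_a)
scaled_b = scale_by_max(curve_b)
```

After scaling, the two toy curves are identical. This only works on complete curves, since the maximum has to be known before anything can be divided by it.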
posted by forforf at 10:14 AM on September 9, 2014

Response by poster: Thanks, forforf,

Scaling would be good. But I can't know in advance what to scale each one by. And I'm only getting one point at a time.

Anyway, I've now got some actual plots which might help explain.

I've got lots of curves like the blue one. In practically all of them the highest point yet observed occurs just before the event shown by the red marker. I want to train a model, telling it where the red marker is each time, on those blue curves. When trained I need to show the model the orange line, one point at a time, starting from the left. When it gets to the pre-event max, I want it to infer, based on it being the max so far, and on the previous points and the relations between them, that the event is about to happen, and flag that up.

It's basically a machine learning classification task. The classifier decides if the point it's being shown is the kind of value that suggests an event will happen. Unlike a normal task, though, it will have to take into account all the points it has been shown prior to the current one; otherwise it can't know from just one number.

Hope that's a bit clearer.
posted by ed\26h at 10:54 AM on September 9, 2014

I wonder how far you can get without full-on machine learning.

In the example that you gave, let's imagine that I just look at the relative change in Y. At the third point, it plummets. We can detect this "plummet" by constantly monitoring percentage change in Y, and setting a threshold. Then we can predict that "Go!" will happen roughly three time steps later.

Together with appropriate normalization, and maybe a bit of time-averaging (smoothing), this approach may or may not work - it depends on the complexity of the underlying model. The advantage is that it should be very quick to code up, compared to heavyweight nonlinear regression.
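A rough sketch of that monitoring loop, assuming the curve is a list of numbers. The -30% threshold, the three-step lag and the toy curve are guesses for illustration and would need tuning on real data:

```python
def first_plummet(series, threshold=-0.3):
    """Return the index of the first point whose relative change from the
    previous point is at or below `threshold`, or None if there is none."""
    for i in range(1, len(series)):
        previous = series[i - 1]
        if previous == 0:
            continue  # relative change is undefined when the previous value is 0
        change = (series[i] - previous) / abs(previous)
        if change <= threshold:
            return i
    return None


LEAD_STEPS = 3  # assumed lag between the plummet and "Go!"

curve = [10.0, 11.0, 6.0, 6.5, 7.0, 9.0]  # hypothetical curve
plummet_at = first_plummet(curve)
predicted_go = None if plummet_at is None else plummet_at + LEAD_STEPS
```

Because the loop only ever looks at the current and previous point, it can be run one value at a time as new data arrives.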
posted by liliillliil at 11:41 AM on September 9, 2014

Response by poster: To be honest I don't at all mind serious machine learning. I just don't have time to code it. If there's a pre-existing library, I can use that, no problem.
posted by ed\26h at 11:45 AM on September 9, 2014

Best answer: I agree that you could use some kind of binary classifier on each data point, where class 0 is all the points *not* immediately before an event, and class 1 is all points immediately before an event.

The hard part here is deciding what features to pass to the classifier. If you can keep track of overall variance in the time series, you could train a classifier based on how much of an outlier a given point is. You might also want to use "number of data points seen" as a feature, since the first few data points will always be "outliers".
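One possible way to build such features, sketched with the standard library only. The particular feature set (points seen so far, z-score against the earlier points, and a new-maximum flag), the function names and the toy curve are all assumptions rather than a recommendation:

```python
import math


def streaming_features(series):
    """Yield one feature row per point, using only values seen before it."""
    count, mean, m2 = 0, 0.0, 0.0  # Welford running mean / variance state
    running_max = -math.inf
    for value in series:
        # How much of an outlier is this point relative to the earlier ones?
        std = math.sqrt(m2 / (count - 1)) if count >= 2 else 0.0
        z_score = (value - mean) / std if std > 0 else 0.0
        is_new_max = 1.0 if value > running_max else 0.0
        yield [float(count), z_score, is_new_max]
        # Fold the current point into the running statistics.
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        running_max = max(running_max, value)


def label_points(length, go_index):
    """Class 1 for the point immediately before "Go!", class 0 otherwise."""
    return [1 if i == go_index - 1 else 0 for i in range(length)]


curve = [1.0, 2.0, 3.0, 6.0, 2.0]  # hypothetical curve, "Go!" at index 4
rows = list(streaming_features(curve))
labels = label_points(len(curve), go_index=4)
```

Rows and labels in this shape, stacked across all training curves, could then be handed to a scikit-learn classifier. Every feature depends only on earlier points, so the same function works on a new curve fed in one value at a time.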

You already know of an appropriate library (scikit-learn). What you are really asking about is how to set up your problem, and that's not something a library can do for you. You have to make some choices about how you are going to set this problem up first, and what your data representation will be. Then scikit-learn should have the tools you need to crunch the numbers fairly quickly.

You also need to consider the effects of false positives / negatives on your outcome. How serious is it if you classify an "empty" data point as "eventful", and vice versa? No classifier is perfect, and each one will exhibit errors in different ways. You need to know what is appropriate for your use case before you go building your machine learning algorithm.
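To make that trade-off concrete, here is a small sketch that counts the two kinds of error on held-out data. The label lists are invented for illustration, with 1 meaning "immediately before an event":

```python
def confusion_counts(true_labels, predicted_labels):
    """Return (true_pos, false_pos, false_neg, true_neg) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == 1 and truth == 1:
            tp += 1
        elif guess == 1 and truth == 0:
            fp += 1  # flagged an "empty" point as "eventful"
        elif guess == 0 and truth == 1:
            fn += 1  # missed a real pre-event point
        else:
            tn += 1
    return tp, fp, fn, tn


true_labels = [0, 0, 0, 1, 0, 0, 1, 0]       # hypothetical ground truth
predicted_labels = [0, 1, 0, 1, 0, 0, 0, 0]  # hypothetical classifier output

tp, fp, fn, tn = confusion_counts(true_labels, predicted_labels)
precision = tp / (tp + fp) if tp + fp else 0.0  # share of flags that were real
recall = tp / (tp + fn) if tp + fn else 0.0     # share of events that were caught
```

If a missed event is costlier than a false alarm, recall is the number to watch, and the other way round for precision.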
posted by grog at 2:44 PM on September 9, 2014

Best answer: This is an interesting problem. The fact that the data are time-ordered suggests that you might consider some form of hierarchical model with observations nested within subject (e.g., participant, stock, or machine). Furthermore, it sounds like the discontinuity you expect to observe in each curve makes this a problem that is well suited to some form of spline model.

I know there are already many implementations in R (and Python, too). But since it seems from your reference to posterior predictive distributions that you're familiar with Bayesian methods, I would just point out that this kind of model is not too difficult to code up in JAGS or Stan.

Some form of hierarchical model that explicitly accounts for the time dependency seems essential. And it sounds from your description that you're trying to estimate the location of the "knot", so some variety of nonlinear spline model might be right for this problem.
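As a much simpler stand-in for a Bayesian spline model, the knot in a single complete curve can be estimated by brute force: try every split point, fit a separate least-squares line to each side, and keep the split with the smallest total squared error. This is a plain two-segment fit, not a hierarchical model, and the function names and toy curve are made up for illustration:

```python
def line_sse(xs, ys):
    """Sum of squared residuals of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def best_knot(ys, min_segment=2):
    """Return the index at which the second fitted segment should start."""
    xs = list(range(len(ys)))
    best_index, best_error = None, float("inf")
    for k in range(min_segment, len(ys) - min_segment + 1):
        error = line_sse(xs[:k], ys[:k]) + line_sse(xs[k:], ys[k:])
        if error < best_error:
            best_index, best_error = k, error
    return best_index


curve = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 1.0, 0.0]  # hypothetical curve
knot = best_knot(curve)
```

Knot locations found this way on the training curves could serve as rough prior information about where the break tends to fall.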

I suspect that a lot is going to depend on the number of time series you have, and the amount of a priori information you have about the knot location. A Bayesian hierarchical model could handle this situation very nicely, I think. It's an interesting problem, indeed.
posted by grisha at 6:11 PM on September 9, 2014

This may be too simple for your purposes, but could you do something like a while loop in R? If y is the value at time x:

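x <- 2  # R vectors are 1-indexed, so start at the second point
while (x < length(y) && y[x] > y[x-1])  # keep climbing while the curve rises
x <- x + 1
if (y[x] < y[x-1])  # the curve has turned down, so x-1 was a peak
print(c(x-1, y[x-1]))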

That will keep moving back along x until it hits a peak, and print that peak once it's found. It may not be the highest peak, if it's a jagged descent, but you could change the 4th line to fix that.
posted by MrBobinski at 6:34 PM on September 9, 2014

Response by poster: Thanks all. These are some useful insights. Really I had a very short time to do this task and I'd hoped it would be a well-known problem (perhaps in signal processing) and so there would be some library to do all the work for me. Not to be.

Still, stimulating, as you say.
posted by ed\26h at 12:27 PM on September 10, 2014
