Help me calculate a hidden value in my data
April 3, 2006 6:18 PM Subscribe
I've spent the past day or so trying to figure out how to calculate a hidden value in my data. It varies linearly with time, or closely enough that we can make that assumption within any given dataset. Where this goes beyond a simple linear regression is that each datapoint is known to be above or below the hidden value. The data is collected in a C# program and processed in Excel; an Excel function would be the ideal solution.
A chart makes the problem clear:
The Y axis is the time the reading was collected, the X axis is the value of the reading. We need the slope and offset of the narrow empty band running between the two datasets. This data is fairly clean; we ideally want something that will work even if the readings aren't clustered right up along the hidden value.
It's easy to eyeball the answer, but we need to have this calculated automatically. (Technically, we need to cancel it out, so that the chart above would follow a straight vertical line.) We've gotten passable results by doing a linear regression across the entire dataset, but it's possible to get big outliers which would unacceptably skew the results. We've considered trying to filter out the outliers as well, but I'm convinced that this can be calculated directly.
Any statistics gurus out there?
A chart makes the problem clear:
The Y axis is the time the reading was collected, the X axis is the value of the reading. We need the slope and offset of the narrow empty band running between the two datasets. This data is fairly clean; we ideally want something that will work even if the readings aren't clustered right up along the hidden value.
It's easy to eyeball the answer, but we need to have this calculated automatically. (Technically, we need to cancel it out, so that the chart above would follow a straight vertical line.) We've gotten passable results by doing a linear regression across the entire dataset, but it's possible to get big outliers which would unacceptably skew the results. We've considered trying to filter out the outliers as well, but I'm convinced that this can be calculated directly.
Any statistics gurus out there?
There are no excel formulas that i'm aware of that are going to give a better solution than a least-squares line - which is your regression line, as you said. So really your only problem is what to do with your outliers. I agree with unSane - you say that your data are clean, but really they're pretty messy even though there is an obvious pattern there.
Maybe you could give us more information about the outliers - how you know when you have one, for example, then a formula could be written to identify them when they come along. If you get rid of them your regression line should fall right down that empty canyon between the two datasets.
posted by crapples at 6:34 PM on April 3, 2006
Maybe you could give us more information about the outliers - how you know when you have one, for example, then a formula could be written to identify them when they come along. If you get rid of them your regression line should fall right down that empty canyon between the two datasets.
posted by crapples at 6:34 PM on April 3, 2006
At first glance, the X range on the graph appears to be only 40, whereas the Y range is 100k or so. Without types to match values to, could be messy or (relatively) tight.....
as folks have said above, a least squares will get you about as good as you can. That said, not knowing what the data / application is / how many points are contained in that narrow band, you could look at either prespecifying or calculating a running sigma such that you could elim anything over 2-3 sigmas from the expected mean (calling it variance due to special causes, etc)
its late and i'm a bit tired, but post back if you'd like more of an example.
posted by gage at 6:47 PM on April 3, 2006
as folks have said above, a least squares will get you about as good as you can. That said, not knowing what the data / application is / how many points are contained in that narrow band, you could look at either prespecifying or calculating a running sigma such that you could elim anything over 2-3 sigmas from the expected mean (calling it variance due to special causes, etc)
its late and i'm a bit tired, but post back if you'd like more of an example.
posted by gage at 6:47 PM on April 3, 2006
Response by poster: The problem is we don't really know what our outliers are; we only recognize them as outliers by their distance from the mystery line. And this data is much cleaner that what we expect to get in the field.
(Full disclosure: these are timestamps from network packets, and the "slope" is a drifting clock in one of the machines. This data is from a local network, so it is pretty reliable (although you can see what a piece of crap the wireless router is); in the field, this might be in the hundreds of ms, and varying much more wildly. The clock drift itself appears to be reliably linear within the timeframes for our tests, though.)
I'm inclined to go with b1tr0t's solution -- divide the timespan into narrow windows, take the narrowest values from each of those, then do the regression from that.
gage, I would be interested in how we could keep a running sigma.
posted by bjrubble at 6:57 PM on April 3, 2006
(Full disclosure: these are timestamps from network packets, and the "slope" is a drifting clock in one of the machines. This data is from a local network, so it is pretty reliable (although you can see what a piece of crap the wireless router is); in the field, this might be in the hundreds of ms, and varying much more wildly. The clock drift itself appears to be reliably linear within the timeframes for our tests, though.)
I'm inclined to go with b1tr0t's solution -- divide the timespan into narrow windows, take the narrowest values from each of those, then do the regression from that.
gage, I would be interested in how we could keep a running sigma.
posted by bjrubble at 6:57 PM on April 3, 2006
Best answer: This is a classification problem. You are looking for the line that separates the pink and blue dots. Your data is actually amazingly clean as there is no overlap.
There are different possible classifiers to use, but given the way the data looks I think your best bet is to use a linear Support Vector Machine (SVM), no kernels.
While, SVM, with its maximal margin solution strikes me as the best solution. I don't know if you can find it in Excel, (which is a crappy tool for data analysis in my opinion). You can try 2-class logistic regression instead, which might give you similar results, though it would still be susceptible to the kind of outliers you have, where SVM would not be.
posted by blueyellow at 7:16 PM on April 3, 2006
There are different possible classifiers to use, but given the way the data looks I think your best bet is to use a linear Support Vector Machine (SVM), no kernels.
While, SVM, with its maximal margin solution strikes me as the best solution. I don't know if you can find it in Excel, (which is a crappy tool for data analysis in my opinion). You can try 2-class logistic regression instead, which might give you similar results, though it would still be susceptible to the kind of outliers you have, where SVM would not be.
posted by blueyellow at 7:16 PM on April 3, 2006
Weka is a fantastic completely free tool for data mining and analysis that includes SVMs amongst many other machine learning and statistics algorithms.
posted by blueyellow at 7:21 PM on April 3, 2006
posted by blueyellow at 7:21 PM on April 3, 2006
Another very powerful and completely free tool for data analysis is the R language for statistical computing, which also has SVMs and most things you would want to use.
SVMs can also be used if your data is noisier (overlaps between pinks and blues). The nice thing about SVMs is it would automatically only care about the points around the line and ignore things farther away.
posted by blueyellow at 9:23 PM on April 3, 2006
SVMs can also be used if your data is noisier (overlaps between pinks and blues). The nice thing about SVMs is it would automatically only care about the points around the line and ignore things farther away.
posted by blueyellow at 9:23 PM on April 3, 2006
You could always give it to Amazon's Mechanical Turk and pay someone in India a nickel to draw the line for you.
posted by kindall at 9:59 PM on April 3, 2006
posted by kindall at 9:59 PM on April 3, 2006
If you know you have a drifting clock, could you run something like NTPD to get an accurate measurement, and/or lock the clock down?
Network packets are inherently unreliable, so you'd probably do better with a specific algorithm to fix (or at least measure) your clock, and THEN deal with the network.
What you're trying to do here is determine clock skew via osmosis. There are much better ways.
posted by Malor at 11:33 PM on April 3, 2006
Network packets are inherently unreliable, so you'd probably do better with a specific algorithm to fix (or at least measure) your clock, and THEN deal with the network.
What you're trying to do here is determine clock skew via osmosis. There are much better ways.
posted by Malor at 11:33 PM on April 3, 2006
Best answer: I was going to suggest something along the lines of identifying outliers based on moving range between points being over 2.5 or 3 times the average moving range, then using a plain old linear regress to calculate the over/under line.
That said, after looking at SVM (wow!), though, the above method looks pretty darn lame. I must bow down to blueyellow's knowledge.
Example of me drawing a sample of your dataset then using SVM to classify:
and another:
Looks like there's a even a C# port of libsvm here
posted by gage at 9:54 AM on April 4, 2006
That said, after looking at SVM (wow!), though, the above method looks pretty darn lame. I must bow down to blueyellow's knowledge.
Example of me drawing a sample of your dataset then using SVM to classify:
and another:
Looks like there's a even a C# port of libsvm here
posted by gage at 9:54 AM on April 4, 2006
Response by poster: Cool! I have no idea what SVM is, but it looks like it would solve our problem perfectly. Thanks!
posted by bjrubble at 11:32 AM on April 4, 2006
posted by bjrubble at 11:32 AM on April 4, 2006
Very interesting problem. As I understand it you are observing the data points essentially without error and there is an 'empty band' between the two types. If so, I think the following procedure would give the optimal linear estimate and should be doable in excel (but maybe too late to be useful). But I am certainly interested to know 1: if my method makes sense, 2: if it is equivalent to what the SVM stuff would do given a linear relationship and 3: where I can find out more about SVM.
1. On your excel graph randomly pick one blue point and one pink point. These define a line.
2. Find the vertical distance between each point and this line (the OLS residual, but this ain't OLS).
3. Now choose the blue point with the most negative residual and the pink point with the most positive residual. If there is no negative blue residual choose the same point, same thing for positive pink residuals. These two points define a new line
4. Repeat step steps 2 and 3 until points stop changing.
This will quickly 'rotate' the line until it divides all the blue point from the pink ones, and it will pass though a pink point at the top and a blue point at the bottom. Call this a 'limit line'.
Now repeat the procedure but choosing the most positive blue residual and the most negative pink residual. This will rotate the line until it passes through a pink point at the bottom and a blue one at the top. This is the second limit line. The two limit lines form an 'X' shape (in your graph a very skinny X), with data points in two opposite quadrants and nothing in the other two quadrants. The empty quadrants contain the set of all possible 'division lines', lines which keep all the blue points on one side and all the pink ones on the other. The best guess of the actual division line will be the line that passes through the intersection of the limit lines and bisects the empty quadrants.
posted by thrako at 7:43 PM on April 4, 2006
1. On your excel graph randomly pick one blue point and one pink point. These define a line.
2. Find the vertical distance between each point and this line (the OLS residual, but this ain't OLS).
3. Now choose the blue point with the most negative residual and the pink point with the most positive residual. If there is no negative blue residual choose the same point, same thing for positive pink residuals. These two points define a new line
4. Repeat step steps 2 and 3 until points stop changing.
This will quickly 'rotate' the line until it divides all the blue point from the pink ones, and it will pass though a pink point at the top and a blue point at the bottom. Call this a 'limit line'.
Now repeat the procedure but choosing the most positive blue residual and the most negative pink residual. This will rotate the line until it passes through a pink point at the bottom and a blue one at the top. This is the second limit line. The two limit lines form an 'X' shape (in your graph a very skinny X), with data points in two opposite quadrants and nothing in the other two quadrants. The empty quadrants contain the set of all possible 'division lines', lines which keep all the blue points on one side and all the pink ones on the other. The best guess of the actual division line will be the line that passes through the intersection of the limit lines and bisects the empty quadrants.
posted by thrako at 7:43 PM on April 4, 2006
This thread is closed to new comments.
posted by unSane at 6:23 PM on April 3, 2006