Standard Deviant Behavior
April 18, 2007 7:09 AM
IANAS. How can I average two standard deviation values?
Long story short: I have sampled periods for which I have a mean and a standard deviation. Some of the periods are short, so I'd like to combine adjacent periods to improve the stats. The problem is, the means from the periods may not match up at all, while the sigmas should match very well. (These are objects that can move around, so their absolute position can change by a large amount but their distribution around a given position is fairly constant.)
My gut (and my genius [seriously] boss) both say that practically speaking I can just average the two sigmas and get a reasonable value. Actually, I'd probably want to weight it so the current sigma is favored. But just hard-coding a weighting wouldn't distinguish between these two cases:
Case 1: The current period has 3 samples and the previous one has 100.
Case 2: The current period has 40 samples and the previous one has 60.
In the first case, I'd like the previous period to weigh more since the current period is useless. In the second case, I'd like the current period to weigh more since it is pretty good and much more recent.
So my next thought was to weight the individual measurements, say by using the first one once, the second one twice, the third 3 times, or whatever. (I guess the "sigma" would be calculated by using each sample's difference from the mean of the period it came from....) However, I'm pretty far out on a limb here, since I have no statistical training whatsoever. How can I do this simply and without causing an aneurysm in any future employees who have a math degree?
More information:
- While I do have access to the original measurement values, I'd prefer a solution that just lets me use the sigma itself (possibly coupled with a count of the number of samples behind it) in a simpler expression.
- I have a plan to totally revamp the sampling period problem, but that has to wait until the next version of the software. This version has to do the mean/sigma thing.
Response by poster: Hot diggety awesome!
This is kind of a secondary effect, but what about the issue of time decay? That is, the sigma will change over time (though possibly too slowly for me to get as many samples as I need), so I'd like recent data, even if it is somewhat less numerous, to be weighted more heavily. Add some coefficients on (n1 - 1)s and subtract them from the denominator? Or maybe just bump n1 down a bit and n2 up a nudge. (You can see we play fast and loose with stats in order to lie more easily.)
Also, I assume it's generalizable in the obvious way. n1 + n2 + ... + nN - N and so forth, in case even the previous period is too short?
posted by DU at 7:43 AM on April 18, 2007
For the latter question, yes, that's how you can pool more than one sample. For the former, I think that goes beyond any scope that can be addressed in an elementary way. Unless you know how the standard deviations are changing over time - and that's an inferential problem in its own right - then using coefficients to weight the recent trials would really be nothing more than guessing. Unless you have strong evidence that the standard deviations are changing rapidly, I would be inclined not to try weighting them. An alternative approach would be to try using intervals and interval arithmetic rather than a point estimate, but that would significantly change your programming.
posted by Wolfdog at 8:01 AM on April 18, 2007
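A minimal sketch of the "n1 + n2 + ... + nN - N" generalization confirmed above, assuming each period is summarized as a (count, sigma) pair. The function name pooled_std_many and the input layout are made up for illustration; they are not from the thread.

```python
import math

def pooled_std_many(periods):
    """Pool any number of periods, each given as a (n, s) pair.

    n is the sample count of the period, s its standard deviation.
    """
    periods = list(periods)
    # Denominator: n1 + n2 + ... + nN - N (total degrees of freedom).
    dof = sum(n - 1 for n, _ in periods)
    if dof <= 0:
        raise ValueError("need more samples than periods to pool")
    # Numerator: each variance weighted by its own (n - 1).
    weighted = sum((n - 1) * s ** 2 for n, s in periods)
    return math.sqrt(weighted / dof)

# Hypothetical sigmas; only the counts echo the question.
three_periods = pooled_std_many([(3, 2.0), (5, 1.0), (100, 1.0)])
```

With exactly two periods this reduces to the two-sample pooled formula, so the same function could serve both cases.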
Response by poster: I think for the most part the sample periods are shorter than the "lifetime" of a sigma, but there are probably counterexamples in the database. I guess what I'll do is put a time limit on it so if the previous sample reaches back too far, I'll either not use it or only use part of it.
Thanks again!
posted by DU at 8:07 AM on April 18, 2007
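One way the time-limit idea might look in code: drop any period older than a cutoff, then pool whatever remains. This is only a sketch. The (age, n, s) layout, the max_age parameter, and the function name are assumptions, and it implements only the "not use it" option, since using part of a period would need the original measurements.

```python
import math

def pooled_std_recent(periods, max_age):
    """Pool only the periods whose age is within max_age.

    periods: list of (age, n, s) tuples; age uses the same (hypothetical)
    time unit as max_age, n is the sample count, s the standard deviation.
    """
    # Keep the periods that do not reach back too far.
    recent = [(n, s) for age, n, s in periods if age <= max_age]
    dof = sum(n - 1 for n, _ in recent)
    if dof <= 0:
        raise ValueError("not enough recent samples to estimate sigma")
    return math.sqrt(sum((n - 1) * s ** 2 for n, s in recent) / dof)
```

A hard cutoff avoids choosing decay coefficients, which would otherwise be a guess without knowing how sigma actually drifts.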
Sounds a little like a Kalman Filter could be helpful.
posted by Chuckles at 7:37 PM on April 19, 2007
The "pooled" standard deviation is
sp = Sqrt( ( (n1-1)*s1^2 + (n2-1)*s2^2 ) / (n1+n2-2) )
This will weight them appropriately.
posted by Wolfdog at 7:22 AM on April 18, 2007
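For illustration, the pooled formula can be sketched in a few lines of Python using only the counts and sigmas, which is what the question asked for. The function name pooled_std and the sigma values 2.0 and 1.0 are invented for the example; only the sample counts come from the two cases in the question.

```python
import math

def pooled_std(n1, s1, n2, s2):
    """Pooled standard deviation of two periods from counts and sigmas."""
    if n1 < 1 or n2 < 1 or n1 + n2 <= 2:
        raise ValueError("need at least 3 samples in total")
    # Each variance is weighted by its degrees of freedom (n - 1).
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)

# Case 1: 3 current samples (sigma 2.0), 100 previous samples (sigma 1.0).
case1 = pooled_std(3, 2.0, 100, 1.0)
# Case 2: 40 current samples (sigma 2.0), 60 previous samples (sigma 1.0).
case2 = pooled_std(40, 2.0, 60, 1.0)
```

With these made-up sigmas, Case 1 comes out near 1.03, close to the previous period's 1.0, while Case 2 comes out near 1.48, so the sample counts do the weighting without any hard-coded coefficient.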