Looking for a smooth function from noisy observations
May 14, 2008 3:46 PM

Statsfilter: I have a bunch of noisy measurements. Each has an (x,y) coordinate and a "score" for that location. Most of the scores are trustworthy, but there is the occasional outlier. I want to come up with a function f(x,y) that estimates what the score at (x,y) would be (whether or not I have an observation at exactly that location). I'd like the function to be smooth and resilient to noise. Can someone point me in the right direction?

I'm thinking of something like a kernel density estimator that takes the score into account, but I don't know how to make that work since it estimates density rather than some other value.
posted by Sockpuppet The First to Science & Nature (9 answers total) 3 users marked this as a favorite
Best answer: Simple option: Inverse-distance weighting (IDW). The predicted score at each (x,y) is an average of the scores of the surrounding data points, weighted by the inverse of their distance. There are a few parameters to fiddle with. First, how many surrounding data points should be included for each predicted point? You can either use a set number of points (i.e. the nearest 10) or a set distance (points within 20 units). You can also change the inverse weighting exponent. It's usually 1/distance^2, but you can change it to 1/distance^z, where higher values of z will give you a tighter fit around points, and lower values will give you a smoother surface.

More complex option: Kriging. Generally considered better than IDW in every way.

Also investigate splines.

Note that interpolating data like this is a black art - getting something that's "smooth" and "resilient to noise" is often a difficult task.
posted by Jimbob at 4:04 PM on May 14, 2008 [1 favorite]
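
The inverse-distance weighting described in the answer above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `idw_predict`, the `(x, y, score)` tuple layout, and the `power`, `k`, and `eps` parameters are made up here, not taken from any library.

```python
import math

def idw_predict(points, x, y, power=2.0, k=None, eps=1e-12):
    """Inverse-distance-weighted estimate of the score at (x, y).

    points: list of (px, py, score) tuples (hypothetical layout).
    power:  the exponent z in 1/distance^z.
    k:      use only the k nearest points (None means use all of them).
    """
    # Distance from the query location to every observation, nearest first.
    dists = sorted(
        ((math.hypot(px - x, py - y), score) for px, py, score in points),
        key=lambda pair: pair[0],
    )
    if k is not None:
        dists = dists[:k]

    num = 0.0
    den = 0.0
    for d, score in dists:
        if d < eps:
            # The query sits on an observation: IDW returns that score as-is.
            return score
        w = 1.0 / d ** power
        num += w * score
        den += w
    return num / den
```

For example, `idw_predict([(0, 0, 1.0), (2, 0, 3.0)], 1, 0)` averages the two scores with equal weight. Note that plain IDW reproduces each observed score at its own location, so an outlier still shows up as a local spike; a lower `power` or a larger `k` only softens its influence on nearby predictions.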

(I should say that Kriging is useful because it can give you information on how good the fit is, and what sort of spatial autocorrelation is present in your data.)
posted by Jimbob at 4:06 PM on May 14, 2008

Oh, a final comment from experience. These methods tend to be pretty dreadful at extrapolating data outside the range of (x,y) values from your data. The "borders" of your space need to be pretty well covered by sample points. If your data ranges from (0,0) to (10,10), don't bother trying to predict a value at (12,5) or something, because it will be pretty unreliable.
posted by Jimbob at 4:09 PM on May 14, 2008

You could always just try regressing the scores on a polynomial of the coordinates. That is, have x, x^2, x^3, ... , y, y^2, ... , xy, x(y^2), ... etc. as the regressors. The predicted values will be quite smooth, but it may not handle the outliers well.
posted by thrako at 4:27 PM on May 14, 2008
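
The polynomial regression suggested above might look roughly like this in plain Python. It is a sketch under assumptions: the names `poly_terms`, `solve_linear`, and `fit_poly_surface` are hypothetical, the data is assumed to be a list of `(x, y, score)` tuples, and the fit is ordinary least squares solved through the normal equations, which is fine for low degrees but not numerically robust for high ones.

```python
def poly_terms(x, y, degree):
    """All monomials x^i * y^j with i + j <= degree (includes the constant 1)."""
    return [x ** i * y ** j
            for i in range(degree + 1)
            for j in range(degree + 1 - i)]

def solve_linear(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few or degenerate points")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def fit_poly_surface(points, degree=2):
    """Ordinary least squares fit of score on the monomials of (x, y)."""
    rows = [poly_terms(px, py, degree) for px, py, _ in points]
    scores = [s for _, _, s in points]
    n = len(rows[0])
    # Normal equations: (X^T X) coef = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(n)]
    coef = solve_linear(xtx, xty)
    # Return the fitted surface as a function f(x, y).
    return lambda x, y: sum(c * t for c, t in zip(coef, poly_terms(x, y, degree)))
```

Calling `f = fit_poly_surface(points, degree=2)` gives back a smooth `f(x, y)`. Because this minimizes squared error, a single large outlier can still tilt the whole surface, which matches the caveat in the comment above.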

Also, I think you could still do the kernel density thing, especially if the scores must be positive. Just treat the score at a point as the number of observations at that point (a score of 7 means 7 observations stacked on a single spot). Then you should be able to plug it right into a kernel density tool.
posted by thrako at 4:33 PM on May 14, 2008

I admit I have used kernel density to do this before myself. The problem was the lumpiness of the data. Areas with no sample points within the kernel window just came out as big empty spaces, no values could be predicted. Increasing the window size resulted in a surface that was too smooth, missing interesting variations in the data. It will, as with the other methods I've listed above, be a matter of trial and error.
posted by Jimbob at 4:40 PM on May 14, 2008
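
One way to make the kernel idea estimate a score rather than a density is to divide the score-weighted kernel sum by the plain kernel sum, so that areas with more sample points do not automatically get higher values. This is usually called Nadaraya-Watson kernel regression; it is a swapped-in technique, not what anyone in this thread actually ran. The function name `kernel_smooth`, the Gaussian kernel, and the `bandwidth` parameter are assumptions chosen for illustration.

```python
import math

def kernel_smooth(points, x, y, bandwidth=1.0):
    """Gaussian kernel-weighted average of the scores near (x, y).

    points: list of (px, py, score) tuples (hypothetical layout).
    Returns None when no observation carries any usable weight there.
    """
    num = 0.0
    den = 0.0
    for px, py, score in points:
        d2 = (px - x) ** 2 + (py - y) ** 2
        # Weight falls off smoothly with squared distance.
        w = math.exp(-0.5 * d2 / bandwidth ** 2)
        num += w * score
        den += w
    if den == 0.0:
        # Every weight underflowed to zero: no data close enough to predict.
        return None
    return num / den
```

The trade-off described above still applies: a small `bandwidth` leaves gaps and follows individual points (outliers included), while a large one flattens real variation. Since this is a weighted mean, an outlier is diluted by its neighbours rather than ignored.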

Do you have an idea of what the underlying relationship is (from a hypothesis or just plotting it)? Do you care or do you just want a plot? How much risk of over-fitting are you willing to tolerate? Are you planning to delete your "outliers" or are they still data?

If it's just a picture I'd probably slap on a bicubic hermite or something standard like that.
posted by a robot made out of meat at 5:34 PM on May 14, 2008

Response by poster: Hm, it looks like Kriging won't be resilient to noise. From the Wikipedia article:
The kriging estimation honors the actually observed value: \hat{Z}(x_i)=Z(x_i)

Jimbob: your reason for kernel density not working is exactly my problem.

A robot: The underlying relationship is weird and nonlinear, but I don't care and just want a plot. I don't want to overfit so much that one bogus observation would screw up any interpolation near it. Also, this would make my plot less pretty/believable. As far as outliers go, I've deleted anything that's obviously wrong, but I'm sure there are still a number of data points that are sketchy at best.

Can you point me at information on bicubic hermites? It's not something I've ever slapped on anything, nor heard of for that matter :) Googling reveals some mighty technical stuff on Hermites that isn't helping me.
posted by Sockpuppet The First at 11:47 PM on May 14, 2008

If you just want a plot and have access to Mathematica, look up the help-file on InterpolatingFunction. It does a reasonably good job, and it's quick. I haven't bothered looking into what methods it uses.

As for your underlying model, even if the value itself depends in a complicated way on various things, the story doesn't have to end there. For example, if you happened to be working with potentials, you could exploit the properties of Laplace's Equation to do your interpolation.
posted by dsword at 6:56 AM on May 15, 2008
