Best Fit Ellipse?
April 12, 2006 11:58 AM   Subscribe

How can one create a best fit ellipse around a cluster of (x,y) points in a scatter plot within Excel?

I would like to use an equation and not just draw an ellipse around a cluster. Excel can show a linear regression line equation that goes through the data. I'm trying to think of a way to make a part that line the major axis of the ellipse and somehow determine the maximum extents (a & b) that the ellipse has to encompass. Can I do something like a k-means algorithim to determine the centroid of the data cluster and figure out the maximum extent of the data from that center point? Thanks.
posted by geonags to Science & Nature (33 answers total) 1 user marked this as a favorite
 
It might help to know exactly what the ellipse is supposed to represent, in the way a regression line represents the expectation of the dependent variable given the independent variable. Is your ellipse supposed to be some sort of confidence bound for the means of your variables? Or do you just want to minimize the squared distance from the points to the ellipse?
posted by thrako at 12:28 PM on April 12, 2006


My first thought on how to do this is principle component analysis.

First, find the average (let's call them xBar and yBar) coordinates of all the points. This will be the center of the ellipse.

Then, for each point, find the vector [ x - xBar, y - yBar ] and multiply it by its transpose such that the result is a 2x2 matrix. Add up the matrices you create this way by each point into a single 2x2 matrix.

The eigenvectors of this matrix will be the axes of your ellipse. The corresponding eigenvalues will be the lengths of each axis.

Basically, the larger eigenvector shows the magnitude and direction of the maximum variance of the data and the smaller, orthogonal eigenvector shows the magnitude and direction of the minimum variance of the data.

Some of the points will lie outside this ellipse, since it shows their variance, but you can scale it so they all fit inside. Its orientation and proportions should fit the scattered data nicely, though.

No idea on how to do this in Excel.
posted by driveler at 12:28 PM on April 12, 2006


If it is enclosing the data, then there are minimum volume ellipsoid algorithms , which are efficient (convex). Good luck doing it in excel. This seems to be some MATLAB code to do it.

If you want the best "fit" ellipsoid to the data (in a covariance sense) then I second using PCA, which will be faster.
posted by blueyellow at 12:46 PM on April 12, 2006


Maybe an EM type algorithm? The link has a nice applet you can play with.
posted by logicpunk at 12:49 PM on April 12, 2006


My thought? Use some form of regression to find the centroid of the point set, and then use the distance formula and arctangent to do a polar-rectangular conversion on the data using the centroid as your center point, and then do a (maybe? someone check my math) max-value sinusoidal regression on the resultant set. find your max/min vals on the curve, and those should be your two radii for the ellipse.
posted by potch at 12:56 PM on April 12, 2006


for two dimensions, what driveler describes is just a matrix of the correlation coefficients. i suspect excel gives you those. it's probably (i haven't checked, but it has to be something like the following; and easiest to get right just by swapping things until the ellipse looks right)
[ Cxx  Cxy ]
[ Cxy  Cyy ]

posted by andrew cooke at 12:58 PM on April 12, 2006


oh, and the angle of the major axis will be the angle of the "least squares" best fit line through the data.
posted by andrew cooke at 1:05 PM on April 12, 2006


Should be Cyx in the lower left hand corner of andrew's covariance matrix I suspect.

The advice so far sounds pretty good. You definitely need to distinguish between a maximum likelihood estimate of your data (fitting a skewed Gaussian using an argument of 1 to k-means) and a minimum enclosing ellipse (you could solve this problem by forming a constrained optimization problem from your data set).
posted by onalark at 1:06 PM on April 12, 2006


Should be Cyx in the lower left hand corner of andrew's covariance matrix I suspect.

The matrix is symmetric, so Cyx = Cxy.
posted by Elpoca at 1:13 PM on April 12, 2006


Bear in mind that an ellipse can be fully represented by two points, its foci. Therefore, when you draw an ellipse, you are representing two points. The parameters you have control over are the positioning of the two points and, therefore, the distance between them; these parameters should bear some relation to your data set.

If you can't think of a way in which useful information about your data set can be described by specification of two points, don't use an ellipse. It will be confusing without adding anything.

A linear regression line minimizes the sums of distances from the data points to the line, on the assumption that the function that generated the data is somehow linear in nature. You can't do this with an ellipse, because there's no well-defined way to generate the distance from an ellipse to a point. You also can't generate the underlying model that the function that generated your data is ellipsoidal, because for some values of x, the ellipse has two separate points with differing y values - i.e., it's not a function.

Driveler's idea is good, but I would suggest that if you care to present information about variance, that the centroid be plotted with the relevant eigenvectors proceeding from it. This makes inituitive sense as a visual way to represent variance, which an ellipse does not. Driveler's suggestion to scale the ellipse loses information and should be avoided.

An ellipsoid is a 3 dimensional solid. I'm having trouble understanding how formulae for ellipsoids could be useful here.

If it's not clear, I think that what you're trying to do a) isn't possible b) isn't necessary c) may add confusion.
posted by ikkyu2 at 1:14 PM on April 12, 2006


Alright, not to be too pedantic, but you CAN do a least-squares fit of an ellipse to data.

Assume an equation of the form:

ay^2 + bxy + cx + dy + e = x^2

Then do a least-squares fit for the coefficients a-e to your x and y values. This will try to draw an ellpise through all your data points. You can also form the total least squares solution (minimum geometric distance from each point to the ellipse) if you take the SVD. The above equation solves a general conic, so if your data does not look like the sampling around an ellipse, this might fit something completely different.

And yes covariance matrices are symmetric, sorry it's been a long day.
posted by onalark at 1:22 PM on April 12, 2006


Thank you all for some great suggestions and insights. Here is what I decided to do:

I created a column from 0 to 360 with 0.5 increments.

I then used these two equations to generate an ellipse parametrically...

x = a cos (@ * pi/180) + h sin (@ * pi/180) + x0
y = b sin (@ * pi/180) + h sin (@ * pi/180) + y0

where a is the major axis distance from the center
b is the minor axis distance from the center
h is the slope

I used the slope from the from the equation of the linear regression line through the data.

x0 and y0 are the average of all the coordinates of the data

I graphed my data set, did a regression line to find the slope, figured out the average center of the data, and graphed the ellipse data (used a line to connect all the points)

Then I just fudged around with the a and b values until the ellipse encompassed the data. I guess I'm just going to go for something illustrative and not really that quantitative, since I can't figure out any way else to do it in excel. Are there any open source programs that do this sort of thing?

Thanks again to everyone!
posted by geonags at 1:32 PM on April 12, 2006


I'm sorry, but this requires some clarification:

If you can't think of a way in which useful information about your data set can be described by specification of two points, don't use an ellipse. It will be confusing without adding anything.

If the goal of the ellipse fit is to use it for visualisation, then the foci may be unimportant, but the ellipse is.

A linear regression line minimizes the sums of distances from the data points to the line, on the assumption that the function that generated the data is somehow linear in nature. You can't do this with an ellipse, because there's no well-defined way to generate the distance from an ellipse to a point. You also can't generate the underlying model that the function that generated your data is ellipsoidal, because for some values of x, the ellipse has two separate points with differing y values - i.e., it's not a function.

An ellipse is most definitely a function, just not a single-valued one. And, as pointed out by onalark, an ellipse can definitely be fit by (non-linear) least-squares: i.e., regression!

Are there any open source programs that do this sort of thing?

If you have access to Octave (difficult to install, I believe), then there are likely many Octave-compatible MATLAB scripts that will perform a least-squares ellipse fit. Hardly the easiest solution, unfortunately...
posted by Elpoca at 1:43 PM on April 12, 2006


ay^2 + bxy + cx + dy + e = x^2

Then do a least-squares fit for the coefficients a-e to your x and y values


confused - i assume you're saying that for each (x,y) you calculate something, square and sum those values, and finally you vary a-e to minimise that result.

but what is that something that you calculate? is it ay^2 + bxy + cx + dy + e - x^2?
posted by andrew cooke at 1:59 PM on April 12, 2006


Yes, andrew, as Elpoca clarified, I'm talking about non-linear least squares.

The exact calculations are:

(Least-Squares fit minimizing y-distance)

A*C = B

[y.^2 x.y x y 1]*[C] = [x.^2]

Where A is a N*5 matrix, C is 5*1 (coefficients), and B is N*1. N is the number of datapoints (hopefully more than 5, hopefully more than 5 dependent rows in A).

You can use the normal equations, QR, or an SVD. The advantage of the SVD is that it can be used to form the total least-squares fit from this data.
posted by onalark at 2:09 PM on April 12, 2006


can you give a geometric description of what that's doing? for a least squares straight line fit, i know that i'm minimizing the sum of the squares of the perpendicular distance of each point from the line.

but i have no idea what the geometric interpretation for the expression you give is (i'm pretty sure it's not the perpendicular distance from a point to the nearest of the two possible points on the ellipse, which would be something like the linear case).

alternatively, and kind-of equivalent, the least squares "best fit line" maximises the likelihood that my prediction of y, given x, is correct, assuming that the errors in y have a normal distribution. and more generally, "least squares" is mathematically equivalent to assuming that something is normally distributed - in this case, what is assumed to be normaly distributed?

in other words - i can make no sense of what you're doing. i can see that you can write a program and get numbers out the end, but i can't see what those numbers mean. can you explain? thanks.
posted by andrew cooke at 2:21 PM on April 12, 2006


The advantage of the SVD is that it can be used to form the total least-squares fit from this data.

The advantage of the normal equations is that they don't require an SVD! ;-)

You could also use the following equation, which would directly give more meaningful parameters, but, being an implicit model, is more difficult to do least-squares with:

(x-x0)^2/a^2 + (y-y0)^2/b^2 = 1

This model would, I believe, minimize the tangential distance of the points to the ellipse, not just the y-distance.
posted by Elpoca at 2:24 PM on April 12, 2006


sorry, i'm slightly inconsistent above. if we take the simplest case of errors only in y, then it's the vertical distance (not the perpendicular) (this doesn't really change anything; i just don't want to get lost in side-details due to a small mistake).
posted by andrew cooke at 2:26 PM on April 12, 2006


i can see what elpoca is doing; i don't think onalark's process will give the same ellipse, will it?
posted by andrew cooke at 2:31 PM on April 12, 2006


andrew, you may be confused in that once I've solved the function for the coefficients, to plot an ellipse I either change to radial coordinates and evaluate from [0,2pi] or in Cartesian coordinates, plot twice. And it minimizes the sum of y-distances from each x-point to the nearest part of the ellipse, just like the linear equivalent.
posted by onalark at 2:35 PM on April 12, 2006


I'm not doing Elpoca's model, but I can do that by taking the total least squares approach, which is explicit. I specifically reformulated the general equation for an ellipse so I could solve it like a linear least squares problem. It's not necessarily the best way to do it, but it is A way.
posted by onalark at 2:41 PM on April 12, 2006


I've always understood the purpose of line-fitting and curve-fitting being explicity to model a relation.

I am having trouble understanding how fitting an ellipse to data could serve this purpose or any other. Would someone be so kind as to explain a plausible purpose to me?
posted by ikkyu2 at 3:02 PM on April 12, 2006


ikkyu2:

Because the boss wants me to fit an ellipse around some data.
posted by geonags at 3:06 PM on April 12, 2006


i think i'm confused because i'm used to thinking of statistics in terms of likelihood maximisation - in elpoca's case maximising the likelihood that x and y are independently drawn from two distributions with variance a^2 and b^2 (which i don't believe is the same as minimizing tangential (or even perpendicular) distances between points and the implied ellipse, but makes some sense).

in other words, i'm used to being able to see some underlying meaning to the process (that the results are "best" in some well-defined way; a bayesian, probabilistic, wat).

i honestly don't know if what you're doing is equivalent or not (well, it's not exactly equivalent to elpoca because your model allows the ellipse to "rotate"), but i guess that if you worked backwards to a maximum likelihood/bayesian standpoint you'd end up seeing that you've assumed a very odd error distribution.

mainly i was just curious, because i think ikkyu2 had a good point - the processes that have been described are not (as far as i can tell) equivalent to "fitting a line". that doesn't mean they're wrong - i was just curious what they were "really" doing (there's a pretty big hint that they're not minimising distance from point to curve because if you've ever done graphics programming you'll know that the hardest part is deciding which part of a convex curve you're intersecting with a line (which will intersect the curve an even number of times); there's nothing in the models above that suggests that the maths is "choosing" between one side of the ellipse and the other. i realise that's a very intuitive argument, but i think it makes sense).

on preview: if you're seriously asking, ikkyu2, then i think the answer is that traditionally, confidence limits are often represented as plotted ellipses (when there are two variables involved). so in more exact terms, what is being fitted is a model of the error distribution (as a bivariate gaussian). the "eclipse" is just a graphical representation of that model. that sounds like a huge mess, but in fact the maths (making the usual assumptions) ends up looking very "suggestive" of "fitting an ellipse".
posted by andrew cooke at 3:18 PM on April 12, 2006


(so it would be better to say "fitting a bivariate gaussian"; the ellipse is actually a contour on that model)
posted by andrew cooke at 3:20 PM on April 12, 2006


a better argument for seeing that this is not minimizing distance squared from points to (elliptical) curves is to see that with a gaussian distribution you will eventually get a point very far away. there's no way to get a point "very far inside" to "balance" that, because you end up leaving the ellipse on the other side....

sorry for the sidetrack geonags - what you have done is actually pretty reasonable imho.
posted by andrew cooke at 3:28 PM on April 12, 2006


In the equation I gave, a and b are the semi-minor and semi-major axes of the fitted ellipse, not the variances of the input x and y co-ordinates.

The results are best purely in a least-squares sense: the sum of the squares of the function's residuals is minimised. No statistical information or assumptions are required for this. However, if the x and y co-ordinate errors are normally distributed, then the least-squares estimate will also be the maximum-likelihood estimate.
posted by Elpoca at 3:52 PM on April 12, 2006


last comment on this, in reply to elpoca.

you can use your formula for two different things:
1 - to fit to a "ring" of points, that trace out an ellipse
2 - to fit to a "scatterplot" of points, where the "central bunch" are surrounded by an ellipse.

(1) is what i would call "fitting an ellipse"; (2) is what i would call "fitting a bivariate gaussian" (and is what the original question was about).

in (1) a and b describe the ellipse directly. in (2) a and b describe the standard error in x and y, and indirectly describe an ellipse that is a contour of the probability distribution that x and y are drawn from.

both of those interpretations (1 and 2) are valid in a least squares, following-the-rules-blindly sense. however, only (2) has a (simple) maximum likelihood model-fitting interpretation.

in particular, your comment "if the x and y co-ordinate errors are normally distributed, then the least-squares estimate will also be the maximum-likelihood estimate" only applies to (2).
posted by andrew cooke at 4:33 PM on April 12, 2006



I've always understood the purpose of line-fitting and curve-fitting being explicity to model a relation.

I am having trouble understanding how fitting an ellipse to data could serve this purpose or any other. Would someone be so kind as to explain a plausible purpose to me?


In my example maths, I was just demonstrating a way to fit a couple observations (say of an interstellar object along a 2-d elliptical path) to figure out what the most likely path is.

If I knew ahead of time that my x measurements were precise, I'd just do least squares. If on the other hand I knew there was error in both, I would do total least squares.

I don't think the curve-fitting algorithm is an ideal method for drawing an ellipse around some arbitrary dataset. As I mentioned before, that particular monster is a constrained optimization problem.
posted by onalark at 5:18 PM on April 12, 2006


AC, onalark, geonags: thanks for your replies.
posted by ikkyu2 at 7:54 PM on April 12, 2006


(1) is what i would call "fitting an ellipse"; (2) is what i would call "fitting a bivariate gaussian" (and is what the original question was about).

You're right. Sorry, I got sidetracked.

So this next comment is off-topic, but....

in particular, your comment "if the x and y co-ordinate errors are normally distributed, then the least-squares estimate will also be the maximum-likelihood estimate" only applies to (2).

No! An ellipse fit to a "ring" of points by least-squares will be the maximum-likelihood estimate, again, provided that the co-ordinate errors are normally distributed.
posted by Elpoca at 8:41 AM on April 13, 2006


here is a recipe that gives the ellipse (or set of ellipses) that i think your boss wants. in fact, there are two recipes; the first uses several columns in your spreadsheet; the second uses less columns but more formulae and is more likely to contain errors because of numerical rounding and because i've made a mistake, but may be easier to write as a reusable formula.

at the end i will explain what the ellipse means.

recipe 1
  1. start with two columns x and y, which are the coordinates of the points in your scatterplot. from those columns, calculate the following variables:
    • sumx = sum(x)
    • sumy = sum(y)
    • sumxx = sum(x*x)
    • sumyy = sum(y*y)
    • sumxy = sum(x*y)
    • n = number of points
  2. from those values, calculate the average (xbar, ybar) and the variance and covariance:
    • xbar = sumx/n
    • ybar = sumy/n
    • varx = sumxx/n
    • vary = sumyy/n
    • covarxy = sumxy/n
  3. generate two new columns, dx and dy where:
    • dx = x-xbar
    • dy = y-ybar
    these should be the same scatterplot as before, but shifted to be about the origin.
  4. calculate the following, which are the same as above, but for the new columns:
    • sumdxdx = sum(dx*dx)
    • sumdydy = sum(dy*dy)
    • sumdxdy = sum(dx*dy)
  5. calculate the angle theta = 0.5 * arctan(2*sumdxdx / (sumdydy*sumdxdx)) which is the angle that the ellipse is "rotated" from the horizontal, and also:
    • c = cos(theta)
    • s = sin(theta)
  6. generate two new columns X and Y (if you can't use capitals change the names!) which should be the same scatterplot, but with the rotation removed:
    • X = c*dx - s*dy
    • Y = s*dx + c*dy
  7. as before, generate the following from the new columns:
    • sumXX = sum(X*X)
    • sumYY = sum(Y*Y)
    • varX = sumXX/n
    • varY = sumYY/n
  8. finally(!) calculate:
    • a = sqrt(varX)
    • b = sqrt(varY)
    these are the lengths of the semi-major and semi-minor axes (the two principal "radii") of the ellipse (which is which depends on your data). the basic ellipse that you want to plot has that size, is centred on xbar, ybar, and is rotated by the angle theta. see the explanation below for what this means and how to generate other (larger) ellipses.
recipe 2
using just the x and y columns, and appropriate formulae above, you can calculate everything using:
  • vardx = varx - xbar*xbar
  • vardy = vary - ybar*ybar
  • covardxdy = covarxy - xbar*ybar
  • varX = c*c*vardx - c*s*covardxdx + s*s*vardy
  • varY = s*s*vardx + c*s*covardxdy + c*c*vardy
explanation
traditional statistics often assumes that noisy data is distributed as a "gaussian" or "normal" distribution (this is justified by the famous "central limit theorem" that says you get this distribution when life is complicated). the process above is equivalent to fitting a model for that distribution when it describes two, correlated variables (x and y). the final values a and b are the "standard deviations" of the underlying, uncorrelated distribution (ie with the rotation removed).
section 15 of numerical recipes describes things in more detail (especially section 15.6).
if your data really do follow this distribution then a and b give the relative sizes of your ellipse. you can scale that ellipse to include any given fraction of the distribution.
disclaimer - i am not 100% certain about this next bit: in particular, the usual ellipse plotted is one that contains 68% of the data (it's a convention), which you would get by multiplying a and b by the square root of 2.3 (see the reference above to numerical recipes).
posted by andrew cooke at 6:44 AM on April 14, 2006


there's a BIG mistake in the expression for theta. the numerator should be sumdxdy and the denominator is a subtraction:
theta = 0.5 * arctan( 2*sumdxdy / (sumdydy - sumdxdx) )
note that it's "y minus x" in the denominator. sorry!
posted by andrew cooke at 7:26 AM on April 14, 2006


« Older Shrimps and rice, so very nice!   |   Fix my cake recipe Newer »
This thread is closed to new comments.