mean and median differences?
September 25, 2006 9:44 PM

In a least squares regression model what does it mean when the median is larger than the mean?

My friend was doing a least squares regression model on the periodic table of elements and created a model that predicted an element's weight based on its atomic number. This was created with the first 26 elements in the table. When she found the mean and median of the atomic weights, she found that one of them was larger by about 0.3 (I forget which).

She ended up concluding that one of them was larger (either mean or median) because the atomic weight increased faster with respect to the atomic number. I know that for a bell curve, if the mean or median is larger than the other, then the distribution is skewed. However, does this have any meaning at all in a least squares regression model? I have taken stats before, and chemistry, and I was skeptical of this. I think it's simply due to variability. Can anyone confirm which is right and include an explanation? I'm not doubting the chemistry here; I just doubt that the difference between the mean and median of the atomic weights merits the aforementioned conclusion.
posted by EvilKenji to Education (13 answers total)
 
What do the mean and median have to do with an OLS model?

I mean, what do you mean by she found the mean and median of atomic weights? You don't need OLS to do this. OLS would be utterly superfluous.
posted by ROU_Xenophobe at 10:25 PM on September 25, 2006


Response by poster: The mean and median don't have much to do with the OLS model, I know.

Perhaps I should elaborate on my friend's assignment. She was to make a least squares model for a set of data. However, one of the questions on her assignment asked her to find the mean and median of the _sample data_ (not the estimated data) and to analyze what this means. If she had chosen sample data such as the heights and weights of a group of people, the data would most likely have been normally distributed, so this question would make sense. However, she chose her data from the atomic weights and atomic numbers of the periodic table, which are not distributed in the same way.


Because of this, I at first told her that the difference she found between the mean and median of the sample atomic weights either meant nothing or was due to variance. My chemistry friends (who have never taken stats) claimed that this difference means that the atomic weight increases faster than the atomic number as you go up the periodic table.
posted by EvilKenji at 10:34 PM on September 25, 2006


If the distribution of atomic weights were symmetric, then the mean and the median would be the same value. A median that is larger than the mean often indicates that the distribution is negatively skewed. That is, there is a longer "tail" on the left side of the distribution, which pulls the mean to the left, leaving the median at a higher value.
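
To make the skew point concrete, here is a minimal Python sketch. The numbers are made up for illustration (they are not atomic weights), and `mean_minus_median` is just a hypothetical helper name.

```python
from statistics import mean, median

# Made-up sample with a long left tail (one unusually small value).
left_skewed = [1, 6, 7, 8, 8, 9, 9, 10]
# Mirror image of the same sample (11 - x), so the tail is on the right.
right_skewed = [11 - x for x in left_skewed]
# A symmetric sample for comparison.
symmetric = [1, 2, 3, 4, 5]


def mean_minus_median(values):
    """Negative suggests a left tail, positive a right tail, zero symmetry."""
    return mean(values) - median(values)


for name, sample in [("left", left_skewed),
                     ("right", right_skewed),
                     ("symmetric", symmetric)]:
    print(name, mean_minus_median(sample))
```

The sign of the difference only hints at the direction of the tail; it says nothing about how the values relate to any other variable.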
posted by naturesgreatestmiracle at 10:43 PM on September 25, 2006


One of the assumptions of the linear regression model is that the underlying data have a normal (Gaussian) distribution. If there is a skew to the data, it no longer follows the bellcurve distribution and the model may not apply.
posted by Blazecock Pileon at 10:54 PM on September 25, 2006


Pearson's second skewness coefficient is the mean minus the median, multiplied by 3/σ. I think computing this may give you better insight into whether the difference is due to variability in the data or to a significant skew.

However, you'll want to Google the coefficient if you want to do any sort of rigorous test with it.
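
As a rough sketch of the coefficient described above, assuming the sample standard deviation is used for σ (that choice, the function name, and the made-up data are mine, not part of the assignment):

```python
from statistics import mean, median, stdev


def pearson_second_skewness(values):
    """Return 3 * (mean - median) / s, with s the sample standard deviation."""
    s = stdev(values)  # assumption: sample (n - 1) form of sigma
    if s == 0:
        raise ValueError("all values are identical; coefficient is undefined")
    return 3 * (mean(values) - median(values)) / s


symmetric = [1, 2, 3, 4, 5]            # made-up, no skew
right_tail = [1, 2, 2, 3, 3, 4, 5, 10]  # made-up, one large value

print(pearson_second_skewness(symmetric))
print(pearson_second_skewness(right_tail))
```

Dividing by the spread is what makes the number comparable across data sets: a gap of 0.3 between mean and median is tiny if the values range over dozens of units.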
posted by edd at 12:31 AM on September 26, 2006


My interpretation is that trivial differences between median and mean can be ignored as variance. Large differences between median and mean are a sign that you should reconsider using statistical methods based on the normal curve, and think about nonparametric statistics or transformation functions. The key here is to make a scatterplot of the data.
posted by KirkJobSluder at 5:14 AM on September 26, 2006


Also, another easy diagnostic plot is to do a local regression of the data (loess) on top of your scatter plot to see whether it behaves linearly throughout the whole range.
posted by grouse at 6:33 AM on September 26, 2006


Blazecock and KirkJobSluder have it, I think. Mean=median is an assumption of linear regression, as others have said. Differences may be due to sampling error or could be systematic. You need to look at a residuals scatterplot and make sure that linear regression is appropriate. If it is ugly (unequal variance, curvature, or significant outliers), you can try to clean it up by using a transformation or the like. If this fails, you should move beyond linear modeling.
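
A residual check can be sketched without any plotting library. This is only an illustration under my own assumptions: the helper names are hypothetical and the data are a made-up curved relationship (y = x^2), not the real atomic weights.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Observed value minus fitted value for every point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


xs = list(range(11))
curved = [x ** 2 for x in xs]  # made-up curved data
res = residuals(xs, curved)

# Positive at both ends and negative in the middle: a U shape,
# which is the classic sign of curvature a straight line misses.
for x, r in zip(xs, res):
    print(x, round(r, 2))
```

If the residuals scatter around zero with no pattern, the straight line is doing its job; a systematic U or arch is the "ugly" case.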
posted by jtfowl0 at 6:50 AM on September 26, 2006


Also, it sounds like you are dealing with a non-random sample (first 26 elements) from the periodic table. Your problems could possibly be attributed to this if the first 26 elements are non-representative of the whole periodic table. IANA chemist, so I'm not sure if this is a problem or not.
posted by jtfowl0 at 6:54 AM on September 26, 2006


I think "linear regression" is being confused with "linear relationship."

It sounds to me as though the mean in this case refers to the average weight of an atom, whereas the median is the weight of the atom with the middle atomic number.

For a simple line of the form y=mx+b, sampled at evenly spaced values of x, the mean and median will always be the same. However, linear regression is not just limited to fitting lines. You can do a linear regression for polynomials of any degree you like.

Assume some data points are taken at evenly spaced points from [0,1] in the independent variable (x), and their values are found to follow a simple function, x^2. The average value of the data is 1/3 (the integral over the interval). The median value however is (1/2)^2 = 0.25 (the value of the middle data point). In this case, the median is lower, because x^2 has an ever increasing first derivative.
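
Here is a quick numerical check of that example, as a sketch. The choice of 101 evenly spaced points and the slope and intercept of the comparison line are arbitrary assumptions of mine.

```python
from statistics import mean, median

n = 101  # arbitrary odd count, so there is a single middle point
xs = [i / (n - 1) for i in range(n)]  # evenly spaced on [0, 1]

line = [2 * x + 1 for x in xs]   # arbitrary straight line
parabola = [x ** 2 for x in xs]  # the x^2 example

# For the line, mean and median agree (up to rounding).
print(mean(line), median(line))

# For x^2, the mean is close to the integral 1/3, the median is (1/2)^2.
print(mean(parabola), median(parabola))
```

With a finite number of points the mean of x^2 lands slightly above 1/3, but the conclusion is the same: the mean sits above the median.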

Now consider atoms. The mass is approximately given by the mass number, A = N + Z. While N is nearly proportional to Z, it's not exactly a line. You can get a good estimate for what N will actually be by maximizing the binding energy in the semi-empirical mass formula. Thus, the relationship between mass and atomic number, Z, is not strictly linear. This is well known, so it makes doing a linear regression using a polynomial of order 1 (a line) somewhat inappropriate. So, you go to higher orders; maybe you do a quadratic fit. In that case, you will in general get a mean and median that are different.

I believe the mean is actually higher than the median in this case, reflecting the fact that as the atomic number goes up, it takes increasingly more neutrons to keep the protons from flying apart due to the Coulomb interaction.
posted by dsword at 8:06 AM on September 26, 2006


Best answer: Y'all are making this out to be much more than it is.

Does atomic mass increase faster than atomic number? Dunno. ISTR that sometimes atomic mass actually goes down as atomic number increases because of isotope mixes.

Does mean>median mean that? No. Mean>median only means that the distribution is slightly skewed. The median and mean of atomic mass don't give a shit what atomic number is doing. You could have the mean of atomic mass be greater than the median even if atomic mass went down as atomic number increased, or if mass and number were utterly uncorrelated.
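
A small sketch of that point, using made-up numbers rather than real atomic masses (the `slope` helper is hypothetical): the same set of y values has the same mean and median whether it rises or falls with x.

```python
from statistics import mean, median


def slope(xs, ys):
    """Least squares slope of y on x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


xs = [1, 2, 3, 4, 5]
ys_up = [1, 2, 3, 4, 20]          # made-up values, rising with x
ys_down = list(reversed(ys_up))   # same values, now falling with x

# The slope flips sign, but mean and median do not move at all.
print(slope(xs, ys_up), mean(ys_up), median(ys_up))
print(slope(xs, ys_down), mean(ys_down), median(ys_down))
```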

Would this be a particularly smart regression to run in the real world? Probably not, for reasons that have nothing to do with statistics. If you wanted to predict the atomic mass of element 140, doing so from a regression of atomic number and mass would be silly. The right way to do that would be by building on atomic theory to figure out how many neutrons you'd need to (briefly) hold together an atom with 140 protons. But who cares? It was an intro stats problem, not somebody's dissertation.

Do the data as reported indicate any real problem? No. Any regression run with real data is going to be slightly heteroskedastic and slightly skewed and measured with some slight error and so on, and is almost certain to omit at least one variable. The Gauss-Markov assumptions are like assuming perfect competition. The various corrections are for actual no-shit deviations from the assumptions that are big enough to actually and for real fuck up your inferences. If anything, early students need to learn to apply corrections conservatively as their instincts seem to be to include quadratic and cubic terms all over the place just in case things are nonlinear and so on.
posted by ROU_Xenophobe at 8:09 AM on September 26, 2006


ISTR that sometimes atomic mass actually goes down as atomic number increases because of isotope mixes.

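Actually it happens once even in the first 26 elements: argon (Z = 18, about 39.95) is heavier than potassium (Z = 19, about 39.10). It happens again later in the table, for example with cobalt and nickel.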

This does reflect a difference between rates of change in atomic number and mass, but only because atomic number increases linearly and therefore serves as a good independent variable for plotting atomic mass. (Note that I'm not exactly refuting what ROU_Xenophobe has just said: it's not fundamentally because of a difference between increase in atomic mass vs. number, it's only fundamentally caused by the increase in mass being nonlinear. But because the nonlinearity is due solely to the neutron count, since increasing atomic number IS linear by definition, it's useful to view it that way.) As dsword points out, what it reflects is the nonlinear increase in the number of neutrons: on average, as you go up in atomic number, the rate of addition of neutrons increases. Up until around chlorine, you get roughly the same number of neutrons as protons, but after that you get more neutrons. To put it another way, the ratio of neutrons to protons in light elements is about 1:1; for heavier elements (around the lanthanides and later) it's closer to 1.5:1. This isn't a useful predictive model for any given element, since there are significant variations, especially in the intermediate region, but it averages about right.
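
For a feel of the numbers, here is a small sketch that computes the neutron-to-proton ratio for one common isotope of a few elements. The selection of isotopes is my own and is only meant as illustration, not as a model.

```python
# (label, protons Z, mass number A) for one common isotope of each element
nuclides = [
    ("He-4", 2, 4),
    ("O-16", 8, 16),
    ("Ca-40", 20, 40),
    ("Fe-56", 26, 56),
    ("Sn-120", 50, 120),
    ("Pb-208", 82, 208),
    ("U-238", 92, 238),
]


def neutron_proton_ratio(z, a):
    """Neutrons per proton, using N = A - Z."""
    return (a - z) / z


ratios = [neutron_proton_ratio(z, a) for _, z, a in nuclides]

for (label, _, _), ratio in zip(nuclides, ratios):
    print(f"{label:7s} N/Z = {ratio:.2f}")
```

The ratio stays at 1 for the light examples and climbs past 1.5 for lead and uranium, which is the nonlinearity in mass that shows up when you regress on atomic number.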
posted by solotoro at 9:29 AM on September 26, 2006


Do the data as reported indicate any real problem? No.

That's up for debate, given what data has been reported (not much, really), but certainly there are statistical tests to measure the normality of data, and thereby to determine whether nonparametric regression provides better results and therefore should be used.
posted by Blazecock Pileon at 7:04 PM on September 26, 2006

