StatFi
June 12, 2006 1:55 AM

If you truncate a distribution, how do you figure out the new mean? For example, if you know the average height is 6 foot, how do you figure out the average height of the top 50%, lower 30%, etc? You can make up a standard deviation, it's just an example. Yes, I am dumb.
posted by fucker to Education (12 answers total)
 
How much information do you have? IANA statistician, but it seems to me that you can't calculate the average of the top 50% given only the global average.

Consider these two sets:
A. 5', 7'
B. 4', 8'
The averages are the same (6'), but the averages of the top 50% are different (7' for A, 8' for B).

If you have all the samples, on the other hand, then just sort them, discard the lower half, and average the rest.
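
A minimal sketch of that sort-and-average step in R, assuming the samples sit in a plain numeric vector. The helper name top_mean and its frac argument are made up for illustration:

```r
# Mean of the top fraction of a sample: sort, keep the largest values, average them.
top_mean <- function(x, frac = 0.5) {
  x <- sort(x)
  keep <- ceiling(length(x) * frac)   # how many of the largest values to keep
  mean(tail(x, keep))
}

top_mean(c(5, 7))   # set A
top_mean(c(4, 8))   # set B
```

For the two sets above this gives 7 and 8, even though both sets have the same overall mean of 6.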

If you don't have the samples, but you do have more than just the average, surely you can come up with some kind of approximation. (I think knowing the shape of the distribution would be particularly helpful.) But I must leave that for someone who actually knows this stuff...
posted by equalpants at 3:12 AM on June 12, 2006


If your distribution is normal and you know the mean and standard deviation (or variance), you can use the expression on this Wikipedia page. It's expressed in terms of the cumulative distribution function and probability density function for the normal distribution, which are defined on the same page.
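
As a rough sketch, the standard truncated-normal mean can be written in R with the built-in dnorm (density), pnorm (cumulative distribution) and qnorm (quantile) functions. The helper name truncated_mean and its lo/hi arguments are assumptions for illustration, as is the standard deviation of 1:

```r
# Mean of a normal(mu, sigma) restricted to the interval (lo, hi).
# Closed form: mu + sigma * (dnorm(a) - dnorm(b)) / (pnorm(b) - pnorm(a)),
# where a and b are the cut points measured in standard deviations from mu.
truncated_mean <- function(mu, sigma, lo = -Inf, hi = Inf) {
  a <- (lo - mu) / sigma
  b <- (hi - mu) / sigma
  mu + sigma * (dnorm(a) - dnorm(b)) / (pnorm(b) - pnorm(a))
}

# Turn "top 50%" or "lower 30%" into cut points with qnorm().
mu <- 6; sigma <- 1
truncated_mean(mu, sigma, lo = qnorm(0.50, mu, sigma))  # top 50%
truncated_mean(mu, sigma, hi = qnorm(0.30, mu, sigma))  # lower 30%
```

With a mean of 6 and a standard deviation of 1, the top 50% comes out at about 6.798.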
posted by teleskiving at 4:20 AM on June 12, 2006


Why do you want to do this? If you are taking the top half of a normal distribution, it will not be normal, so I don't know what the mean gets you. Are you sure you wouldn't rather have the median? If you do have the original data, then just use that. If you only have parameters, then you can do a simulation.

Here's a simulation in R, with the standard deviation set to 1:

> X = rnorm(n=1000000, mean=6, sd=1)
> mean(X[findInterval(X, 6) == 1])
[1] 6.798343
> median(X[findInterval(X, 6) == 1])
[1] 6.674617


Of course, the median of the top half is just the 75th percentile of the whole distribution, which the quantile function gives directly (equation on the same Wikipedia page above):
> qnorm(p=0.75, mean=6, sd=1)
[1] 6.67449

posted by grouse at 6:16 AM on June 12, 2006


Just to clarify, you want the mean of the remaining group, after you remove the top 50%, bottom 30%, whatever, right?

You don't want to know what the data point at the 30th percentile is, right?
posted by Kwantsar at 6:54 AM on June 12, 2006


Response by poster: Kwantsar: Right. Right. If the average height of 200 freshmen is 6 foot, with a standard deviation of say 3 inches, and I completely get rid of the lower 50%, I assume I can figure out the new (even taller) average height of the remaining top 50%. Right? How?

I guess the opposite is kind of what I want too. If I know the average height of the freshman basketball team is 6 foot 6 inches, with a standard deviation of 2 inches, and that the basketball players are the tallest 25% of the freshman class, can I figure out the average height of the freshman class?

grouse: I don't know how to read your equation. Maybe I do want the median, I don't get how it works. My questions themselves may well be faulty.
posted by fucker at 7:30 AM on June 12, 2006


To add to my previous post, the same Wikipedia page also gives the inverse cumulative distribution function which you need in order to determine your lower truncation point.

If you're uncomfortable with using the information I've linked to, maybe someone will write it out step-by-step. But you will at least need to get erf from somewhere.

With regard to the opposite question, if the heights of the basketball players are normally distributed then they can't be exactly the upper tail of a normal distribution of student heights. Assuming that what you have is exactly the mean and std of the tail, there's bound to be a way of estimating the average height of the student population on that basis but I'll be impressed if there is an analytical solution!
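
One possible route is to match moments. This sketch assumes the team is exactly the top fraction p of a normal population and that the quoted mean and std are exactly those of that tail. It relies on the standard results that a normal tail cut at a standard units has mean mu + sigma * lambda and variance sigma^2 * (1 + a * lambda - lambda^2), with lambda = dnorm(a) / p. The function name population_from_tail is hypothetical:

```r
# Recover the whole-population normal(mu, sigma) from the mean and sd of its top fraction p.
# Assumes the group is exactly the top fraction p of a normal population (an idealization).
population_from_tail <- function(tail_mean, tail_sd, p) {
  a <- qnorm(1 - p)                          # cut point in standard units
  lambda <- dnorm(a) / p                     # tail mean in standard units
  shrink <- sqrt(1 + a * lambda - lambda^2)  # tail sd divided by population sd
  sigma <- tail_sd / shrink
  mu <- tail_mean - sigma * lambda
  c(mean = mu, sd = sigma)
}

# Team: mean 78 inches (6 ft 6 in), sd 2 inches, tallest 25% of the class.
population_from_tail(tail_mean = 78, tail_sd = 2, p = 0.25)
```

Under those assumptions the class would average roughly 72.8 inches with a standard deviation of about 4.1 inches.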
posted by teleskiving at 8:00 AM on June 12, 2006


To the OP: Why do you want to do this? Or is it just academic interest? Answering this question properly really depends on this.

If the average height of 200 freshmen is 6 foot, with a standard deviation of say 3 inches, and I completely get rid of the lower 50%, I assume I can figure out the new (even taller) average height of the remaining top 50%. Right? How?

No, but you can estimate it. The easiest way to do so is to use a computer program to do a simulation involving the parameters you have. I have given an example using the statistics language R.

I read the question to mean that you want a measure of central tendency (or average) for the top 50% of the population. There are several different measures of this. In general, using the arithmetic mean only makes sense if you can assume that the data fit certain kinds of distributions (such as the normal and uniform distributions). Since you have taken the top half of what you assume is a normal distribution, you can be pretty sure that the subset will be skewed and the median will be a better choice. This also has the nice side effect that, if you use the median instead of the mean, the answer to your question can be easily calculated.
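
A small sketch of that calculation in R, assuming a normal distribution with mean 6 and standard deviation 1. The helper names top_median and bottom_median are made up for illustration:

```r
# Median of the top fraction p of a normal(mu, sigma): half of that group lies
# above it, so it is the (1 - p/2) quantile of the whole distribution.
top_median <- function(mu, sigma, p) qnorm(1 - p / 2, mean = mu, sd = sigma)

# Median of the bottom fraction p: the p/2 quantile of the whole distribution.
bottom_median <- function(mu, sigma, p) qnorm(p / 2, mean = mu, sd = sigma)

top_median(6, 1, 0.50)     # same as qnorm(0.75, 6, 1)
bottom_median(6, 1, 0.30)  # same as qnorm(0.15, 6, 1)
```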

I find it unlikely that all of the tallest people in a class would be on the team. They are more likely to be sampled from the tall people. If your assumption were correct, however, I don't think you would be able to figure out the average height of the class only by knowing the data point at the 75th percentile. You would need at least one more data point.
posted by grouse at 8:01 AM on June 12, 2006


Hmmm, I wasn't thinking for a second, I guess you would have other data points. Sorry.
posted by grouse at 8:02 AM on June 12, 2006


Let me restate your question, as I understand it:

Someone tells you the average and standard deviation for some data set. You do not have access to the original data; just the mean and the standard deviation. A standard deviation by itself does not pin down the shape of the distribution, so assume the underlying distribution is normal. Now you want to know the mean of some subset of the data; for example, the mean of the values larger than the overall mean.

You are correct in assuming that you have enough information to compute this. The problem, as I understand it, is that the normal distribution function is difficult to analytically integrate. As a consequence there are no simple formulas for what you want. You need to use a numerical solution (i.e., a computer program).
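
For the specific case of the mean of a tail, the integral does have a closed form, because the standard normal density φ satisfies φ'(z) = -zφ(z). Here is a short derivation, writing Φ for the standard normal CDF and α for the cut point a in standard units (notation introduced here, not taken from the thread):

```latex
\begin{aligned}
\mathbb{E}[X \mid X > a] &= \mu + \sigma\,\mathbb{E}[Z \mid Z > \alpha],
  \qquad \alpha = \frac{a-\mu}{\sigma},\\
\mathbb{E}[Z \mid Z > \alpha] &= \frac{1}{1-\Phi(\alpha)} \int_{\alpha}^{\infty} z\,\varphi(z)\,dz
  = \frac{\bigl[-\varphi(z)\bigr]_{\alpha}^{\infty}}{1-\Phi(\alpha)}
  = \frac{\varphi(\alpha)}{1-\Phi(\alpha)}.
\end{aligned}
```

For the top half, α = 0 and φ(0) = 1/√(2π), so the mean is μ + σ√(2/π), or about μ + 0.798σ. Only Φ itself needs a table or software.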

Now, the mean (or the average) is just (x1 + x2 + ... + xn)/n. This ignores weights in order to keep things simple. On the other hand, the median is the data point for which half the data points are larger and half are smaller. I believe this is what most people understand average to mean (although they have been taught to compute it correctly). In a normal distribution, the mean and the median are the same; in most other distributions, they are not. In particular, in your "top 25% of a normal distribution" problem, they are not the same.

Or, in other words, what grouse wrote.
posted by treeshade at 9:47 AM on June 12, 2006


Response by poster: 1: No, but you can estimate it. The easiest way to do so is to use a computer program to do a simulation involving the parameters you have. I have given an example using the statistics language R.

2: You are correct in assuming that you have enough information to compute this. The problem, as I understand it, is that the normal distribution function is difficult to analytically integrate. As a consequence there are no simple formulas for what you want. You need to use a numerical solution (i.e., a computer program).


Ok, so first I need to know a lot more about statistics and buy a computer program, and learn how to use that program, and learn a computer language?! Eek. Thanks though.
posted by fucker at 10:30 AM on June 12, 2006


You don't need to buy it, R is free and open-source.

You've asked two questions.

First, you want to know something about the upper half of a distribution, given that you know something about the whole distribution. This seems like a job for the folded-normal distribution, aka the half-normal. You can probably find the mean and median for the folded version of the standard normal distribution (mean 0, sd 1) somewhere on the net.
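
For reference, a sketch of those half-normal constants in R, shifted and scaled to the freshman example from this thread (mean 72 inches, sd 3 inches). The variable names are illustrative, and the simulation at the end is only a sanity check:

```r
# Half-normal facts for a standard normal folded at 0:
half_mean   <- sqrt(2 / pi)   # mean of |Z|
half_median <- qnorm(0.75)    # median of |Z|

# Shift and scale to the top half of a normal(mu, sigma).
mu <- 72; sigma <- 3          # inches, from the freshman example in this thread
c(mean = mu + sigma * half_mean, median = mu + sigma * half_median)

# Quick simulation check of the two constants (values vary slightly with the seed).
set.seed(1)
z <- abs(rnorm(1e6))
c(mean(z), median(z))
```

That puts the mean of the top half at about 74.4 inches and its median at about 74.0 inches.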

Second, you want to know something about a whole distribution, given that you know something about the tail. This seems a very Bayesian problem. Your prior knowledge (or assumption) is that the whole distribution is normally distributed and that the sample you have is the top 25% of the whole. Based on that, what normal distribution is most sensible? I don't know enough about Bayesian statistics to have a sense of how that would be expressed in software.
posted by ROU_Xenophobe at 11:32 AM on June 12, 2006


If you are just looking for quick back of the envelope type stuff to characterize the two tails of a normal distribution:

Suppose the mean is X and the standard deviation is s. Or, to take your example, the mean is 72 inches and the standard deviation is 6 inches.

Roughly the 2nd percentile is X - 2s = 60 inches.
Roughly the 16th percentile is X - s = 66 inches.
Roughly the 84th percentile is X + s = 78 inches.
Roughly the 98th percentile is X + 2s = 84 inches.
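
The percentile that sits a given number of standard deviations from the mean can be checked with R's pnorm. This is a small sketch using the 72-inch mean and 6-inch standard deviation from this example, with illustrative variable names:

```r
mu <- 72; s <- 6                      # inches, as in the example above
k <- c(-2, -1, 1, 2)                  # number of standard deviations from the mean
tab <- data.frame(
  height = mu + k * s,
  percentile = round(100 * pnorm(mu + k * s, mean = mu, sd = s), 1)
)
tab
```

The percentile column comes out as 2.3, 15.9, 84.1 and 97.7 for heights of 60, 66, 78 and 84 inches.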

Statistically-minded people prefer to talk about the tails of a distribution in this way. Measuring the central tendency (average, mean, median, mode) of just a part of a normal distribution is sort of counterintuitive to the statistician, because it makes her discard the mathematical tools she has that allow her to know a great deal more about the curve than just its central tendency.
posted by ikkyu2 at 1:18 PM on June 12, 2006

