Averaging standard deviations for combined populations
July 20, 2015 3:35 AM

This seems like an easy question but my math skills are underdeveloped, and I do not understand the answers I've found through Google. I would like to know how to appropriately average together standard deviations when all I have is the means, sample sizes, and standard deviations for two or more groups.

This page has an example that's approximately similar to what I'd like to figure out (but an answer that does not help me):

"For a group of 50 male workers the mean and standard deviation of their daily wages are $63 and $9 respectively. For a group of 40 female workers these values are $54 and $6 respectively."

OK, I want to combine the male and female data. The mean is straightforward: the weighted average wage for the 90 workers is $59. The weighted average standard deviation is $7.7, but averaging standard deviations like this seems obviously wrong to me (since combining men and women is expanding both the higher and lower end of the distribution). But I can't reason through how to do this, or (apparently) even how to find the answer online.
posted by fucker to Education (19 answers total) 1 user marked this as a favorite
 
I guess it depends on what you mean by "appropriately average together standard deviations". If you mean to compute the standard deviation you would get if you merged the two datasets and recomputed, well I don't think you can do that without the actual full dataset.

If you assumed a normal distribution for each dataset then you could combine them either with [fancy math of some sort] or by making up a bunch of points for each input distribution, Monte Carlo style, then combining and computing mean and std dev for all the points. But that is seriously hand-wavey stuff and will rapidly lead you astray if your data isn't normal.
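
A rough sketch of that Monte Carlo idea, using the wage figures from the question. Everything here is an assumption for illustration: the normality of each group, the `SCALE` factor (made-up points per real worker), and the fixed seed. It only shows the mechanics and is not a recommendation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Summary statistics from the wage example in the question.
groups = [
    {"n": 50, "mean": 63.0, "sd": 9.0},  # men
    {"n": 40, "mean": 54.0, "sd": 6.0},  # women
]

SCALE = 200  # hypothetical: simulated points per real worker

simulated = []
for g in groups:
    # ASSUMPTION: wages within each group are normally distributed.
    simulated.extend(
        random.gauss(g["mean"], g["sd"]) for _ in range(g["n"] * SCALE)
    )

# Merge the made-up points and recompute the statistics directly.
combined_mean = statistics.fmean(simulated)
combined_sd = statistics.pstdev(simulated)
print(round(combined_mean, 1), round(combined_sd, 1))
```

The simulated points keep the 50:40 ratio of the two groups, so the merged set has the same mix as the original 90 workers.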

Also I don't know much statistics, so keep that in mind.
posted by ryanrs at 3:57 AM on July 20, 2015 [3 favorites]


There's a formula sheet for the "AP Statistics" exam that tells you how to do it....
posted by Jon44 at 4:31 AM on July 20, 2015


What's wrong with the webpage that you linked to? The standard deviation is the square root of the variance, so in the example on that page, the standard deviation of the combined data set would be √81 = $9.
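
For anyone who wants to plug in their own numbers, here is a minimal sketch of that calculation. The function name `combine` is made up for illustration, and it assumes each standard deviation was computed by dividing by n (not n-1).

```python
import math

def combine(groups):
    """Combine (n, mean, sd) triples into the (mean, sd) of the merged data.

    Assumes each sd was computed by dividing by n (population convention).
    """
    total_n = sum(n for n, _, _ in groups)
    combined_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Each group contributes its own spread plus the squared offset of
    # its mean from the combined mean.
    combined_var = sum(
        n * (sd ** 2 + (mean - combined_mean) ** 2) for n, mean, sd in groups
    ) / total_n
    return combined_mean, math.sqrt(combined_var)

mean, sd = combine([(50, 63, 9), (40, 54, 6)])
print(mean, sd)  # 59.0 9.0
```

The list can hold more than two groups, which covers the "two or more groups" case in the question.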
posted by Johnny Assay at 4:40 AM on July 20, 2015 [2 favorites]


Response by poster: Well I have one person saying it's impossible (comment with most favorites) and two saying it isn't, so I'm getting more confused.

What's wrong with the page I link to? I'm not savvy enough to understand the text or interpret the equations. All I could tell was that it ended with "= 81", which was obviously not the answer I was looking for.

But ok, I'm supposed to take the square root. I'll go try to plug my own numbers into their equation, and see if I can get something that resembles a plausible answer.
posted by fucker at 5:44 AM on July 20, 2015


Bottom line: estimating statistics from other statistics, rather than a fresh sample of the population, is a recipe for disaster - especially when, as in this case, you really have two populations with different fundamental characteristics. You can trivially derive the weighted mean of the male and female samples, but it's really not meaningful in any way, since it's going to inevitably fall between the two groups. The deviations, meanwhile, would now have to be calculated from the new mean, but since you don't know anything about the individual points making up these samples, you can't do that: standard deviations care about the squared deviations from the mean, so you can't just apply a linear transformation to the statistic to adjust for a change in mean.

Unfortunately, you're asking for the impossible.
posted by fifthrider at 6:14 AM on July 20, 2015 [6 favorites]


sorry for my previous (now deleted) answers. i'll try again, more slowly.

first, i think you are confusing variance and standard deviation? the answer of 81 in the page you linked to is variance. so the standard deviation is 9 (the square root of that). so that may be the answer you are looking for.

second, is that page describing what you really want to do? there are lots of ways to combine means and standard deviations. and the "right" answer depends on exactly what you are trying to do.

the page you linked to is saying something like: we measured some results for men. then we measured some results for women. now we'd like to know what answer we would have got if we had measured men and women together.

the problem with that is that much of the maths about means and standard deviations is based on the idea that the values are distributed "normally" - in a gaussian curve, with a single "hump". but men's and women's wages are probably not distributed like that. they probably come from a curve (a "population") with two humps. so adding them all together is a bit worrying.

also, at that site, they seem to have some mistakes in their formulae (the means are not squared consistently as far as i can tell). i am a bit worried about the site in general, so i looked for an alternative and found this page which has an example and a formula (near the bottom) that is much clearer. it should be the same as the one on the page you linked to, but i trust it more.

so if what i describe above is what you really want then i suggest using the formula at the page i linked to. the example there is worth reading too.
posted by andrewcooke at 6:34 AM on July 20, 2015 [2 favorites]


"What's wrong with the webpage that you linked to?"
Everywhere it says Xc2 it should be saying Xc with a bar over it. Other than that it looks okay at first glance.

Standard deviation is a well-defined formula whether or not the data is in a Gaussian distribution, so you certainly can compute it, and the formulas on these pages look reasonable (they may look different, but assuming that they're correct, that's because you can group terms in different ways, plus some people divide by n-1 instead of n for unimportant reasons). It may not have quite the meaning that you expect, though (if you are expecting it to mean the width of a bell curve).
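
If the standard deviations you were given used the n-1 divisor, the same idea still works with a small change. This is a sketch under that assumption; the name `combine_sample_sd` is hypothetical, and the wage example does not actually say which divisor was used.

```python
import math

def combine_sample_sd(groups):
    """Merge (n, mean, sd) triples whose sd values used the n-1 divisor."""
    total_n = sum(n for n, _, _ in groups)
    combined_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # (n - 1) * sd^2 recovers each group's sum of squared deviations.
    within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    # Extra spread from each group's mean sitting away from the combined mean.
    between = sum(n * (mean - combined_mean) ** 2 for n, mean, _ in groups)
    return math.sqrt((within + between) / (total_n - 1))

print(round(combine_sample_sd([(50, 63, 9), (40, 54, 6)]), 2))  # 8.98
```

With groups this size the two conventions differ only in the second decimal place.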
posted by dfan at 6:39 AM on July 20, 2015 [3 favorites]


The reason you're getting the different responses is, I think, that the different formulas being cited here are reasonable estimates of the pooled standard deviation but none of them are the exact pooled standard deviation (except by dumb luck).

Think about what the sd is -- the square root of the variance. And the variance is the sum of squared deviations from the mean divided by N. For each observation, it's the difference between itself and the mean, squared.

If we want to put two samples together, we can easily get the combined mean -- it's just the weighted average of the two means. But the combined sd is not the weighted average of the two sd's because of all that squaring and square-rooting. What you need to get the combined sd is the combined variance, which is the average squared deviation from the combined mean. The formula you linked to and the other from stackexchange are slightly different but both reasonable ways to estimate the new sum of squared deviations from the information you have available. What both formulas are doing is trying to put together the new SSD from the old SSDs plus the difference between each individual mean and the combined mean.

The reason some of the answers here are saying that it's impossible is that you're only going to get an estimate of the combined sd, and one that is virtually guaranteed to be a little bit off. The reason is that you can't fully reconstruct the data from the summary statistics. You know that the sum of squared deviations in Sample 1 is, say, 100, but you don't know how that 100 was put together -- were the squared deviations spread more or less evenly around 0, so you got the 100 with squared deviations of 5 and 10 and 20? Or did you get the 100 with 30 squared deviations of 1 and one of 70? The exact combined sd depends on exactly how the original sample SSDs were put together, and you don't get to know that.

Personally, I don't think that an insistence on statistical purity is usually helpful, since estimating something is usually better than whatever default decision arises from admitting defeat before you even started. But you might think about what the stakes in your actual example are. Are lives or lots of money on the line, like if these were the doses in individual pills from two different machines? Then it's worth spending some effort to get the actual observations back and compute the combined sd directly. But if it's of vague, mostly academic interest -- you only have some summary statistics about people in each of the states in 1860 and you want to make some statements about the Union and Confederate populations -- then either of the formulas here is good enough.
posted by ROU_Xenophobe at 7:19 AM on July 20, 2015 [3 favorites]


@ROU_Xenophobe - be careful - i am pretty sure that the formula for the pooled mean square deviation is exact. the proof proceeds by expanding the sum and discarding things that sum to zero, and the result is the formula in the first link.
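
A sketch of that expansion, for reference. The notation is an assumption: group k has n_k values x_{ki} with mean \bar{x}_k and variance \sigma_k^2 (divisor n_k), N is the total count, and \bar{x}_c is the combined mean.

```latex
\begin{align*}
\sum_{i=1}^{n_k} (x_{ki} - \bar{x}_c)^2
  &= \sum_{i=1}^{n_k} \bigl[(x_{ki} - \bar{x}_k) + (\bar{x}_k - \bar{x}_c)\bigr]^2 \\
  &= \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)^2
     + 2(\bar{x}_k - \bar{x}_c)\underbrace{\sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)}_{=\,0}
     + n_k(\bar{x}_k - \bar{x}_c)^2 \\
  &= n_k\sigma_k^2 + n_k(\bar{x}_k - \bar{x}_c)^2 ,
\end{align*}
\[
\sigma_c^2 = \frac{1}{N}\sum_k n_k\bigl[\sigma_k^2 + (\bar{x}_k - \bar{x}_c)^2\bigr],
\qquad N = \sum_k n_k .
\]
```

The cross term vanishes because deviations from a group's own mean always sum to zero, so nothing about the individual data points is needed.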

so if by "standard deviation" you mean "of an assumed normal distribution" then you're right (the problem is that the assumption of normality is incorrect). but if you mean "square root of the mean square deviation" then you're wrong (there's no problem with the maths, as far as i know - it's exact, not an approximation) (i haven't checked this particular formula, but from what i remember of this type of proof, it is ok).

i think this is what dfan implies too, by "you certainly can compute it".
posted by andrewcooke at 8:02 AM on July 20, 2015 [1 favorite]


"The deviations, meanwhile, would now have to be calculated from the new mean, but since you don't know anything about the individual points making up these samples, you can't do that: standard deviations care about the squared deviations from the mean, so you can't just apply a linear transformation to the statistic to adjust for a change in mean."

This is true, but the variance depends linearly on the sum of the squares of the data:

σ^2 = [ Σ (x_i - x̄)^2 ] / n = [ Σ x_i^2 ] / n - x̄^2.

(The latter identity can be proven by expanding out the square in the first sum.) This means, in particular, that if you know the sample size, mean, and standard deviation of a population, then the sum of the squares of the data points is given by

n (σ^2 + x̄^2) = Σ x_i^2.

This means that if you know n, σ, and x̄ for each individual subpopulation, you can calculate Σ x_i^2 for each subpopulation; the Σ x_i^2 for the total population is then the sum of these numbers, and from this you can use the above formula in reverse to find σ for the population as a whole. (This is basically a sketch of how to derive the formula found by andrewcooke.)
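
Here is a minimal sketch of that recipe applied to the wage example. The function name is made up for illustration, and it assumes every σ was computed with divisor n.

```python
import math

def combined_sd_via_sum_of_squares(groups):
    """groups: iterable of (n, mean, sd), with sd computed using divisor n."""
    total_n = sum(n for n, _, _ in groups)
    total_sum = sum(n * mean for n, mean, _ in groups)
    # n * (sd^2 + mean^2) recovers each group's sum of squared data values.
    total_sum_sq = sum(n * (sd ** 2 + mean ** 2) for n, mean, sd in groups)
    combined_mean = total_sum / total_n
    # Apply the identity in reverse: variance = mean of squares - square of mean.
    combined_var = total_sum_sq / total_n - combined_mean ** 2
    return combined_mean, math.sqrt(combined_var)

print(combined_sd_via_sum_of_squares([(50, 63, 9), (40, 54, 6)]))  # (59.0, 9.0)
```

One caveat: subtracting two large, nearly equal numbers can lose floating-point precision when the means are very large compared with the spread, so this form is best for modest values like these.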

"so if what i describe above is what you really want then i suggest using the formula at the page i linked to. the example there is worth reading too."

There's a small error in that formula as well: the second n_1 in the first set of brackets should be an n_2. I've put in a request for an edit, but it won't show up until it's approved by the community over there.
posted by Johnny Assay at 8:07 AM on July 20, 2015 [3 favorites]


Anyone who claims that an exact formula for combined variance does not exist should be able to supply a counterexample: supply populations X1, X2, and Y such that X1 and X2 have the same mean and variance, but the combined population of X1 and Y has a different variance from that of the combined population of X2 and Y. (I was unable to do so.) That would certainly make it clear that there is not an exact formula.
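
To make the challenge concrete, here is a small sketch with made-up data. X1 and X2 are different data sets with the same size, mean, and variance; merging either one with Y gives the same combined variance. One example proves nothing on its own, but it shows the kind of check a counterexample would have to fail.

```python
import math
import statistics

# Made-up data: two different sets with the same size, mean and variance.
x1 = [0, 3, 3, 6]
x2 = [1, 1, 4, 6]
y = [10, 12, 20]

assert statistics.fmean(x1) == statistics.fmean(x2) == 3.0
assert statistics.pvariance(x1) == statistics.pvariance(x2) == 4.5

# Merge each set with y and recompute the variance from the raw values.
var_1 = statistics.pvariance(x1 + y)
var_2 = statistics.pvariance(x2 + y)
print(math.isclose(var_1, var_2))  # True
```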
posted by dfan at 8:22 AM on July 20, 2015


On the broader issue of whether it can be done or not, and whether or not it is meaningful: I think the main distinction that needs to be drawn is whether you have complete population data, or just sample data. In the example you linked to, there is an implicit assumption that we are able to get a complete set of data for the population in question. If you do have complete data, then the formula above is exactly correct. For example, say you're looking at a company that has 50 male employees and 40 female employees; then the mean wage of the employees and the standard deviation among the wages of all employees can be calculated exactly from knowing the means and standard deviations of the wages among the men as a group and among the women as a group, and andrewcooke's formula is exact in this case.

If, on the other hand, the 50 men and 40 women are drawn from a larger population, and you're using the standard deviation of the samples to estimate the spread of values in the population, then that's a much thornier issue. There are probably ways to do it, but you'd need to do things like account for the relative numbers of men and women in the company. (Is it exactly 5:4? Probably not, especially given the way sample sizes are chosen; they are usually not directly proportional to the size of the populations in question.)
posted by Johnny Assay at 8:40 AM on July 20, 2015 [2 favorites]


"i am pretty sure that the formula for the pooled mean square deviation is exact"

This is what I get for cooking examples in excel.
posted by ROU_Xenophobe at 8:48 AM on July 20, 2015 [1 favorite]


You want the formula for the "linear combination of variance". It gets you the variance, but obviously you can calculate SD from that. So google that phrase. Here's one tutorial page on that.
posted by If only I had a penguin... at 9:19 AM on July 20, 2015


what penguin has linked to is the answer to a different question: something more like, if i added the mean of the men's wages to the mean of the women's wages, what would be the standard deviation of the result?

that is not what you, OP, seemed to be asking. which is why it contradicts everything else here.
posted by andrewcooke at 9:25 AM on July 20, 2015


On more careful reading, it looks like what you're doing isn't linear combination of variance. You're not creating a variable that's a linear combination of other variables, you're adding more cases. So you're only "adding" in the sense of creating additional cases, not of combining linearly by addition.
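
A small sketch of the difference, using the wage figures from the question. The independence of the two variables in the first calculation is an assumption added for illustration.

```python
import math

n_m, mean_m, sd_men = 50, 63.0, 9.0
n_w, mean_w, sd_women = 40, 54.0, 6.0

# Linear combination: a NEW variable Z = X + Y. If X and Y are
# independent (an assumption here), the variances simply add.
sd_of_sum = math.sqrt(sd_men ** 2 + sd_women ** 2)

# Adding cases: the same variable measured on more people. The spread
# also depends on how far each group mean sits from the combined mean.
n = n_m + n_w
mean_all = (n_m * mean_m + n_w * mean_w) / n
var_all = (
    n_m * (sd_men ** 2 + (mean_m - mean_all) ** 2)
    + n_w * (sd_women ** 2 + (mean_w - mean_all) ** 2)
) / n
sd_pooled = math.sqrt(var_all)

print(round(sd_of_sum, 2), sd_pooled)  # 10.82 9.0
```

The two numbers answer different questions, which is why the linear-combination formula does not match the other answers in the thread.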
posted by If only I had a penguin... at 9:25 AM on July 20, 2015


Yes, what Andrew said. My bad. Sorry.
posted by If only I had a penguin... at 9:27 AM on July 20, 2015


yeah, i actually said the same in an answer that i then asked the mods to delete (taz or whoever - feel free to delete my replies here if you want to clean this up). it's all a bit confusing.
posted by andrewcooke at 9:29 AM on July 20, 2015


Response by poster: Based on Johnny Assay's tip I've been using the formula from the page I originally linked to, and it seems to be giving plausible results.
posted by fucker at 2:42 PM on July 20, 2015

