# Why leave out that one error bar?
January 24, 2006 3:37 PM

A statistics / scientific convention question. I've noticed in scientific journals that often when a set of data is presented with values normalized to one of the sample groups, and the value for that sample group is arbitrarily set to 1, 10, 100 or whatever, to simplify interpretation, the variability/error data for that one sample group is left out. Is there a good statistical reason for that or is it just some random convention with no good reason?

Here's an example: you have a set of data on the height of trees according to their age (say trees that are 5, 10 and 20 years old). You calculate the mean height and standard deviation for each age group. For whatever reason, you want to normalize the mean values for all three groups to the 5-year-old group and set that value to 1 to present the data. My question is why would people not show the standard deviation (adjusted for the normalization) for the 5-year-old group along with those for the other two groups.
posted by shoos to Science & Nature (17 answers total)

I had a long explanation, but I couldn't explain it very well anyway, so here's a shorter one:
To account for different outside conditions when an experiment is repeated at a different time, it's often useful to always normalize to an internal control that was taken the same day as the original data set. So on April 11 you measure something and normalize to the April 11 control, and on May 15 you repeat the experiment and normalize to the May 15 control. That way you rule out external influences that are very different on both days. (Maybe the airco was on in May but not yet in April.) Since they're both normalized to the internal control, both sets of data have a 100% control sample, and other variations are really due to whatever you're measuring.
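In code, that per-day normalization might look something like this (a minimal sketch with made-up numbers; the function name and values are hypothetical):

```python
# Hypothetical sketch: express each day's measurements as a percentage
# of that day's internal control, so every data set has a 100% control
# sample regardless of day-to-day conditions.
def normalize_to_control(measurements, control):
    """Return each measurement as a percentage of the control value."""
    return [100.0 * m / control for m in measurements]

april = normalize_to_control([2.1, 3.4], control=2.8)
may = normalize_to_control([1.6, 2.5], control=2.0)
# The control itself is 100% by construction on both days, so any
# remaining differences reflect the effect being measured, not the day.
```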

I can't explain this very well at all, and it doesn't fit with the tree example. But basically: the sets were individually set to the normalized value, and the error given is the one AFTER normalization (so it's 0 for the one that it's normalized to)
posted by easternblot at 3:59 PM on January 24, 2006

Easternblot, I understand what you mean, but that's a somewhat different question. The normalization you're describing is normalization to some relatively constant standard (say GAPDH signals in a northern blot :)) I'm talking about normalization to one of the experimental groups for the purpose of simplifying data interpretation, without showing the error/variation for the group to which the other groups are normalized.

I don't know if you have access, but figures 1B, 1C and 2C in this article published a few weeks ago show examples of what I'm talking about. (But, strangely, figure 5C does show the error for the normalizing group).
posted by shoos at 4:32 PM on January 24, 2006

My guess is that when you do the normalization, you set the standard deviation for the normal group to zero (i.e. the result for that group is defined as having a value of exactly 1, 10, or 100). The error in measuring the normal group remains in the data, however, as it is propagated through to the other normalized values according to the standard technique.

Intuitively, this seems valid. For whatever my intuition is worth....

As for figure 5C in that paper you've linked to, I have no idea.
posted by mr_roboto at 5:32 PM on January 24, 2006

shoos, that link needs a sign-in.

It's an unusual way to do it, but a valid way would be to express all the uncertainty in the normalized data. For example, consider a data set a ± u_a, b ± u_b, c ± u_c. If one normalizes on say b, one could then plot a/b ± a/b*√(u_a² + u_b²), b (with no error bar) and c/b ± c/b*√(u_b² + u_c²).

I can't imagine how someone would think that that's a desirable way of doing things (unless b is an internal control), but it's mathematically correct. Is this what your authors are doing?
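A minimal sketch of that kind of propagation in code, using the standard quotient rule (relative uncertainties combined in quadrature); the values and function name are made up for illustration:

```python
import math

# Sketch of error propagation when dividing each group mean by the
# mean of the normalizing group b: combine the *relative* uncertainties
# in quadrature, as the usual quotient rule prescribes.
def normalized_with_error(x, ux, b, ub):
    """Return x/b and its propagated absolute uncertainty."""
    ratio = x / b
    rel = math.sqrt((ux / x) ** 2 + (ub / b) ** 2)
    return ratio, ratio * rel

a, ua = 4.0, 0.4   # hypothetical group means ± standard deviations
b, ub = 2.0, 0.1
c, uc = 6.0, 0.9

print(normalized_with_error(a, ua, b, ub))  # a/b with propagated error
print(normalized_with_error(c, uc, b, ub))  # c/b with propagated error
# b/b is exactly 1 with no error bar under this convention.
```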
posted by bonehead at 5:35 PM on January 24, 2006

d'oh! That middle term is of course b/b (with no error bar).

....and mr_roboto beats me to it anyway. JINX!
posted by bonehead at 5:38 PM on January 24, 2006

I agree with bonehead; I think it's clearer to just divide everything by an exact number equal to the mean of the normal group and leave all the error bars on.
posted by mr_roboto at 6:12 PM on January 24, 2006

Ok, here's an article (pdf) that should be accessible to anyone in which they do the same thing, in figures 4, 5 and 6. All that is said about the error bars is that they represent standard deviations.

Since I've never even heard of the method bonehead describes being used in biology research (the field I'm in), and haven't seen it suggested anywhere in the papers I've seen that do this sort of normalization, I'd doubt that that's what they are doing, although I may just be out of it.
posted by shoos at 6:54 PM on January 24, 2006

(and I see I managed to get the math wrong anyway. Those uncertainties in the square roots are relative uncertainties, not absolute ones).
posted by bonehead at 6:56 PM on January 24, 2006

The "it's clearer" hypothesis seems to be carrying the day, but experimentally I think the general idea is this:

When you use normalization, you're not making any comparisons between your experimental data and the "internal control" group that you divided by.

Instead, you're comparing two different groups, each normalized on its own to an analogous baseline. So, the idea is that you don't need the variance for the normalization group, since you'll never ever run statistics on it. So, it just muddies the water and you can leave it out.

Of course, the rest of us have to believe that there are good reasons to choose a particular normalization factor. I've certainly seen papers that made no sense because the normalization was inappropriate - but usually the pre-normalization data has to be shown before one can get away with it.

(For example: you want to compare the growth rate of 20yr old trees between North and South America. To control for variation in tree type and whatnot, you normalize by the growth rate for 5 year old trees. In this scenario, you're not comparing anything to the population of 5 year old trees, so its variance is meaningless.)

Does that jibe?
posted by metaculpa at 6:58 PM on January 24, 2006

shoos writes "Since I've never even heard of the method bonehead describes being used in biology research (the field I'm in), and haven't seen it suggested anywhere in the papers I've seen that do this sort of normalization, I'd doubt that that's what they are doing, although I may just be out of it."

The method bonehead describes is run-of-the-mill error propagation (give or take a couple of typos). I wouldn't expect them to describe something so mundane.

metaculpa writes "In this scenario, you're not comparing anything to the population of 5 year old trees, so its variance is meaningless."

Hold on, though: if you're measuring the growth rate of the 5-year-old trees, you need to propagate through the error on that measurement to the normalized growth rates for the 20-year-old trees, right? So the variance on that measurement does matter in that it will increase the variance of your reported data.
posted by mr_roboto at 7:35 PM on January 24, 2006

I'm kinda surprised that the reviewers for MCB, a 9 or even 10-ish impact factor journal, let it slide. Either the authors got lucky, their findings were really astounding, or well, they got lucky with the reviewers they drew.
posted by PurplePorpoise at 8:04 PM on January 24, 2006

mr_roboto: But how can you? You're normalizing by the pooled mean of 5 yr old trees (measured at the same time as 20 yr old trees), not by a paired measurement of each 20yr old tree to its 5yr old self. So how can the error propagate?

That said, the MCB paper looks like they _are_ looking at differences between groups and their normalized counterparts. That's a different thing altogether, and I do agree that they need error bars.
posted by metaculpa at 8:58 PM on January 24, 2006

I wouldn't expect them to describe something so mundane.

Mundane or not, you would expect them to point out what their error bars mean, wouldn't you?

And this is really common. I found those two articles after just 4-5 minutes of looking for examples.
posted by shoos at 9:37 PM on January 24, 2006

metaculpa writes "So how can the error propagate?"

I'm not sure I'm getting your point. This is just dividing one random variable by another random variable. The variance of the result is well-defined, and is easily calculated given the variances of the two input variables.
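For what it's worth, a quick Monte Carlo check (with made-up means and standard deviations) shows the spread of such a ratio matching the quotient propagation formula when the relative errors are small:

```python
import random
import statistics

# Hypothetical numbers: simulate dividing one random variable by
# another and compare the observed spread of the ratio to the
# first-order quotient propagation formula.
random.seed(0)
mean_x, sd_x = 10.0, 0.5
mean_b, sd_b = 5.0, 0.2

ratios = [random.gauss(mean_x, sd_x) / random.gauss(mean_b, sd_b)
          for _ in range(100_000)]

observed = statistics.stdev(ratios)
predicted = (mean_x / mean_b) * ((sd_x / mean_x) ** 2
                                 + (sd_b / mean_b) ** 2) ** 0.5
print(observed, predicted)  # the two agree to within a few percent
```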

shoos writes "Mundane or not, you would expect them to point out what their error bars mean, wouldn't you?"

I would expect them to give the number of standard deviations represented by the error bars and the number of trials. I wouldn't expect them to explain their method of normalization, unless they did something nonstandard.

PurplePorpoise writes "I'm kinda surprised that the reviewers for MCB, a 9 or even 10-ish impact factor journal, let it slide."

Is there something fundamentally wrong with this approach?
posted by mr_roboto at 10:10 PM on January 24, 2006

No, I can't spell out exactly why this is a sub-optimal way of presenting data, other than what shoos suspects to be the problem.

No, it's not necessarily wrong, but it's just kind of sloppy (sorry, I'm having problems with my school's proxy from home right now, so I haven't read the paper). If the authors said something in the text explaining why they chose to present the data as they did - hey, it's all ok.

But if a 9-ish journal's reviewers were ok with it, it's probably not a misrepresentation.

Looking at the abstract - it's an interesting finding but by no means seminal, ground-breaking work. siRNA is notorious for not yielding extremely clean-cut results, which may have tempered the acceptance of the statistical prestidigitation.

To reiterate, there's nothing fundamentally wrong, but it's probably not the most conservative way to present the data. However, the kind of data being reported doesn't necessarily require a conservatively stringent measurement.

I work (or will have worked, in a few weeks) in cancer biology; there are very many kinds of cancer. Hell, cancer that affects one organ/tissue type can be subclassified into many different kinds, and even tissue-specific cancers with the same name can be divided into many subtypes. I've looked at the quantitation of various things in primary haematopoietic blasts from different patients and they've all been different. This paper is more a "hey, this is another thing that could be a problem (and perhaps this kind of problem can be checked in patients with this kind of cancer in hopes that a therapy for this specific thing can be discovered)" so it's more acceptable for this kind of statistical handwaving.

The original example of tree growth - it's actually a decent analogy except that it lacks the caveat that the trees being measured are growing in vastly different soil types, at different latitudes, are under the pressure of different pathogens, or are completely different species of trees.

So, basically, this type of study - I guess - can get away (because it's acknowledged that more stringent analyses may be fruitless) with being a little "sloppy."

posted by PurplePorpoise at 10:54 PM on January 24, 2006

Thanks for all the input!
posted by shoos at 12:35 AM on January 25, 2006

I looked at the paper. In figures 1 and 2 they did do what I tried to say in my example: for every transfection with mock vector and expression plasmid they normalized to the value for the mock vector, so that is now (by definition) 1 in every experiment, and has no error bars when the average over x experiments is calculated.
In figure 5, they're looking at tumor cell invasiveness, and for some reason they normalized AFTER taking the average, so the error for the control is not zero. I don't know why they did it differently there, but it's a totally different experiment than the luciferase assays. Maybe convention? I don't know. I've done luciferase assays but no invasions.
posted by easternblot at 8:30 AM on January 26, 2006
