Comments on: When calculating a statistical average, is it reasonable to remove outliers?

Question: When calculating a statistical average, is it reasonable to remove outliers?

JPigford — Fri, 13 Jan 2012 10:48:06 -0800

When calculating a statistical average, is it reasonable to remove outliers?

Say you have the following amounts of time it took to complete a task:

1:32:49
9:54
0:46
0:32
0:38
0:28
0:01

Is it reasonable to remove the 1:32:49, 9:54 and 0:01 since those are almost certainly anomalies?

By: saeculorum

saeculorum — Fri, 13 Jan 2012 10:51:59 -0800

This question requires more detail. Please provide details on what you are trying to do. Otherwise, your question is equivalent to "when walking down a sidewalk, is it reasonable to use the left side of the sidewalk?".

By: supercres

supercres — Fri, 13 Jan 2012 10:53:00 -0800

Depends what you're looking for. Are they informative?

Would a median be a better measure, perhaps?

By: Holy Zarquon's Singing Fish

Holy Zarquon's Singing Fish — Fri, 13 Jan 2012 10:54:07 -0800

Also, if your example dataset is representative of the figures you're actually averaging, 3/7 of your results going to an apparently unreasonable extreme is less an anomalous outlier to be massaged out of the statistics and more of an issue that needs investigation.

By: JPigford

JPigford — Fri, 13 Jan 2012 10:54:55 -0800

I'm trying to get an average amount of time it took to complete a task. Yes, they are for information purposes.

ie. It took users on average, 37 seconds to complete this task.

By: JPigford

JPigford — Fri, 13 Jan 2012 10:55:56 -0800

@Holy: Assume there are more like 1000 data points but still only 3-4 outliers.

By: gramcracker

gramcracker — Fri, 13 Jan 2012 10:57:14 -0800

No, it's better to use a median than delete certain data points.

By: homotopy

homotopy — Fri, 13 Jan 2012 10:57:29 -0800

From the sounds of it, taking the median or interquartile mean would work for your purpose.
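
As a rough illustration of how these measures differ on the sample from the question, here is a minimal Python sketch. The times are converted to seconds, and `interquartile_mean` is a hypothetical helper using a simplified definition (drop a quarter of the points from each end, rounded down) rather than the weighted textbook version.

```python
from statistics import mean, median

# Task times from the question, converted to seconds:
# 1:32:49, 9:54, 0:46, 0:32, 0:38, 0:28, 0:01
times = [5569, 594, 46, 32, 38, 28, 1]


def interquartile_mean(values):
    """Simplified interquartile mean: drop the lowest and highest quarter
    of the points (rounded down to whole points), then average the rest."""
    ordered = sorted(values)
    k = len(ordered) // 4  # points trimmed from each end
    return mean(ordered[k:len(ordered) - k])


print(f"mean:   {mean(times):.1f} s")
print(f"median: {median(times)} s")
print(f"interquartile mean (simplified): {interquartile_mean(times):.1f} s")
```

On these seven values the mean comes out at roughly 901 seconds (about 15 minutes), while the median is 38 seconds. With so few points the simplified interquartile mean still includes the 9:54 value, so it is better judged on the full data set.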

By: yoink

yoink — Fri, 13 Jan 2012 10:58:26 -0800

I would report both figures (i.e., average with outliers and average without) and then offer some interpretive explanation for the outliers that would justify excluding them: i.e., is it possible that in a few instances people were called away from the task by some external distraction? If there is no such explanation available then there is no reason to delete the outliers.

By: saeculorum

saeculorum — Fri, 13 Jan 2012 10:59:21 -0800

To further explore what supercres and Holy Zarquon have already said, a useful analogy would be net worth in the United States. The average ("mean") net worth in the US is about $500k. The median net worth is about $100k. The difference is due to the gross wealth inequality in the US - the wealth at the top end of the population would be an "anomaly" under your interpretation. That doesn't mean the average net worth statistic is wrong - it just needs some more interpretation and context. As a result, in a lot of circumstances, the average is not a very useful statistic.

You'll need a fancier statistic than the average if you want to exclude the outlying data points. That's not a bad thing, it just requires more definition on your part.

By: DrGail

DrGail — Fri, 13 Jan 2012 11:13:49 -0800

"Assume there are more like 1000 data points but still only 3-4 outliers"

If you have about 1000 data points, the outliers probably aren't going to impact your average very much. That is, after all, the whole point of having a large sample. But certainly, as yoink suggests, compute the average both ways. If they differ by much, then report them both and offer some explanation.

By: Blazecock Pileon

Blazecock Pileon — Fri, 13 Jan 2012 11:16:48 -0800

If you want to mitigate the effect of outliers, take a median.

If you want to visualize your distribution of outliers, use a box-and-whisker plot. Even if you take a median, you might still have a bimodal (or x-modal) distribution that gives lots of "outliers" (ignoring your sample data, for a moment).
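
A box-and-whisker plot usually draws its whiskers with the 1.5 x IQR rule, and the same rule can be computed without plotting anything. The sketch below is one possible version using Python's standard library; the quartile method is the library default, and other conventions would move the fences somewhat.

```python
from statistics import quantiles

times = [5569, 594, 46, 32, 38, 28, 1]  # seconds, from the question

# Quartiles as computed by statistics.quantiles (default "exclusive" method);
# other quartile conventions give slightly different fences.
q1, q2, q3 = quantiles(times, n=4)
iqr = q3 - q1

# Conventional 1.5 * IQR whisker limits used by many box plots.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

flagged = [t for t in times if t < low_fence or t > high_fence]
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"fences: [{low_fence}, {high_fence}]")
print(f"points beyond the whiskers: {flagged}")
```

With only seven points the quartiles are themselves pulled around by the extremes (the third quartile lands on the 9:54 value), so only the 1:32:49 time is flagged here. The rule is more informative on a larger sample.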

By: NoDef

NoDef — Fri, 13 Jan 2012 11:17:55 -0800

There are many statistical tests to estimate the statistical significance of an outlier. Don't just drop the numbers, find a justification (T test, etc.) for why they are not statistically relevant.

Don't just use the median as the replacement for a mean when you have outliers - unless you really are just looking for the median value (it doesn't sound like this is the case...)

Assuming the mean will not be impacted by outliers when the sample set is large also does not always work - since the outlier may be close in magnitude to the number of samples and still impact the result (outlier of 1000 in the example above).
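
A quick way to check how far a handful of extreme values can move the mean of a large sample is to compute it both ways. The sketch below assumes a made-up data set: 997 ordinary completions of 36 seconds each (an assumed typical value, not real data) plus the three suspect times from the question.

```python
from statistics import mean

# Hypothetical data set: 997 ordinary completions of 36 s each (an assumed
# typical value), plus the three suspect times from the question.
typical = [36] * 997
suspect = [5569, 594, 1]

mean_without = mean(typical)
mean_with = mean(typical + suspect)

print(f"mean without suspect points: {mean_without:.2f} s")
print(f"mean with suspect points:    {mean_with:.2f} s")
print(f"relative shift: {(mean_with - mean_without) / mean_without:.1%}")
```

Under these assumptions the mean moves from 36 seconds to about 42 seconds, so three points out of a thousand are still visible in the result when one of them is more than a hundred times the typical value.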

By: chairface

chairface — Fri, 13 Jan 2012 11:18:14 -0800

No. It's not. Because mean (average) has a mathematical definition. That definition is not the same as "typical" which is what we usually mean when we say average.

So take the median. Median represents typical much better than mean.

By: supercres

supercres — Fri, 13 Jan 2012 11:18:16 -0800

Can you plot a histogram? Does it look roughly like this? Then your data are normal, and an average is appropriate, outliers or no. Enough data points will negate them.

If it doesn't, I would report median. Or, hell, the whole histogram might give a better sense of the data for anyone who cares.

By: Kpele

Kpele — Fri, 13 Jan 2012 11:21:08 -0800

Nthing everyone that says to keep the outliers and use the median, not the average. Don't manipulate your data if possible, it just makes your conclusions look suspect when you do that.

By: ROU_Xenophobe

ROU_Xenophobe — Fri, 13 Jan 2012 11:40:34 -0800

I'll differ and say that what you really want to do is treat your data as a sample and find a confidence interval around your sample mean. Your outliers will then bump up the mean a little and inflate your standard error. You'd report it as "The average time is between LOW BOUND and HIGH BOUND."
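
One possible way to compute such an interval is sketched below. It uses a normal approximation, which is an assumption that is reasonable for a sample of around 1000 points but too narrow for very small samples, where a t-based interval would be the usual choice. The function name and the sample data are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean.

    Reasonable for large samples; for small samples a t-based
    interval would be wider.
    """
    n = len(values)
    centre = mean(values)
    standard_error = stdev(values) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error


# Hypothetical sample: 997 times cycling through 28..46 s, plus the three
# suspect values from the question.
sample = [28 + (i % 19) for i in range(997)] + [5569, 594, 1]
low, high = mean_confidence_interval(sample)
print(f"The average time is between {low:.1f} s and {high:.1f} s")
```

The outliers stay in the data here; their effect shows up as a wider interval rather than as a deleted point.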

By: kiltedtaco

kiltedtaco — Fri, 13 Jan 2012 11:52:31 -0800

This is a very important question, and one that I don't think can be dismissed by suggesting a switch to alternative statistics or any sort of hand waving.

The question you have to ask yourself is: What is your model for the data?

It's easier to give an example of this than to give a good definition. For example, I could imagine that the model you have in your mind for this data is that your task requires about a minute to complete, though some people will do it slightly faster and some will do it slightly slower, and that range of variation is about 20% of the time the task takes. A simple model, we haven't said anything about a functional form yet and that's fine, but this is enough to get started.

Now, some statistical 'purists' might argue that by picking a model you are imposing your subjective will on the data and making it fit a scenario which might not actually be true. This is wrong, because 1) there's no way to avoid having a model, because at the very least you will have constraints/ideas like "the duration of the task cannot be negative" and "the length of time cannot be longer than the age of the universe" and 2) reporting any value about a system (like a mean time) implies a model, you wouldn't report a mean if the task durations were uniformly distributed in decades of time from 10^-6 to 10^6 seconds, because it would be worthless.

Ok, back to your model: you say it takes about a minute +/- 20% or so. That seems to be ok, except for the time that it took one second and the time it took over an hour. Those times are clearly suspicious, and don't fit in with your model, which means you must update your model. This will depend on the exact circumstances of your measurement. For example, I could imagine that every so often, whatever software you're using for making the time measurements suffers a catastrophic failure, and forgets to record the task end until the next person starts a task or something. Or some fraction of the time the software thinks the task finished immediately and records 1.0 seconds. So your model is now something like 1 minute +/- 20%, plus a 5% chance of catastrophic timing failure.

Now, if someone asks you how long the task took, they don't care about the frequency of your catastrophic error, they want to know a value that's representative of the time the task actually took. So if you include a mean that's over an hour (because of the outlier), you're obviously doing them a disservice. If this is your model, reject the outliers. You know what causes the outliers, you know their effect on your data, it's not a problem. I'm not going to go into how to do it mathematically, but I want to say that conceptually, this is ok. Don't let people tell you that you have to report a crappy and unhelpful number just because of some idea about the sanctity of the data.

On the other hand, let's imagine another model. Say your task usually takes about a minute, but there's a hidden shortcut that lets some people finish super fast, and there's also a bottleneck that some people get stuck in and they take an hour. This would be perfectly consistent with the data, in fact, it would be just as consistent as the model I described above. But in this case, do not reject your outliers. They are telling you something! When your supervisor (or whoever) asks you how long the task takes, you want to give three values, the short, medium, and long times, because you have a multi-modal probability distribution on your hands. Rejecting the short and long this time would be a disservice, because they do actually contain information about the duration of the task.

Can we decide which model is correct from the data alone? In this case, probably not. But that's not a problem. Your model must be informed by your prior knowledge of the measurement system and the system under test. Again statistical 'purists' might argue that that's cheating, you are supposed to be making objective and impartial measurements, but the same arguments I gave above still apply. You have a mental model of the system whether you think you do or not, and ignoring what you know about the system in the interest of 'impartiality' does a disservice to anyone who reads the values you report.

The bottom line is that you want your analysis to fully represent your knowledge about the system you're trying to describe. Your understanding might include knowledge of various flaws in the measurements, or strange behaviors in the system under test itself, and you should use all of this information to report the best description of the system that you can. That goal and your prior knowledge of how the system works should determine whether or not outliers are something you want to include in your analysis or reject.

By: springload

springload — Fri, 13 Jan 2012 11:53:53 -0800

For a completely unknown system, it's risky to remove outliers, but most systems are not, and if it's obvious that something interfered with the experiment, you can discard that point. If, for example, I measure something at ultra-low temperature and one read-out out of a thousand shows a temperature equivalent to the surface of the sun, it was obviously a power glitch or some other fault in the instrument. I can infer that from the data and the physics alone, without knowing exactly why that data point turned bad. I am not hiding anything of any importance by removing the data point from the analysis.

Declare somewhere in text what you did. Just "obvious outliers were excluded from the analysis" or something to that effect is enough if they really are anomalies, you don't need to say which points or give their values. As a physicist, that's how I would do it in a paper. As long as I am sure others would agree that it's a reasonable point to remove, it's ok. If you have a sound reason external to your data (i.e. if you know what caused the anomalous times), all the better.

By: lathrop

lathrop — Fri, 13 Jan 2012 12:16:45 -0800

Winsorizing replaces extreme values by certain percentiles, e.g. by the 5% and 95% values, essentially tucking them in rather than trimming them off. The percent used can be determined by the data.
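
A minimal sketch of the idea, assuming a simple per-tail fraction rounded down to whole points (library implementations may handle the cut-offs differently). The `winsorize` function here is a hypothetical helper written for this example, not a library call.

```python
def winsorize(values, fraction):
    """Clamp the lowest and highest `fraction` of points to the nearest
    remaining value. `fraction` is per tail, e.g. 0.05 for the 5%/95% version."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * fraction)  # number of points pulled in on each side
    if 2 * k >= n:
        raise ValueError("fraction too large for this many points")
    low, high = ordered[k], ordered[n - 1 - k]
    return [min(max(v, low), high) for v in values]


times = [5569, 594, 46, 32, 38, 28, 1]  # seconds, from the question

# With only seven points, 5% per tail rounds down to zero points, so this
# illustration uses 15% per tail (one point on each side).
print(winsorize(times, 0.15))
```

Here the 1:32:49 value is pulled in to 594 seconds and the 0:01 value is pulled up to 28 seconds; the number of data points stays the same.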

By: no regrets, coyote

no regrets, coyote — Fri, 13 Jan 2012 12:18:05 -0800

If this is for something casual, sure go for it.

If it's even remotely academic, don't.

Even searching for "reasons" why the outliers exist and then removing them isn't a great thing to do.

What you're supposed to do is, before you collect the data, have a set of criteria for which data points are to be included in your dataset. Say those extra long data points are caused by someone getting distracted. You're allowed to say, before you start collecting data, "if the subject gets distracted (defined as ....) we will not include that datapoint". You shouldn't make these decisions after you've collected your data because that means you're allowing yourself to manipulate the data to get the result you want. That's not good science.

I agree with the people above who say that the best course of action is to show the whole histogram, state the median, and explain the outliers the best you can (if they aren't statistical).

By: springload

springload — Fri, 13 Jan 2012 12:21:16 -0800

Read kiltedtaco's answer carefully, because it's a very good one.

By: theodolite

theodolite — Fri, 13 Jan 2012 12:28:54 -0800

"Winsorizing replaces extreme values by certain percentiles, e.g. by the 5% and 95% values, essentially tucking them in rather than trimming them off. The percent used can be determined by the data."

Don't do this (or any other fancy data transformation) without having a very good idea of whether it's appropriate or not.

By: jjmoney

jjmoney — Fri, 13 Jan 2012 12:51:53 -0800

Academic statisticians would say no. As a more pragmatic economist, myself, I say yes.

The reason is it seems as though you are trying to get a number/time that is useful to you for an end. Now, if you gather a bunch of data points and one is a factor of 1000 larger than the others, what does this mean? Normally in a proper academic and logical positivist scientific setting you cannot say what this means and doing so breaks the integrity of the set. In real life for pragmatic purposes you can go "this guy was obviously AFK and his data isn't useful for me to try to figure out how long this program takes."

By: Nelson

Nelson — Fri, 13 Jan 2012 13:17:33 -0800

I came to write a bunch of stuff but kiltedtaco beat me to it. Long story short: remove outliers if you are confident they are not reflective of the actual thing you are measuring. There's no magic purity about "the real average", you're always making decisions about what to measure and characterize. The truth comes out of making the right decisions and explaining what they were.

By: dialetheia

dialetheia — Fri, 13 Jan 2012 13:19:41 -0800

In my statistics classes, we were taught that the salient issue is whether or not those data values are really a part of your intended study population. This is another way of saying what kiltedtaco said so well above: if those outlier values came from people who are not in the intended population (e.g. a five year old when you're intending to study adults, or someone who had a catastrophic equipment malfunction when you only intend to analyze performance when the equipment is working properly), then you can safely exclude them. If not, you have to include them and qualify your results accordingly.

Basically, you just have to do some serious thinking about how you want to define your intended study population. As others mention, it's generally better to do this before beginning to collect data if at all possible.

By: bonehead

bonehead — Fri, 13 Jan 2012 13:44:01 -0800

What is your distribution of values like?

If you can prove that your distribution is normal, then that makes this discussion a lot easier. You can use defined tests and statistical intervals to discuss your data. A normal distribution means that there's no information to be had in your model---it's pure random chance. Outliers are just statistical occurrences with no special meaning. They're best discussed by reporting confidence intervals or standard uncertainties.

If you find that your distribution is not normal, such as a skewed normal or multi-modal, not a good match for a single symmetrical randomized hump, you need to dig into the model as kiltedtaco suggests. That means that there is information in your model, and outliers are important pieces of information.

By: birdherder

birdherder — Fri, 13 Jan 2012 14:18:04 -0800

To me, I'd toss the outlier cases and report the mean of the valid cases for most marketing-related reporting I'd do. If I got that data set, I'd investigate why there'd be those gross outliers. I'd also toss all other data related to those cases. So instead of reporting n=1000, I'd make it n=997. I'd also have in backup why the cases were outliers and why I jettisoned them. But they'd have to be gross outliers like in your example. In a process that takes under a minute and that is the expected value and for some reason a case takes 10 minutes or 90 minutes, and you can't prove those cases are invalid you have to count them.

To me, reporting information has to be actionable or at the least telling the story. If I said the average net worth of the people at luncheon was $5 Billion, it is completely meaningless if it was at a homeless shelter where a billionaire showed up for a photo op. If you were trying to demonstrate the plight of the homeless, you'd not count the billionaire. If you're trying to demonstrate homeless people don't have it so bad, you do count the billionaire.

By: juliapangolin

juliapangolin — Fri, 13 Jan 2012 15:09:11 -0800

There are statistical tests that help you decide whether you can throw out an outlier. An example of this is the Q test.
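
For reference, the Q statistic (Dixon's Q) is the gap between the suspect value and its nearest neighbour, divided by the full range of the data. The sketch below only computes the statistic; the critical value to compare it against has to come from a published table for the sample size and confidence level, and is deliberately not included here. The test is generally meant for small samples of roughly normal data with a single suspect point, which may not hold for task timings.

```python
def dixon_q(values):
    """Return (Q for the smallest value, Q for the largest value).

    Q = gap to the nearest neighbour / full range of the data.
    """
    ordered = sorted(values)
    if len(ordered) < 3:
        raise ValueError("need at least three points")
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        raise ValueError("all values are identical")
    q_low = (ordered[1] - ordered[0]) / data_range
    q_high = (ordered[-1] - ordered[-2]) / data_range
    return q_low, q_high


times = [5569, 594, 46, 32, 38, 28, 1]  # seconds, from the question
q_low, q_high = dixon_q(times)
print(f"Q(smallest)={q_low:.3f}, Q(largest)={q_high:.3f}")
# Compare the suspect point's Q against a published critical value for
# n = len(times) at the chosen confidence level; reject only if Q exceeds it.
```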

By: hattifattener

hattifattener — Fri, 13 Jan 2012 17:25:09 -0800

"The reason is it seems as though you are trying to get a number/time that is useful to you for an end."

I start with the same point, but I reach the opposite conclusion. Unless you know that those outliers are not relevant to your end, leave them in, or increase your sample size until they become insignificant.

"In real life for pragmatic purposes you can go 'this guy was obviously AFK and his data isn't useful for me to try to figure out how long this program takes.'"

Well, why do you want to know how long the program takes? If you're trying to figure out how many people it will take to keep up with a workload, for example, then you need to account for AFK, computer crashed, interrupted by manager, etc. events; they're "outliers" but they still need to be included in your average-time-for-one-worker-to-do-one-thing. If your measurement shows that these kinds of "failures" happen, but you make plans based on the assumption that they never happen, then your plans will not work.

By: springload

springload — Fri, 13 Jan 2012 17:27:44 -0800

I don't agree that using the median value is the same as to "switch to alternative statistics or any sort of hand waving" though. You want to assign a representative value to your data set, and there is nothing super-special about the arithmetic mean. The median value is a good representative because of its ability to reject outliers, but you use it under the implicit assumption that the more uniform sequence of values in between the extremes is what holds valuable information, just like when rejecting points and then taking the mean. Since the median doesn't require you to set a rejection bar, it can remain credible under somewhat more difficult circumstances, when it wouldn't be clear what points to reject.

By: philipy

philipy — Fri, 13 Jan 2012 19:42:28 -0800

Since you are dealing with timing people using software, there could easily be issues such as...

- Some people never wanted to do the task in the first place. They clicked the wrong button, got into your task by accident and then bailed in the quickest way they could find.

- There are bugs in your tracking software such that it sometimes doesn't correctly register when they finished the task, and only stops the clock when they come back and do the task a second time.

So it's probably worth investigating to see if you can understand what happened with these outliers.

If you want to find the average just so you can tell users "this will take you about X minutes", I'd suggest not using the average at all, but maybe find the level at which 95% of users have completed and then tell them: "This takes most people less than X minutes".
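
The "level at which 95% of users have completed" is the 95th percentile. A minimal sketch using the nearest-rank method follows; other percentile definitions interpolate between points and give slightly different answers. The function name and the sample data are assumptions for illustration.

```python
def percentile_nearest_rank(values, percent):
    """Smallest observed value such that at least `percent`% of the
    observations are less than or equal to it (nearest-rank method).

    `percent` must be a whole number from 1 to 100.
    """
    if not values or not isinstance(percent, int) or not 0 < percent <= 100:
        raise ValueError("need data and an integer percent in 1..100")
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # integer ceil(percent * n / 100)
    return ordered[rank - 1]


# Hypothetical sample: 997 times cycling through 20..59 s, plus the three
# suspect values from the question.
sample = [20 + (i % 40) for i in range(997)] + [5569, 594, 1]
p95 = percentile_nearest_rank(sample, 95)
print(f"This takes most people less than {p95} seconds")
```

Because a percentile depends only on rank, a few extreme values at the top barely move it, which is what makes it usable without deciding which points to reject.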

Why you want to do all of this, and what you plan to do with the resulting statistics is important to whether it's reasonable to drop outliers completely.