How to Find The Average Commenter?
June 2, 2012 11:44 AM

Harmonic or Arithmetic Mean?: I'm trying to settle an argument with my wife - is using the arithmetic mean or harmonic mean more appropriate when finding the "average" number of, say, online comments made by a large number of people, where a significant portion of them make only one comment and a few people make the bulk of the comments?

I'm convinced it's the harmonic mean, which I think gives more weight to the number of posters, as opposed to the arithmetic mean, which I think gives more weight to the total number of posts. Help? Am I right? Would using the harmonic mean more closely reflect the habits of the average poster?
posted by salsamander to Education (13 answers total) 2 users marked this as a favorite
First, how do you propose actually calculating a harmonic mean in this case? Second, what kind of story do you want to tell using that number? In what ways is that story clearer or more informative than knowing the quartiles or seeing a histogram?
posted by Nomyte at 12:01 PM on June 2, 2012 [1 favorite]

Sure, if you had 5, 5, 5, 5, 10, 10, 20, 100 the arithmetic mean would be 20 and the harmonic mean would be about 7.5. Similarly, the median would be 7.5, midway between the two middle values of 5 and 10, which is another good way to get at the average user's post count. Perhaps you and your wife disagree on what you are trying to measure, though.
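For anyone who wants to check those numbers, here is a minimal sketch using Python's standard `statistics` module. The variable names are only illustrative.

```python
from statistics import mean, harmonic_mean, median

# Post counts from the example above.
posts = [5, 5, 5, 5, 10, 10, 20, 100]

arithmetic = mean(posts)          # sum of counts divided by number of posters
harmonic = harmonic_mean(posts)   # number of posters divided by sum of reciprocals
middle = median(posts)            # average of the two middle values, 5 and 10

print(arithmetic)          # 20
print(round(harmonic, 2))  # 7.55
print(middle)              # 7.5
```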
posted by michaelh at 12:04 PM on June 2, 2012

Compromise by using the geometric mean. It will be between your two values.
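A quick way to see the ordering, sketched with the sample counts from the earlier answer (5, 5, 5, 5, 10, 10, 20, 100). This assumes Python 3.8 or later, where `statistics.geometric_mean` is available, and it only works for strictly positive counts.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Sample counts quoted earlier in the thread; all values must be positive.
posts = [5, 5, 5, 5, 10, 10, 20, 100]

hm = harmonic_mean(posts)
gm = geometric_mean(posts)
am = mean(posts)

# For positive data the three means are always ordered this way.
print(hm <= gm <= am)  # True
```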
posted by Obscure Reference at 12:22 PM on June 2, 2012 [2 favorites]

I would use the median for something like this. Of course, it'd be even more descriptive if you also included the median absolute deviation, or used Tukey's five-number summary.
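A rough sketch of both summaries in Python follows. The function names are made up for illustration, the data is the sample quoted earlier in the thread, and the quartiles come from `statistics.quantiles`, which can differ slightly from Tukey's hinges on small samples.

```python
from statistics import median, quantiles

def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

def median_absolute_deviation(data):
    """Median of the absolute distances from the median."""
    center = median(data)
    return median([abs(x - center) for x in data])

posts = [5, 5, 5, 5, 10, 10, 20, 100]
summary = five_number_summary(posts)
mad = median_absolute_deviation(posts)

print(summary)
print(mad)  # 2.5
```

Note how the one poster with 100 comments barely moves either summary, apart from showing up as the maximum.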
posted by grouse at 12:27 PM on June 2, 2012

These are both valid summaries of a feature of the data. A geometric mean (or trimmed mean or median) tells you about 'most' people or a usual person, but an arithmetic mean tells you about the overall usage normalized by population size.
posted by a robot made out of meat at 12:38 PM on June 2, 2012

It appears that the harmonic mean is the correct method to use when you are considering rates, and then only when you are measuring your average by the numerator of the rate, because when you are working with fractions, you need to make sure the denominator stays the same.

I.e., the car example from Wikipedia. The speed is miles per hour. If your samples are in miles, then the harmonic mean is appropriate. If your samples are in time, then the arithmetic mean is appropriate. Extending their example, suppose you travel for an hour at 100 mph, and then an hour at 1 mph. You will have gone 101 miles in 2 hours, and your average speed is 101/2, or 50.5 mph. And it works out, because 50.5 * 2 hours is 101 miles.

However, change it to distance. You travel 100 miles at 100 mph, and then 100 miles at 1 mph. You've gone 200 miles and it took you 101 hours. If you try to take the arithmetic mean of the two speeds, you get (100 + 1)/2, or 50.5 mph. If you check your work by multiplying your calculated mph by your time, you end up with the wrong answer, 50.5 * 101 = 5100.5 miles. The faster speed got weighted too heavily.

If you instead take the harmonic mean, however, you get an average speed of 2/(1/100 + 1/1) = 200/101, or about 1.980198 mph, which checks out correctly: 1.980198 * 101 hours is 200 miles.
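Here is a small sketch of both versions of the trip in Python, in case it helps to see the check spelled out. The variable names are just for illustration.

```python
from statistics import mean, harmonic_mean

speeds = [100, 1]  # mph on each leg

# Equal time on each leg (1 hour each): the arithmetic mean is the true average.
equal_time_avg = mean(speeds)                 # 50.5 mph
equal_time_distance = equal_time_avg * 2      # 101 miles in 2 hours

# Equal distance on each leg (100 miles each): the harmonic mean is the true average.
equal_distance_avg = harmonic_mean(speeds)    # 200/101, about 1.980198 mph
total_hours = 100 / 100 + 100 / 1             # 101 hours
equal_distance_check = equal_distance_avg * total_hours  # about 200 miles

print(equal_time_distance)
print(round(equal_distance_check, 6))
```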

So I think the harmonic mean is only appropriate if you are measuring average comments per poster, but are counting by comments. I.e., each sample is an answer to the question "how many posters does it take to get to X comments?"

In other words, situations where your answer is expressed in x/y, and your "control" is x.
posted by gjc at 12:45 PM on June 2, 2012

Averages of any kind are generally inappropriate when a distribution is fat-tailed. The median might be better. Even better is stepping back and asking what question you want answered.

In the situation you describe, there is no "average poster". It's not like "average male height in the US", where everything is centered around some number with nicely decaying Gaussian curves in either direction. It's more like "average wealth", where you end up averaging Bill Gates, Warren Buffett, and the Walton family with a few thousand doctors and lawyers and a few million broke people. The average wealth is pretty uninformative because the distribution is fat-tailed with the mega-rich. What questions are you looking to answer?
posted by pmb at 12:52 PM on June 2, 2012 [1 favorite]

The idea was to find a way to account for outliers. The data set was made of comments tracked by unique user in a forum, which wound up looking like "1,1,1,1,15,20" (two users had high comment counts, the others had low comment counts), so I suppose we are looking for a "rate" of postings by user.
posted by salsamander at 12:55 PM on June 2, 2012

The idea was to find a way to account for outliers.

Neither of these is a great way of doing that.

I suppose we are looking for a "rate" of postings by user

That just sounds like post hoc justification.
posted by grouse at 2:06 PM on June 2, 2012 [1 favorite]

Just to emphasize that the harmonic mean is not immune to outliers, I'll point out that if anyone in your data set made 0 comments, the harmonic mean will be 0, no matter what other people did.
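A minimal illustration, using hypothetical counts (the poster's sample with one lurker added). Python's `statistics.harmonic_mean` follows the convention of returning 0 when any value is zero; other tools may raise an error or return something else instead.

```python
from statistics import harmonic_mean, mean, median

# Hypothetical data: one person who never commented, plus some who did.
counts = [0, 1, 1, 1, 15, 20]

hm = harmonic_mean(counts)  # collapses to 0 because of the single zero
am = mean(counts)
med = median(counts)

print(hm)   # 0
print(med)  # 1.0
```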

There's no right answer to your question. It's like asking "Should my profile picture be shot from the front or the side?" Both are representations which capture some features of your face and hide others. Which features are you interested in?

If someone held me at gunpoint and demanded that I describe your dataset with a single number, I suppose I'd go with median. But if what you're trying to convey is "a significant portion only made one post and the bulk of the comments came from a few users," I would report the median number of posts, and the number N such that the top N% of the posters made half the comments.
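One possible way to compute that pair of numbers, sketched in Python. The function name `top_share_for_half` is invented for this example, and the data is the sample the poster gave above (1, 1, 1, 1, 15, 20).

```python
from statistics import median

def top_share_for_half(counts):
    """Smallest percentage of posters, most active first, whose comments
    add up to at least half of all comments."""
    if not counts:
        raise ValueError("counts must be non-empty")
    ranked = sorted(counts, reverse=True)
    half = sum(ranked) / 2
    running = 0
    for i, c in enumerate(ranked, start=1):
        running += c
        if running >= half:
            return 100 * i / len(ranked)

counts = [1, 1, 1, 1, 15, 20]
typical = median(counts)              # 1.0
share = top_share_for_half(counts)    # the top poster alone covers half

print(typical)
print(round(share, 1))  # 16.7
```

With so few posters the percentage is coarse, but on a real forum it gives a quick read on how concentrated the commenting is.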
posted by escabeche at 2:28 PM on June 2, 2012 [3 favorites]

Your distribution is skewed. Use a median.
posted by dfriedman at 2:35 PM on June 2, 2012 [1 favorite]

Imagine a total echo-chamber of a forum. Ninety-nine users have never posted at all, and one has posted n times, where n may be in the hundreds or thousands. Intuitively speaking, how many times has "the typical user" on this forum posted?

My own intuition here is that "zero" is the only reasonable answer. Any metric that leaves you saying something like "the typical user here has posted once or twice" is an inappropriate and misleading metric.

Well, if you use any sort of mean as your metric, there is a value of n for which you will be stuck saying "the typical user here has posted once or twice."

You want to use the median, or even possibly the mode.
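To make the echo-chamber example concrete, here is a short sketch with the arithmetic mean, assuming n = 150 (any n between 100 and 200 gives a mean between one and two).

```python
from statistics import mean, median, mode

# Ninety-nine users who never posted, plus one who posted n times.
n = 150  # assumed value for illustration
counts = [0] * 99 + [n]

avg = mean(counts)      # 1.5, "the typical user has posted once or twice"
mid = median(counts)    # 0
most = mode(counts)     # 0

print(avg, mid, most)
```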
posted by nebulawindphone at 3:25 PM on June 2, 2012

Your distribution is skewed. Use a median.

I would expand that to say: "Use the mean and the median. Together they give you a rough idea of the distribution and skewness."

Whenever I am given a statistical result, mean or median, the first thing I ask is the other.
posted by JackFlash at 4:20 PM on June 2, 2012 [1 favorite]
