can you name this stats theorem?
August 8, 2010 5:58 AM

There is an urban legend in the baseball statistics community I would like to find broader perspective on. Ideally a statistics journal cite would be great if you know of one. It has to do with sample size and point estimation.

Short example word problem: after two weeks of the season Alex Rodriguez is off to a smoking hot start and he is batting .470. At this point the league as a whole is batting .260. The best prediction of Alex's season ending batting average is .260. Two weeks is not a large enough sample size for the purpose of this estimate.

The urban legend part. The baseball statistics fans say this is a famous statistical theorem and the fellow that derived it used baseball statistics in his argument. I am pretty sure that cannot be right, but it may be there is a subtler, similar statistical theorem which did rely on its first example in the literature on baseball players' averages.

What I am very curious to find out is what such a theorem might be and where it first was published in professional statistics journals in association with a baseball example, if it does truly exist?

Thank you very much!
posted by bukvich to Science & Nature (7 answers total) 3 users marked this as a favorite
 
It's not clear exactly what you are asking here but, in the absence of other information, the mean is the best estimate of an individual score. However, this is a weird example because the best prediction for Alex's season-ending batting average is actually his lifetime batting average (i.e., the individual mean). It would only be the league mean if we had no information at all on his previous performance. You're unlikely to find an article in a journal about this but you will certainly find this information in a basic statistics textbook. To my knowledge, this was settled long, long before baseball fans started being stats geeks.
posted by proj at 6:11 AM on August 8, 2010


That's not quite true. The best estimate would be somewhat better than .260, for several reasons:
1. Alex Rodriguez is known to be a better-than-average player. So we guess that for the rest of the season he will do better than average -- maybe not .470, but whatever his historic average is.
2. You can't take away the good luck that he had in the first couple weeks!

Still, some weaker version of this is true.
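
To make the two points above concrete, here is a minimal sketch of one common way to blend a hot start with prior knowledge: treat the prior as a number of "pseudo at-bats" at the player's historic average (a Beta-binomial style estimate). Every number below is hypothetical and chosen only for illustration, including the career average, the prior weight, and the remaining at-bats. The function names are made up for this sketch.

```python
def shrunk_average(hits, at_bats, prior_mean, prior_weight):
    """Estimate underlying ability by pulling the observed rate toward prior_mean.

    prior_weight acts like a count of pseudo at-bats observed at prior_mean.
    """
    return (hits + prior_mean * prior_weight) / (at_bats + prior_weight)


def projected_season_average(hits, at_bats, ability, remaining_at_bats):
    """Season-ending average: hits already banked plus expected future hits."""
    total_hits = hits + ability * remaining_at_bats
    return total_hits / (at_bats + remaining_at_bats)


# Hypothetical inputs, not real player data.
early_hits, early_at_bats = 24, 51   # roughly a .470 start
career_mean = 0.300                  # assumed historic average (point 1)
prior_weight = 300                   # assumed strength of that prior
remaining_at_bats = 550              # assumed at-bats left in the season

ability = shrunk_average(early_hits, early_at_bats, career_mean, prior_weight)
season = projected_season_average(early_hits, early_at_bats, ability, remaining_at_bats)

print(f"estimated ability going forward: {ability:.3f}")
print(f"projected season-ending average: {season:.3f}")
```

The projected season-ending figure sits above the forward-looking estimate because the early hits are already banked (point 2), but well below the hot start itself.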
posted by madcaptenor at 6:11 AM on August 8, 2010


The term for what it sounds like you are describing is "regression to the mean." It's talked about a lot in baseball statistics discussions, but it didn't originate there.
posted by monju_bosatsu at 6:25 AM on August 8, 2010 [1 favorite]


Response by poster: monju_b I agree totally with your point. Self-taught baseball stats guys have a tendency to think that Bill James invented regression to the mean, not Galton.

But often these things have a kernel of truth. I am wondering if there is a statistical gem of slightly less profound power than mean regression which did have its birth in baseball statistical analysis; that there was at least one instance of the field of Statistics being marked in a substantial way by the work of a full time statistician, part time baseball fan.

(Also I agree that my oversimplified question phrasing of the Alex Rodriguez example is wrong. I am not a statistician!)
posted by bukvich at 9:07 AM on August 8, 2010


Best answer: "I am wondering if there is a statistical gem of slightly less profound power than mean regression which did have its birth in baseball statistical analysis; that there was at least one instance of the field of Statistics being marked in a substantial way by the work of a full time statistician, part time baseball fan."

I am convinced you are referring to Stein's Paradox (the Wikipedia article does a terrible job of showing what could be paradoxical, incidentally, but see below for a better link).

I find it difficult to state Stein's paradox, and I hope someone will come along who can do a better job (and correct the errors I am likely to make). It has to do with the fact that, contrary to all intuition, when several (three or more) unobservable quantities are estimated at once, there are estimators better than each one's own average, and these estimators can improve on the average by using observations that no one would imagine could be related to the quantities in question.

In the case of the baseball stats you use, for example, the improved estimator might make use of an observation of how close in percentage terms the Russian wheat harvest in a given year came to the maximum wheat harvest ever recorded.

In 1977, Bradley Efron and Charles Morris wrote an excellent article for Scientific American titled "Stein's Paradox in Statistics" (click on the first link in the reference section I've linked for a free PDF). In it they use an extended example involving batting averages in major league baseball, which has so many points of resemblance to your example that I would be utterly stunned if it were coincidental, only they use Roberto Clemente as a central figure rather than Alex Rodriguez.
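
For anyone who wants to see the effect rather than take it on faith, here is a rough simulation sketch of a positive-part James-Stein estimator that shrinks each player's early average toward the group's grand mean. It is written in the spirit of the batting-average example, not as a reproduction of the article's calculation. The data are simulated, and the league mean, the spread of true abilities, the player count, and the at-bat count are all assumptions. It also only covers the case where the quantities are related (players in one league), not the stranger unrelated-observations case.

```python
import random


def james_stein(observed, at_bats):
    """Shrink each observed average toward the grand mean (positive-part James-Stein)."""
    k = len(observed)
    grand_mean = sum(observed) / k
    # Assumption: one shared sampling variance, the binomial variance at the grand mean.
    sigma2 = grand_mean * (1 - grand_mean) / at_bats
    spread = sum((y - grand_mean) ** 2 for y in observed)
    if spread == 0:
        return list(observed)  # all averages identical, nothing to shrink
    c = max(0.0, 1 - (k - 3) * sigma2 / spread)  # shrinkage factor in [0, 1)
    return [grand_mean + c * (y - grand_mean) for y in observed]


def simulate(trials=200, players=18, at_bats=45, seed=0):
    """Average total squared error of raw averages vs. shrunken estimates."""
    rng = random.Random(seed)
    raw_err = js_err = 0.0
    for _ in range(trials):
        # Assumed true abilities: normal around .265 with sd .03, clamped to [0, 1].
        truth = [min(max(rng.gauss(0.265, 0.03), 0.0), 1.0) for _ in range(players)]
        observed = [sum(rng.random() < p for _ in range(at_bats)) / at_bats for p in truth]
        shrunk = james_stein(observed, at_bats)
        raw_err += sum((y - p) ** 2 for y, p in zip(observed, truth))
        js_err += sum((z - p) ** 2 for z, p in zip(shrunk, truth))
    return raw_err / trials, js_err / trials


raw, js = simulate()
print(f"mean total squared error, raw averages: {raw:.4f}")
print(f"mean total squared error, James-Stein:  {js:.4f}")
```

Under these assumptions the shrunken estimates should come out with a noticeably smaller total squared error than the raw early-season averages, which is the counterintuitive part.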

I think this article must have given rise to that urban legend in the baseball stats community, even though I would be amazed if that were the case.
posted by jamjam at 1:41 PM on August 8, 2010 [3 favorites]


I'm not sure what the theorem is, but I believe the discussion has its roots in an article which the evolutionary biologist Stephen Jay Gould published in 1986 in Discover magazine, entitled "Why no one hits .400 any more". Gould's basic argument is that rising overall standards of play over time (for both pitchers and batters) lead to a decrease in variance for statistics such as batting average, which makes standout performances (like someone hitting .400 or more for a whole season) increasingly unlikely.
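
As a back-of-the-envelope illustration of the variance argument, the sketch below holds the league mean fixed and shows how the chance of a season average above .400 falls as the spread shrinks. It assumes season averages are roughly normally distributed, and the mean and standard deviations used are hypothetical, not historical figures.

```python
import math


def prob_above(threshold, mean, sd):
    """P(X > threshold) for a normal distribution with the given mean and sd."""
    z = (threshold - mean) / (sd * math.sqrt(2))
    return 0.5 * math.erfc(z)


league_mean = 0.260  # assumed to stay constant over time
for sd in (0.040, 0.035, 0.030, 0.025):  # hypothetical, decreasing spread
    p = prob_above(0.400, league_mean, sd)
    print(f"sd={sd:.3f}  P(avg > .400) = {p:.2e}")
```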

I can't find Gould's article online, but there are discussions of it here and here.
posted by muhonnin at 5:43 AM on August 9, 2010


Response by poster: jamjam you have answer askmetafilter wizard powers. There is little doubt in my mind you have identified the published source of my vague datum. And very helpful too, as I am in the process of writing a long blog post which now will be referring to the Scientific American article.

If we ever meet in real life I shall offer to buy you a beer!
posted by bukvich at 8:17 AM on August 9, 2010

