# Chance of event with small sample size, based on larger related sample?

March 20, 2014 7:20 AM

Can/how can one improve the estimate for a chance of an event with a small historical sample size by utilizing the chance of a related event with a large historical sample size? Example and half-assed guess inside.

A baseball-related example (for the purposes of this question, please forget about complicating factors like lefty/righty splits, home/away splits, the fact that a particular player might be better or worse now than he was in the past, etc.):

Joe has had 1000 at bats. He has gotten a hit in 270 of those 1000 at bats.

Of those 1000 at bats, 10 were against the pitcher Fred. In those 10 at bats against Fred, Joe got six hits.

Clearly we can say "Joe is a .600 hitter against Fred". But also clearly, that doesn't really have any meaningful predictive power for Joe's future at bats against Fred. If we want to guess what the chance of Joe getting a hit off of Fred is, 27% is almost certainly a much better guess than 60%.

But can we use *both* pieces of information to get a guess that's better than "27%"?

I have a half-assed guess, which I'll describe momentarily, but it occurs to me that this is probably a problem which has been thought about rigorously by mathematicians. So does anyone know if there's a "real" answer to this problem?

My half-assed guess is something along these lines:

Joe has 10 at bats against Fred, and 6 hits in them. But Joe has 1000 at bats total (with 270 hits). Let's assume that if Joe had had 1000 at bats against Fred, 10 of them would have gone as they did, and the other 990 would have been as if against an average pitcher. So Joe would have gotten:

6 + 990 * 270 / 1000

= 6 + 267.3

= 273.3

So we guess that in his upcoming at bat against Fred, Joe has a 27.33% chance of getting a hit.
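Written as a formula, the guess above is a weighted average of the two observed rates, with the Fred-specific rate weighted by its share of the 1000 at bats. This is only a restatement of the arithmetic in the question, not an endorsement of those particular weights:

$$
\hat{p} = \frac{6 + 990 \cdot 0.270}{1000} = \frac{10}{1000} \cdot 0.600 + \frac{990}{1000} \cdot 0.270 = 0.2733
$$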

(Exactly how they work would vary from model to model. Most obviously the internal guts and optimizers of a Bayesian MCMC version will be different from a maximum-likelihood version.)

posted by ROU_Xenophobe at 7:35 AM on March 20, 2014

Response by poster: Thanks, but perhaps I should be more clear: When I asked "So does anyone know if there's a "real" answer to this problem", I didn't mean "yes/no", I meant "how can I solve problems like this". And I don't mean baseball specifically.

posted by Flunkie at 7:36 AM on March 20, 2014

This is called a "shrinkage estimator" (or Stein, or Empirical Bayes, or you could do actual Bayes), and it so happens that the canonical example is actually about batting averages. So yes, it exists. How you should do it often depends on the context (in your example one would shrink with that player's other data instead of other players' data).
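A minimal sketch of the shrinkage idea, assuming the simplest form: blend the small-sample rate with the large-sample rate, giving the large-sample rate a weight expressed as a pseudo-count of at bats. The function name `shrunk_rate` and the parameter `prior_strength` are hypothetical, introduced here for illustration; the thread does not say what the weight should be.

```python
def shrunk_rate(hits, at_bats, prior_mean, prior_strength):
    """Blend an observed rate with a prior rate.

    prior_strength is a hypothetical pseudo-count: how many at bats' worth
    of weight the overall average receives. Choosing it is the hard part.
    """
    return (hits + prior_mean * prior_strength) / (at_bats + prior_strength)


overall = 270 / 1000  # Joe's overall batting average

# Joe vs. Fred (6 hits in 10 at bats) under several assumed prior strengths.
for k in (0, 50, 200, 990):
    print(k, round(shrunk_rate(6, 10, overall, k), 4))
```

With `prior_strength = 0` the estimate is just the raw .600, and with `prior_strength = 990` it matches the 27.33% guess from the question. Values in between pull the estimate part of the way toward .270.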

posted by a robot made out of meat at 7:36 AM on March 20, 2014

OK, if it's not about baseball are you interested in something that's an average? Is it a small or large dataset? Do you have experience with any particular statistical package? Does it need to be reliable enough for some kind of work product or is it an intellectual itch? Stats people think of this like a medical question, "I have a weird mole, what should I do?" It depends on the particulars and often you have to look at it to really know.

posted by a robot made out of meat at 7:42 AM on March 20, 2014

Response by poster: Regarding the particular thing I'm interested in at the moment: Yes, it's an average. The larger thing is hundreds of samples, and the smaller is anywhere from one to virtually the size of the larger. By "statistical package", I'm not sure what you mean, so the answer is probably no, and if you mean something like "a stats plugin for some program like Mathematica or whatever", the answer is definitely no. It is more of an intellectual itch and does not need to be reliable enough to, uh, rely on. I have no weird moles that I know of.

Generally speaking the baseball example I gave seems to me very similar to what I actually want to calculate, as long as you ignore things like lefty/righty splits, as I mentioned above. I don't really want to discuss the particulars of the thing I actually want to calculate, so if necessary please assume the baseball example (with simplifying assumptions such as "no such thing as lefty/righty splits") is what I want.

The shrinkage estimator, Stein, Empirical Bayes, and "canonical example" stuff has given me a lot to look through that at first glance I think will be helpful - thanks.

posted by Flunkie at 7:57 AM on March 20, 2014

Response by poster: But I want to be clear: By "what I want", I mean that I want to know how to do this. I definitely do not mean anything like "Please show me a web page listing the estimated chances of every individual major league hitter getting a hit against any individual major league pitcher."

posted by Flunkie at 8:03 AM on March 20, 2014

In the baseball example, do you observe Joe's average against each pitcher or, as you've written it, just Fred and All Others? Are the data rates like the baseball example or do you get a list of numbers for each condition?

posted by a robot made out of meat at 8:31 AM on March 20, 2014

Response by poster: I observe the averages of every specific pair of individuals, including Joe vs. Fred, Joe vs. Pat, Ernest vs. Fred, Ernest vs. Pat, etc.

I'm not sure what you mean by "Are the data rates like the example or do you get a list of numbers for each condition".

posted by Flunkie at 8:33 AM on March 20, 2014

Response by poster: To be more specific, I observe more than merely the averages of every specific pair of individuals; I observe the number of hits and the number of at bats of every specific pair of individuals. For example I know that Joe has 6 hits in 10 at bats against Fred, but I also know that Ernest has 3 hits in 12 at bats against Pat, and that Ernest doesn't have any at bats at all against Fred.

posted by Flunkie at 8:35 AM on March 20, 2014

Joe vs Fred is 9 successes of 20 trials, or Joe vs Fred is a set like [1, -2, 0.25, 0.5].

posted by a robot made out of meat at 8:36 AM on March 20, 2014

And the goal of inference is *all future pairwise comparisons* or Fred on average or a particular matchup?

posted by a robot made out of meat at 8:41 AM on March 20, 2014

Response by poster: Joe vs. Fred is 9 successes out of 20 trials.

The goal of inference is all future pairwise comparisons.

posted by Flunkie at 9:15 AM on March 20, 2014

In that case, like ROU said I'd use glmer in the R package lme4. It has examples for binomial outcomes and prediction, and you want a random effect for pitchers and batters.
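As a hedged sketch of what such a model might look like (the symbols are notation introduced here, not from the thread): with $y_{ij}$ hits in $n_{ij}$ at bats for batter $i$ against pitcher $j$, a binomial model with crossed random effects could be written as

$$
y_{ij} \sim \mathrm{Binomial}(n_{ij},\, p_{ij}), \qquad
\operatorname{logit}(p_{ij}) = \mu + b_i + c_j, \qquad
b_i \sim \mathcal{N}(0, \sigma_b^2), \quad c_j \sim \mathcal{N}(0, \sigma_c^2)
$$

In lme4's formula syntax this would correspond to something like `cbind(hits, at_bats - hits) ~ 1 + (1 | batter) + (1 | pitcher)` with `family = binomial`, assuming a data frame with columns named `hits`, `at_bats`, `batter`, and `pitcher`. Note that this form has no pair-specific term, so a prediction for Joe vs. Fred comes from Joe's and Fred's estimated effects rather than from their 10 shared at bats alone.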

posted by a robot made out of meat at 9:33 AM on March 20, 2014

To build on that:

A statistical package is just software intended for statistical analysis, like R, Stata, or SPSS, as distinct from a spreadsheet like Excel or something closer to a programming environment than an application, like Mathematica. R is free. Your workflow with R would look something like:

* Download and install R and RStudio. Both are free. RStudio does a good job of taming the user-hostile bits of R.
* From RStudio, download and install the lme4 package.
* Read the documentation and any of about a zillion web tutorials on the various commands in lme4.
* Save your data as a CSV in a format that glmer will be able to digest as an input object with little/no further manipulation.
* Run your model(s).
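For the CSV step, one layout that usually needs little further manipulation is "long" format: one row per batter/pitcher pair, with counts of hits and at bats. The sketch below writes such a file body using the pairs mentioned earlier in the thread; the column names are assumptions chosen for illustration, and pairs with no at bats (such as Ernest vs. Fred) are simply left out.

```python
import csv
import io

# Hypothetical column names; one row per batter/pitcher pair.
FIELDS = ["batter", "pitcher", "hits", "at_bats"]

rows = [
    {"batter": "Joe", "pitcher": "Fred", "hits": 6, "at_bats": 10},
    {"batter": "Ernest", "pitcher": "Pat", "hits": 3, "at_bats": 12},
]


def to_csv(records):
    """Return the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

Writing `csv_text` to a file gives something R's `read.csv` can load directly as a data frame.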

posted by ROU_Xenophobe at 10:03 AM on March 20, 2014 [1 favorite]

You could take an approach like this. Suppose there are 100 pitchers. The effectiveness of each pitcher against Joe, i.e. Joe's long-term batting average against that pitcher, is a random variable with a mean of .270 and some variance. You can use the experience of 6/10 to estimate the number you want.

I think you are going to need some idea of the variance between pitchers no matter how you proceed.
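To illustrate why the between-pitcher variance matters, here is a rough sketch that assumes Joe's true averages against individual pitchers follow a Beta distribution with mean .270. The between-pitcher standard deviation of 0.03 is a made-up number for illustration only, and the function names are hypothetical.

```python
def beta_prior_from_moments(mean, variance):
    """Beta(a, b) parameters matching a given mean and variance."""
    if not 0 < variance < mean * (1 - mean):
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    total = mean * (1 - mean) / variance - 1  # a + b
    return mean * total, (1 - mean) * total


def posterior_mean(hits, at_bats, a, b):
    """Posterior mean of the hit probability under a Beta(a, b) prior."""
    return (a + hits) / (a + b + at_bats)


between_pitcher_sd = 0.03  # hypothetical; in practice estimated from data
a, b = beta_prior_from_moments(0.270, between_pitcher_sd ** 2)
estimate = posterior_mean(6, 10, a, b)
print(round(a + b, 1), round(estimate, 4))
```

A smaller assumed variance makes `a + b` larger, so the 6-for-10 record moves the estimate less; a larger variance lets it move more.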

posted by SemiSalt at 5:55 PM on March 20, 2014

This thread is closed to new comments.

posted by ROU_Xenophobe at 7:33 AM on March 20, 2014