# Probability question (in need of code)

August 13, 2008 10:04 PM Subscribe

It's late, I'm tired, and I have a probability and coding question that's fairly simple.

Say I have 5 buckets (A,B,C,D,E), with different colored balls in them. The probability of removing a red ball is different for each bucket:

A: 1/3

B: 1/4

C: 1/16

D: 1/6

E: 1/9

If I draw 1 ball from each bucket, what is the probability (P) of drawing at least N red balls?

Doing it for small numbers of buckets is easy enough, but I have thousands of buckets here. Given that I know the probabilities for each bucket, what's the easiest way to calculate P? I suspect that there's an R function that makes this a breeze, but I'm having trouble tracking it down.

I can do basic operations in R, and I'm also open to functions or pseudo code from your favorite language. Ruby and Perl preferred, but I can use others if they get the job done.

For good measure, here's a preemptive "This isn't homework-filter"

Say I have 5 buckets (A,B,C,D,E), with different colored balls in them. The probability of removing a red ball is different for each bucket:

A: 1/3

B: 1/4

C: 1/16

D: 1/6

E: 1/9

If I draw 1 ball from each bucket, what is the probability (P) of drawing at least N red balls?

Doing it for small numbers of buckets is easy enough, but I have thousands of buckets here. Given that I know the probabilities for each bucket, what's the easiest way to calculate P? I suspect that there's an R function that makes this a breeze, but I'm having trouble tracking it down.

I can do basic operations in R, and I'm also open to functions or pseudo code from your favorite language. Ruby and Perl preferred, but I can use others if they get the job done.

For good measure, here's a preemptive "This isn't homework-filter"

A minor clarification, just for the sake of pedantry:

The probability of

posted by mbrubeck at 10:49 PM on August 13, 2008

The probability of

**at least**N red balls is equal to the probability of 1 red ball from the first bucket and**at least**N-1 from the rest, plus the probability of 0 red balls from the first bucket and**at least**N from the rest.posted by mbrubeck at 10:49 PM on August 13, 2008

Best answer: The process you're describing should asymptotically approach a Poisson distribution with lambda equal to n times the average of all your p (equivalent to the sum of all p). You can use 1 - ppois(x, lambda) in R to get a pretty good approximation of the probability of drawing more than x balls.

posted by shadow vector at 11:02 PM on August 13, 2008 [2 favorites]

posted by shadow vector at 11:02 PM on August 13, 2008 [2 favorites]

Best answer: There is no exact named result for the sum of non-iid bernullis. If p is a list of probabilities and you are sampling once from each bucket:

Re(fft(apply(mvfft(rbind(1-p, p, 0 * p[-1] %o% p)), 1, prod), inverse = TRUE) / (length(p) + 1))

will give you the exact distribution in R (I think). If you have THOUSANDS, you're probably better off using asymptotic methods or simulating it, as others above suggested.

posted by a robot made out of meat at 5:03 AM on August 14, 2008

Re(fft(apply(mvfft(rbind(1-p, p, 0 * p[-1] %o% p)), 1, prod), inverse = TRUE) / (length(p) + 1))

will give you the exact distribution in R (I think). If you have THOUSANDS, you're probably better off using asymptotic methods or simulating it, as others above suggested.

posted by a robot made out of meat at 5:03 AM on August 14, 2008

Oh, the output of that command is the density, ie if you did x={command} then x[i] is the probability of exactly i balls. If you wanted the probability of at least i balls, that would be 1-sum(x[1:(i-1)]). With 5000 random probabilities, on my 2002 POS machine that took ten minutes (because of swapping -needed about 1.5 GB memory) , and a single modern processor on my server did the same job in about 50 seconds. Still, there are obvious machine-precision artifacts so it's not exact anymore. For example, it says that the minimum probability is about -1x10^-16, when it should be 10^-1500.

posted by a robot made out of meat at 7:54 AM on August 14, 2008

posted by a robot made out of meat at 7:54 AM on August 14, 2008

extending shadow vector, that will also tend to a gaussian (ie normal dist) with mean np and variance npq, where p is the sum of probabilities and q=1-p.

posted by not sure this is a good idea at 8:53 AM on August 14, 2008

posted by not sure this is a good idea at 8:53 AM on August 14, 2008

If you want to simulate it, put all your probabilities in a vector called p_i and do this:

Warning: with several thousand elements in p_i this may take some non-trivial computer time.

posted by shadow vector at 11:49 AM on August 14, 2008

totals <> successes <> for (h in 1:1000) { for (k in 1:(length(p_i))) { successes[k] <> } totals[h] <> }The totals vector will now contain the number of balls drawn in 1000 simulated runs. You can increase this number by making h run for more iterations. Totals will approach the distribution of interest, and for the purposes of quantiles and the like you can pull them directly from the simulated data and they'll be very close to accurate.

Warning: with several thousand elements in p_i this may take some non-trivial computer time.

posted by shadow vector at 11:49 AM on August 14, 2008

Ah, whoops, HTML ate my less thans. Here:

posted by shadow vector at 11:52 AM on August 14, 2008

totals <- NULL successes <- NULL for (h in 1:1000) { for (k in 1:(length(p_i))) { successes[k] <- rbinom(1,1,p_i[k]) } totals[h] <- sum(successes) }

posted by shadow vector at 11:52 AM on August 14, 2008

Response by poster: Thanks, guys. I know how to use simulations to approximate answers, but it occurred to me that I had enough information to calculate exact probabilities here.

Should my data set get too much larger, the distribution info may also come in handy. Thanks again, all.

posted by chrisamiller at 7:39 PM on August 14, 2008

**a robot made of meat**'s function does that quickly, which is exactly what I was looking for.Should my data set get too much larger, the distribution info may also come in handy. Thanks again, all.

posted by chrisamiller at 7:39 PM on August 14, 2008

Response by poster: Oh, and yes, I know they aren't really

posted by chrisamiller at 7:41 PM on August 14, 2008

*exact*probabilities, due to machine rounding, but I only need a few places of precision.posted by chrisamiller at 7:41 PM on August 14, 2008

This thread is closed to new comments.

I think this function is O(2^N), and it's recursive, so you might need to optimize it somewhat if you have literally

thousandsof buckets.posted by mbrubeck at 10:28 PM on August 13, 2008