Probability question (in need of code)August 13, 2008 10:04 PM   Subscribe

It's late, I'm tired, and I have a probability and coding question that's fairly simple.

Say I have 5 buckets (A,B,C,D,E), with different colored balls in them. The probability of removing a red ball is different for each bucket:

A: 1/3
B: 1/4
C: 1/16
D: 1/6
E: 1/9

If I draw 1 ball from each bucket, what is the probability (P) of drawing at least N red balls?

Doing it for small numbers of buckets is easy enough, but I have thousands of buckets here. Given that I know the probabilities for each bucket, what's the easiest way to calculate P? I suspect that there's an R function that makes this a breeze, but I'm having trouble tracking it down.

I can do basic operations in R, and I'm also open to functions or pseudo code from your favorite language. Ruby and Perl preferred, but I can use others if they get the job done.

For good measure, here's a preemptive "This isn't homework-filter"
posted by chrisamiller to Computers & Internet (10 answers total) 1 user marked this as a favorite

Best answer: Here's a simple recursive function in Ruby, which passes some simple test cases:

def p(buckets, n)
if n == 0
return 1
end
if buckets.empty?
return 0
end
p0 = buckets.shift
return (p0     * p(buckets, n-1)) +
((1-p0) * p(buckets, n  ))
end
In English: The probability of N red balls is equal to the probability of 1 red ball from the first bucket and N-1 from the rest, plus the probability of 0 red balls from the first bucket and N from the rest.

I think this function is O(2^N), and it's recursive, so you might need to optimize it somewhat if you have literally thousands of buckets.
posted by mbrubeck at 10:28 PM on August 13, 2008

A minor clarification, just for the sake of pedantry:

The probability of at least N red balls is equal to the probability of 1 red ball from the first bucket and at least N-1 from the rest, plus the probability of 0 red balls from the first bucket and at least N from the rest.
posted by mbrubeck at 10:49 PM on August 13, 2008

Best answer: The process you're describing should asymptotically approach a Poisson distribution with lambda equal to n times the average of all your p (equivalent to the sum of all p). You can use 1 - ppois(x, lambda) in R to get a pretty good approximation of the probability of drawing more than x balls.
posted by shadow vector at 11:02 PM on August 13, 2008 [2 favorites]

Best answer: There is no exact named result for the sum of non-iid bernullis. If p is a list of probabilities and you are sampling once from each bucket:

Re(fft(apply(mvfft(rbind(1-p, p, 0 * p[-1] %o% p)), 1, prod), inverse = TRUE) / (length(p) + 1))

will give you the exact distribution in R (I think). If you have THOUSANDS, you're probably better off using asymptotic methods or simulating it, as others above suggested.
posted by a robot made out of meat at 5:03 AM on August 14, 2008

Oh, the output of that command is the density, ie if you did x={command} then x[i] is the probability of exactly i balls. If you wanted the probability of at least i balls, that would be 1-sum(x[1:(i-1)]). With 5000 random probabilities, on my 2002 POS machine that took ten minutes (because of swapping -needed about 1.5 GB memory) , and a single modern processor on my server did the same job in about 50 seconds. Still, there are obvious machine-precision artifacts so it's not exact anymore. For example, it says that the minimum probability is about -1x10^-16, when it should be 10^-1500.
posted by a robot made out of meat at 7:54 AM on August 14, 2008

extending shadow vector, that will also tend to a gaussian (ie normal dist) with mean np and variance npq, where p is the sum of probabilities and q=1-p.
posted by not sure this is a good idea at 8:53 AM on August 14, 2008

If you want to simulate it, put all your probabilities in a vector called p_i and do this:
totals <>
successes <>
for (h in 1:1000) {
for (k in 1:(length(p_i))) {
successes[k] <>
}
totals[h] <>
}
The totals vector will now contain the number of balls drawn in 1000 simulated runs. You can increase this number by making h run for more iterations. Totals will approach the distribution of interest, and for the purposes of quantiles and the like you can pull them directly from the simulated data and they'll be very close to accurate.

Warning: with several thousand elements in p_i this may take some non-trivial computer time.
posted by shadow vector at 11:49 AM on August 14, 2008

Ah, whoops, HTML ate my less thans. Here:
totals <- NULL
successes <- NULL
for (h in 1:1000) {
for (k in 1:(length(p_i))) {
successes[k] <- rbinom(1,1,p_i[k])
}
totals[h] <- sum(successes)
}

posted by shadow vector at 11:52 AM on August 14, 2008

Response by poster: Thanks, guys. I know how to use simulations to approximate answers, but it occurred to me that I had enough information to calculate exact probabilities here. a robot made of meat's function does that quickly, which is exactly what I was looking for.

Should my data set get too much larger, the distribution info may also come in handy. Thanks again, all.
posted by chrisamiller at 7:39 PM on August 14, 2008

Response by poster: Oh, and yes, I know they aren't really exact probabilities, due to machine rounding, but I only need a few places of precision.
posted by chrisamiller at 7:41 PM on August 14, 2008

« Older Help finding a new place to live?   |   working with tar archives... Newer »