How many to (probably) catch 'em all?
September 10, 2010 12:10 PM

Statistics question: How many tries are necessary for a 99.9% chance of full coverage of a heterogeneous population?

Here are the specifics. We have a library of 4000 pieces of DNA, each one different, but all mixed together. How many samples do we have to take (technically, how many colonies should we plate) to have a 99.9% chance of representing each piece at least once?

I found an equation that indicated this would work:

N = ln(1-P)/ln(1-f)

in which P is the desired probability and f is the fractional proportion. But plugging in 0.999 for P and 1/4000 for f returns about 28 thousand, which intuitively seems quite low to us.
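
That arithmetic, checked in Ruby (natural logs, though any base works since it's a ratio):

puts Math.log(1 - 0.999) / Math.log(1 - 1.0/4000)
# => ~27628, i.e. the "28 thousand" above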

In our minds, this seems like saying that if you have 4 species and you pick 7 times, you will get them 999 times out of a thousand. Again, it smells wrong.

Thanks in advance for your time and help.
posted by Jorus to Science & Nature (26 answers total) 2 users marked this as a favorite
 
Response by poster: Sorry - to clarify:

if you have 4 species and you pick 7 times, each will be represented at least once 999 times out of a thousand - that smells wrong.
posted by Jorus at 12:12 PM on September 10, 2010


Best answer: I'm kicking myself for not being able to come up with the right formula here, but I just did a quick Monte Carlo simulation and, after 100 trials, came up with an average of 35202.67 picks before finding all 4000.


Here's the Ruby code:
class Array
  def sum
    return self.inject(0){|sum,item| sum + item}
  end
end

# prep all 4000 numbers, starting as unseen
a = {}
0.upto(3999){|i|
  a[i] = 0
}

count = 0
# flip each one when it's found; run until we've found them all
while (a.values.sum < 4000)
  a[rand(4000)] = 1
  count += 1
end

puts count
posted by chrisamiller at 12:48 PM on September 10, 2010


This paper gives the probability of the number of zero slots (p. 271); plug in zero for that count and search over the number of draws until you reach the probability you want. The search term for more modern methods is "multinomial distribution minimum"; I didn't see software for the newer ones. Rao's formula quickly runs into numerical problems for large n when thrown naively into R; some thought, or substituting Stirling's approximation for the factorials, might keep it from dying.

Other papers that might tell you how to do it:
"Saddlepoint Approximation for Multivariate Cumulative Distribution Functions and Probability Computations in Sampling Theory and Outlier Testing"
"Saddle-point Methods for the Multinomial Distribution"
"The exact distribution of the maximum, minimum and the range
of Multinomial/Dirichlet and Multivariate Hypergeometric
frequencies"
posted by a robot made out of meat at 1:04 PM on September 10, 2010
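
For reference, the exact full-coverage probability can be computed by inclusion-exclusion; here is a minimal Ruby sketch, done in log space (via lgamma) to dodge the kind of numerical blow-ups mentioned above. This is the standard multinomial-coverage identity, not Rao's formula itself:

# P(every one of n pieces is seen at least once in t draws):
#   P = sum over k of (-1)^k * C(n,k) * (1 - k/n)^t
# Each term is computed as exp(log-term) so the huge binomial
# coefficients and tiny powers never overflow or underflow.
def p_full_coverage(n, t)
  prob = 1.0                 # the k = 0 term
  sign = -1.0
  (1...n).each do |k|
    log_term = Math.lgamma(n + 1)[0] - Math.lgamma(k + 1)[0] -
               Math.lgamma(n - k + 1)[0] + t * Math.log(1.0 - k.to_f / n)
    term = Math.exp(log_term)
    prob += sign * term
    break if term < 1e-18    # terms shrink fast when t is this large relative to n
    sign = -sign
  end
  prob
end

puts p_full_coverage(4000, 61000)   # => ~0.999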


This isn't my area of expertise, and so I'm taking a shot in the dark, but it sounds very similar to the Coupon Collector's Problem.
In probability theory, the coupon collector's problem describes the "collect all coupons and win" contests. It asks the following question: Suppose that there are n coupons, from which coupons are being collected with replacement. What is the probability that more than t sample trials are needed to collect all n coupons?
The formula for calculating the expected value is straightforward. The Wikipedia article also points out how to calculate the variance, and how to bound the probability of needing more than a given number of trials.
posted by jasonhong at 1:25 PM on September 10, 2010


Best answer: Although the exact formulas exist, I definitely agree that, as with most probability problems in applied work, you should just simulate for the answer.
posted by a robot made out of meat at 1:26 PM on September 10, 2010


Best answer: I also ran a Monte Carlo simulation in Mathematica and got results very similar to chrisamiller's. In addition, you seem to have made an error in your smaller case. If you want a 99.9% probability of sampling all of four species, the formula suggests you'd need to pick ln(.001)/ln(.75) or about 24 times--not 7 times as suggested above. I can't source your formula easily, but the results for the smaller case seem intuitively correct, and the results for the bigger case line up well with the Monte Carlo exercise, so I'd be inclined to go with it.
posted by muhonnin at 1:29 PM on September 10, 2010


Think about the chance of missing a single member of the mixture. Each time you pick, the chance is (n-1)/n or 3999/4000 that you failed to pick that member. If you pick twice, the chance is now (3999/4000)^2. You'd like the probability to be 0.1% that this single member is missed after all the picking so

(3999/4000)^x = 0.001

Take the log of both sides, use the rules of exponents and you get

x = (log 0.001)/(log(3999/4000)) ≈ 27,628

So if you sample the mixture of 4000 DNA sequences 27,628 times, any one particular sequence has a 99.9% chance of having been hit; across all 4000 sequences, that works out to roughly 3996 of them hit, on average.
posted by Durin's Bane at 1:32 PM on September 10, 2010


Best answer: I can't source your formula easily

The source is perhaps this paper, which is for one genome.
posted by Blazecock Pileon at 1:32 PM on September 10, 2010


If your question hasn't been answered already, can you clarify the experimental setup?

When you plate a colony, there is only one 'piece' of DNA in that colony?
There are 4000 unique pieces of DNA; are there multiple copies of each piece / strand?
Are all pieces of DNA equally likely to be picked from the 'soup' of DNA?
posted by JumpW at 1:33 PM on September 10, 2010


The coupon collector's problem is the negative binomial vs. binomial contrast: CCP asks "given a target, how many draws?", whereas what you want is "given a number of draws, how many failures?"
posted by a robot made out of meat at 1:37 PM on September 10, 2010


Durin's Bane: he wants that for every piece. At the least you'd need to adjust for the 4000 opportunities for failure. In fact they aren't independent, which is what makes the calculation difficult. Assuming independence might be a good enough approximation for applied work with large n.
posted by a robot made out of meat at 1:41 PM on September 10, 2010
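
Spelled out as a quick Ruby sketch (pretending the 4000 chances to miss are independent, which, as just noted, they aren't quite):

t = 60800
miss_one = (3999.0 / 4000) ** t    # chance one particular piece is never drawn
puts (1 - miss_one) ** 4000        # => ~0.999, chance that no piece is missed

Under this approximation, t near 60,800 is where the full-coverage probability reaches 99.9%, which lines up with the numbers further down the thread.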


I don't know if I've completely understood the set-up of your problem. Your formula may be rewritten as P = 1 - (1-f)^N. This indicates to me that the problem behind this formula is something like the following: if each event has a probability of success f (for some definition of "event" and "success"), then N is the number of events we need to experience to have probability P of obtaining at least one success. That does not seem to jibe with your description of "representing each piece at least once" (emph. added), so I'm not sure this is the correct formula to apply. Moreover, setting f = 1/4000 seems to indicate that whatever event you're looking at (taking a sample? from where?) has a probability of success (what does "success" mean here?) of 1/4000. Is this a correct model for your situation?

Would it be possible for you to rephrase your question in more abstract terms (e.g.: balls in urns)?
posted by mhum at 1:43 PM on September 10, 2010


All right, I ran 1000 trials using the above method. The mean value is 35548.73 and the median 34846, but what you're most interested in is the 99.9th percentile, which falls at 52085.

So, if you want to be 99.9% sure, you'll need to take about 52k samples.

Tell you what, just for the hell of it, I'll run another 9k trials to refine the numbers a little more, but these are going to be reasonably accurate.
posted by chrisamiller at 1:50 PM on September 10, 2010
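
A Ruby sketch of that percentile procedure (not chrisamiller's exact script; the helper name and trial count here are illustrative). It tracks the number of still-unseen pieces instead of re-summing the whole hash each draw, so thousands of trials finish quickly:

def draws_until_all_seen(n = 4000)
  seen = Array.new(n, false)
  remaining = n
  count = 0
  while remaining > 0
    i = rand(n)
    unless seen[i]
      seen[i] = true
      remaining -= 1
    end
    count += 1
  end
  count
end

# sort the per-run totals and read off the empirical 99.9th percentile
results = Array.new(1000) { draws_until_all_seen }.sort
puts results[(0.999 * results.length).floor]

Note that the 99.9th percentile of only 1000 runs is essentially the sample maximum, which is why the estimates in this thread move around as more trials are added.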


I wrote something up in Python that may or may not be correct. It's pretty hacky, because you have to adjust "times" manually (my code is horrendously slow if you attempt to do that automatically), but answers in the 58,000 range seem to give you a 99.9% chance of sampling every piece of DNA at least once.
posted by ccrazy88 at 2:01 PM on September 10, 2010


Addendum: After running this some more, I think I would bump the number of samples up to 60,000 or so, just to be safe.
posted by ccrazy88 at 2:06 PM on September 10, 2010


Again, I'm not an expert in this area, so a robot made out of meat may be correct. However, if you take the equation on the Coupon Collector's page on Wikipedia, you get:

4000*(ln 4000) + 0.577*4000 + 0.5 ≈ 35,484 trials

Which seems to match the mean values that others have posted above.
posted by jasonhong at 2:13 PM on September 10, 2010
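
That arithmetic in Ruby, with the Euler-Mascheroni constant carried to more digits (the small difference from 35,484 above is just the rounding of 0.577):

n = 4000
gamma = 0.5772156649  # Euler-Mascheroni constant
puts n * Math.log(n) + gamma * n + 0.5  # => ~35485.6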


Best answer: The reason the "mean" value given by the coupon collector's equation won't answer the question is that it represents the expected number of trials after which you'd have sampled all of the pieces of DNA. That is not the same as being 99.9% certain that after 35,484 trials or so you will have sampled each piece of DNA at least once. Rather, it means that if you were to keep carrying this sampling out to completion, you would on average finish around trial 35,484. For any given run, the actual point at which you've seen all the pieces of DNA may come before or after that.

chrisamiller, by taking the 99.9th percentile of his samples, is on the right track. Since the numbers are quite large, though, he should throw more trials at the problem to combat the large variance in the samples. I'm pretty confident the answer will converge in the 60,000 range.
posted by ccrazy88 at 3:04 PM on September 10, 2010


Response by poster: I agree that my first analogy was wrong. It should be 28 tries for 4 species, which sounds less crazy.

Apologies for not sourcing the original equation. I will add the primary literature cite tomorrow morning, but I found it in the Sambrook & Russell chapter on making a genomic library.
posted by Jorus at 3:18 PM on September 10, 2010


Response by poster: JumpW, each colony is assumed to have one piece. All pieces are assumed to have an equal probability of resulting in a colony (though in actuality this may not be true). There are many copies of each piece in the soup.
posted by Jorus at 3:25 PM on September 10, 2010


Best answer: Right - ccrazy has it - the mean value given by the coupon collector's problem won't work here.

I finished 10k trials, and the 99.9th percentile is at 64058.

So if you do 65k samples, then you can be pretty confident that you're getting them all.
posted by chrisamiller at 3:50 PM on September 10, 2010


Best answer: I ran a simulation of 50K trials; the 99.9th-percentile run had seen each kind of piece after 60904 samples.
posted by parudox at 6:52 PM on September 10, 2010


Yes, the mean won't work here because it's just the mean value; sorry, I thought that was obvious, so I didn't state it.

However, my point was that if my mean matches the mean expected values other people are getting, then it's likely that my conjecture that this is the coupon collector's problem is correct, and there is possibly a closed-form formula for *guaranteeing* 99.9% without having to run simulations.

For example, the Wikipedia page points out how to calculate upper bounds, which would give a mathematically proven guarantee.

At any rate, this is more of an academic point, since the simulations will probably be faster.
posted by jasonhong at 7:36 AM on September 11, 2010
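
Concretely, a sketch of the union-bound argument behind those upper bounds: the chance that some piece is still unseen after t draws is at most n*(1 - 1/n)^t ≤ n*e^(-t/n). Setting t = n ln n + c*n makes that bound equal to e^(-c), so for a 0.001 failure chance take c = ln 1000 ≈ 6.91, giving 4000*(ln 4000) + 6.91*4000 ≈ 60,800 draws - right in line with the simulations here.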


Response by poster: Thank you all for the help. It's amazing - BP even correctly identified the primary source for my original equation.

I'll recommend that we aim for 60-65k.
posted by Jorus at 7:41 AM on September 11, 2010


Best answer: Oh, of course: when the number of samples is large compared to the number of slots, the Poisson approximation will work. If you want to redo this at some other probability p of missing none, define N as the number of genes in the pool and M as the number of samples you take; then

M = N ln(-N/ ln(p))

Plugging it in with N=4000 and p=.999 in R:
4000*log(-4000/log(.999))
[1] 60805.22

With 65k samples you get p = 0.99965.
posted by a robot made out of meat at 8:06 PM on September 11, 2010 [1 favorite]
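
A sketch of where that formula comes from: after M draws, each of the N pieces is still unseen with probability (1 - 1/N)^M ≈ e^(-M/N), so the expected number of unseen pieces is N*e^(-M/N). Treating the unseen count as approximately Poisson gives P(none unseen) ≈ exp(-N*e^(-M/N)); setting that equal to p and solving for M yields the M = N ln(-N/ln(p)) above.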


Response by poster: Thanks - it's always nice when independent methods come to a similar answer.
posted by Jorus at 6:04 AM on September 13, 2010


This thread is closed to new comments.