A Strange Bucket
November 6, 2008 1:39 AM   Subscribe

I have a bucket containing N marbles: M white marbles and N-M black marbles. I need to grab a handful of marbles (n) and figure out the probability of having picked up m white marbles.

At first, I thought I could use the hypergeometric distribution. But there's a complication, namely that the white marbles are not equally distributed in my bucket.

In other words, if my handful of marbles contains one white marble, I'm more likely to have picked up one or more additional white marbles in my hand, and this probability is different depending on how many white marbles I may have picked up.

Is there a good approach to modeling or simulating this kind of situation?
posted by Blazecock Pileon to Science & Nature (16 answers total) 6 users marked this as a favorite
My recollection of discrete statistics is pretty fuzzy, but is that going to be some form of a lattice distribution with the span and shift related to N and M? But it seems there's going to be a third parameter that describes the "clumpiness" of the white marbles...
posted by XMLicious at 3:01 AM on November 6, 2008

But there's a complication, namely that the white marbles are not equally distributed in my bucket.

Can you elaborate on this? Formulating a well-posed problem with such assumptions is usually half the battle. Isn't the bucket adequately mixed before picking?
posted by ghost of a past number at 4:24 AM on November 6, 2008

Response by poster: Can you elaborate on this?

The white marbles are "sticky" and tend to cluster together in groups.
posted by Blazecock Pileon at 4:46 AM on November 6, 2008

Sounds like you want to pick them one at a time and write the conditional distribution at each step, then multiply them out.
posted by a robot made out of meat at 4:49 AM on November 6, 2008

I formulated the problem like this:
P(white | last marble was white white and at least one white marble left) = .5
P(white | otherwise) = M'/N' where M' and N' are remaining marbles

and wrote a python program to simulate a bunch of runs:
from random import random

def nextmarble(m, n, l):
    if l and m:
        if random() > .5: return  m-1, n-1, 1
        else:             return    m, n-1, 0
    elif random() > 1. * m / n:
                          return  m-1, n-1, 1
    else:                 return    m, n-1, 0

def handful(N, M, n, m, t=100000):
    s = 0
    for i in range(t):
        Nprime, Mprime = N, M
        l = 0
        c = 0
        for j in range(n):
            Mprime, Nprime, l = nextmarble(Mprime, Nprime, l)
            c += l
        if c >= m: s += 1
    return s

print handful(10, 5, 3, 2)
for 10,5,3,2 the probability is about 0.48 and 100k runs complete very quickly. For 100,50,30,20 the probability is abougt 0.06 and the run took a few seconds.
posted by jepler at 5:09 AM on November 6, 2008 [2 favorites]

To make the problem easier, you might want to start with a jar that is of infinite size, so you don't have to deal with the probabilities changing at each step. Also, it might help to model the clumpiness as a markov process, so the probability of the (N+1)'th marble being white, conditional on the N'th marble being white, is P. Similarly, the probability of the (N+1)'th marble being black conditional on the N'th marble being black, is Q. I believe that if you make these assumptions then the solution with just involve summing a bunch of binomial distributions.
posted by thrako at 5:24 AM on November 6, 2008

The approach suggested by jepler sounds good, you can refine it a little bit by not setting

P(white | last marble was white white and at least one white marble left) = .5

which I think is somewhat unrealistic, but

P(white | last marble was white white and at least one white marble left) = N'/M' + p

where p is some (small?) probability bias for finding two whites in a row.
posted by ghost of a past number at 6:31 AM on November 6, 2008

You could model the bucket by setting up a grid that had N pixels (marbles), with the handgrab randomly sampling a patch of pixels with area size n.

You could try modeling the "stickiness" of the white marbles in the bucket space in a way similar to the Schelling simulators that showed how individual "stickiness" for similar individuals lead to segregation in neighborhoods. Though the original space is a random distribution of the pixels, a white pixel will move about in the space until it has at least one (or 2, or more, depending on the "stickiness" level you pick) other white pixels adjacent to it.

Maybe not what you're looking for, but I'd thought I'd toss the idea your way in case it helps your brainstorming.
posted by neda at 9:40 AM on November 6, 2008

This is an interesting question, BP. I am no math expert, so I try to solve problems like this by the seat of my pants, considering edge cases. If there is only one white marble and the rest are black marbles, it will act exactly like a random sample no matter what "N" is. However, as the proportion of black and white marbles approaches 50/50 it will act less and less like a random sample. Saying that the white marbles are sticky is the same as saying that the black marbles are sticky, since they will "avoid" each other. With a 50/50 distribution in the bucket, the problem of establishing probability varies with the total number of marbles (N) as well as "stickiness". If there is only a handful of marbles in the bucket, then you will get them all with one grab, and if there are "m" white marbles in the bucket, the probability will be one. On the other hand, if there are a million black marbles and a million white marbles, separated completely--say, left and right, or top and bottom, then probability is pretty much zero with one handful. You'd have to "sample" the whole bucket. So probability varies from one to zero. This is why establishing a random sample is so important in polling.
posted by weapons-grade pandemonium at 12:05 PM on November 6, 2008

Response by poster: Many thanks, jepler, I've been playing with your code, but when I compare results with the hypergeometric in a certain case, I don't get the same answer. If you have a moment, I'd be curious what you think.

Let's say that the conditional probability of picking white, given the last pick was white, is 0.5, as you use above.

For a probability of 0.5, I interprete that to mean that there is no bias to picking either white or black, correct? I.e., the white marbles do not "clump" if there are always even odds that the next marble I pick is black or white.

So, in using a conditional probability of 0.5, I think I should expect to get the same approximate result from this simulation as with using the hypergeometric distribution.

For the case (N, M, n, m) = (10, 5, 3, 2), the handful simulation returns ~0.48.

With R, I get the following:
> phyper(2, 5, 5, 3, lower.tail=FALSE)
[1] 0.08333333
Is there something that I'm missing here?
posted by Blazecock Pileon at 1:02 PM on November 6, 2008

Blazecock Pileon, I'm not sure that a conditional probability 0.5 is the same as "unbiased" in a sample as small as 5 white and 5 black. In the unbiased case, you have a 50% chance the first marble is white. If the first marble is black, you have p = 5/9 chance the second marble is white. And if the first two are black, you have p = 5/8 that the third marble is white. So the chance you get "at least one" white marble in three draws is 0.5 + (1-0.5)*5/9 + (1-0.78)*5/8 = 0.92, the complement of your unbiased hypergeometric 0.08.

Trying to verify this with my tools, I got confused about names. This result
> total_marbles = 10;
> white_marbles = 5;
> number_drawn = 3 ;
> number_white = 0 : number_drawn ;
> hygepdf( number_white, total_marbles, white_marbles, number_drawn )
ans =
       0.083333 0.416667 0.416667 0.083333
makes sense if "at least one white marble" means the same thing as "no more than two black marbles." I can tolerate that.

If p = 0.5, it's not a jar of marbles, it's a bunch of coin flips. But they're funny coin flips. After a heads (white), the coin is fair; after a tails (black) the coin is biased depending on how many heads and tails are "left" in the bucket. That's an interesting problem too, but it isn't the one you asked about. You need a better model for your "sticky" marbles.
posted by fantabulous timewaster at 5:32 PM on November 6, 2008

For instance, you might reach into your bucket with tweezers, grab a marble, pull it out. If it's black, it comes out alone. If it's white, it might have a friend or two stuck to it. If you put the extra back in the bucket, this isn't different from the unsticky case. If you throw the extra in the trash, you're biased against the white ones. If you keep the extra as your next marble, the number of marbles you draw from the bucket isn't something you control any more: if the (n-1)th marble is black, and the nth and oops! (n+1)th marbles are white, do you put them all back and start over? The hypergeometric distribution is symmetric if you swap "black" for "white" but not if you change (n,m) -> (n+1,m+1).

I'm curious about the context, if you'd like to elaborate.
posted by fantabulous timewaster at 5:50 PM on November 6, 2008

I have been thinking about this problem, and I haven't found a way to model the situation that isn't essentially arbitrary, so a good answer will really be one that matches the behavior you want.

Here are two other approaches:

(1) Start with the normal (hypergeometric) probabilities p(n) for m draws from a homogeneous bucket. Devise a formula to decrease the probabilities for some values, and increase others. For example, decrease probabilities for n in (m/4,3m/4) and increase elsewhere. This way, it is more likely to get fewer than average (away from a clump) or more than average (in a clump) than normal, but you still get something resembling the normal situation.

(2) Assume the number of balls is large, and the bucket can be thought of as having a continuous distribution of white to black ratio. Postulate some form for this distribution ( perhaps bimodal, with one large peak at low values for most of the bucket and one smaller at high values for clumps ), normalised to the appropriate average ratio. Draw a ratio value at random from this distribution and then do n tests with this probability of success for each ball draw.
posted by ghost of a past number at 11:47 PM on November 6, 2008

Response by poster: I'm curious about the context, if you'd like to elaborate.

I have a genomic sequence, a chromosome ("bucket"). The chromosome has regions ("marbles") on it of a particular type ("white marbles" = regions of interest, "black marbles" = everything else).

I have another categorization that is essentially a sample of the chromosome ("handful of marbles"). I'd like to know what is the likelihood of sampling regions of interest.

The issue is that the regions of interest tend to cluster together. They are, for lack of a better word, "sticky".

So my sample, if it contains a region of interest, will be biased to having more regions of interest than if those regions of interest were randomly dispersed throughout the chromosome.
posted by Blazecock Pileon at 1:36 PM on November 7, 2008

If sampling is random, then the location on the genome doesn't matter and there's no autocorrelation. Imagine that you randomly shuffled the regions to start with. If sampling is a contiguous block then it depends on the distribution of the regions of interest. What you'd be sampling over then would be starting points, and you'd need to be able to specify something about how fun regions scatter.

Would it be sane to say that the interestingness of regions acted like a markov chain with unequal transmission probabilities?
posted by a robot made out of meat at 2:12 PM on November 7, 2008

You might have an XY problem: I am not sure the sticky marble analogy is a good one.

It sounds like you might be analyzing data from some type of shotgun genome sequencing? In that case your genome is a string with N ~ 106 characters. An "interesting region" might run along the chromosome between offsets a and b; your sequencer spits out substrings of length s ~ 1000 with random starting offsets. You have a guess about what total fraction of the chromosome is interesting. In that case you might be conflating two related-but-different questions:
  1. How many sequence fragments overlap an interesting region somewhere?
  2. On a particular sequence fragment, how many interesting regions will you find?
These are questions about two different things. For the first you need to know about the largest clumps of interesting regions (at least, smaller than your samples). I think you're right to use the hypergeometric distribution here. For the second, you need to know something about the length and spacing of interesting regions within their groups. You might try to find (or invent) a correlation function or a structure function for your interesting regions.
posted by fantabulous timewaster at 5:12 PM on November 7, 2008

« Older Toscanini landing in a cornfield   |   Best Distance Learning schools with Computer... Newer »
This thread is closed to new comments.