Ze ball, if you please
August 7, 2008 11:55 AM

Statistics Filter: How to pick a red ball from n buckets, when there could be multiple red balls in one or more of the buckets, and multiple bucket sets?

I am looking to calculate the significance of finding a red ball in a generic bucket, when I have n buckets.

Currently, I employ a sampling method to generate a z-score:

1. I go through my n buckets methodically and look for the observed frequency of a red ball in all buckets.

2. I calculate an expected frequency by shuffling or shaking a bucket thousands of times and trying to find a red ball in the bucket. (In reality, these are not red balls but a particular substring of letters. One metaphor is that shaking the bucket might cause a red ball to turn green, or vice versa.) This is sampling without replacement -- i.e., a permutation test.

My z-score = (red ball_observed - red ball_expected) / s.d.(red ball_expected).
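As a rough illustration of the two steps and the z-score above, here is a minimal Python sketch. It assumes each bucket is a plain string and the "red ball" is a literal substring; the names `count_matches` and `permutation_z` are made up for this example and are not from any library. Overlapping matches are counted, which is one way to handle several red balls in a single bucket.

```python
import random
import re
import statistics

def count_matches(seq, pattern):
    # Count matches, including overlapping ones, via a zero-width lookahead.
    return len(re.findall(f"(?={pattern})", seq))

def permutation_z(buckets, pattern, n_shuffles=1000, seed=0):
    rng = random.Random(seed)
    # Step 1: observed count of the pattern across all buckets.
    observed = sum(count_matches(b, pattern) for b in buckets)
    # Step 2: null distribution from shuffling the letters within each bucket.
    null_counts = []
    for _ in range(n_shuffles):
        total = 0
        for b in buckets:
            letters = list(b)
            rng.shuffle(letters)
            total += count_matches("".join(letters), pattern)
        null_counts.append(total)
    mean = statistics.mean(null_counts)
    sd = statistics.pstdev(null_counts)
    # z is undefined if every shuffle gave the same count.
    z = (observed - mean) / sd if sd > 0 else float("nan")
    return observed, mean, sd, z

observed, mean, sd, z = permutation_z(["acgtacgtacgt", "ttttacgtaaaa"], "acgt")
print(observed, round(mean, 3), round(sd, 3), round(z, 1))
```

Dividing the counts by the number of buckets (or by total letters) before computing z would turn them into frequencies; the z-score itself is unchanged by that rescaling.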

I can use this approach to generate z-scores for finding red balls from different sets of buckets (say, bucket set A and bucket set B), with different numbers of buckets.

I would like to compare (rank) z-scores for red balls between bucket-sets A and B, however.

I find that the observed frequency is complicated by situations where more than one red ball is found in a bucket, e.g. one bucket in set A may have three red balls, and another bucket may have none. Additionally, I will likely be dealing with different numbers of buckets between two or more sets of buckets.

Are there strategies for correcting the observed and expected frequencies with these complications in mind, so that I can generate comparable significance scores?
posted by Blazecock Pileon to Science & Nature (11 answers total)

Is this something real, or are you actually concerned about red balls in buckets?

Because the way you go about doing similar things might vary for different kinds of variables, or what you're describing with red balls and buckets might actually be something simple and canned in Stata or R.
posted by ROU_Xenophobe at 12:26 PM on August 7, 2008

Shit, never mind. Delete that last.
posted by ROU_Xenophobe at 12:27 PM on August 7, 2008

Response by poster: Is this something real, or are you actually concerned about red balls in buckets?

I'm just trying to simplify the description of the problem. The "red balls" are really a pattern of letters, derived from a four-letter alphabet. I might find that pattern as a substring in a string of letters (my "bucket"). I will have multiple strings ("multiple buckets") which are all derived from one "super-bucket" (a really long string).

posted by Blazecock Pileon at 12:36 PM on August 7, 2008

Best answer: ignoring any issues of bias in your expected values, this seems to be equivalent to old-fashioned chi-squared. if that's the case, then you'd compare sum(z^2)/n across different numbers of buckets (your z is just a log-likelihood, assuming gaussian errors).
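A small sketch of the sum(z^2)/n comparison, assuming you have one z-score per bucket in each set (that per-bucket reading is an assumption, as is the function name `mean_squared_z`). If the z-scores really were standard Gaussian under the null, this quantity would average about 1 whatever the number of buckets, which is what makes it comparable across sets.

```python
def mean_squared_z(z_scores):
    # sum(z^2) / n: a chi-squared statistic divided by the number of buckets.
    if not z_scores:
        raise ValueError("need at least one z-score")
    return sum(z * z for z in z_scores) / len(z_scores)

# Made-up z-scores for two bucket sets of different sizes.
set_a = [1.0, -2.0, 3.0]
set_b = [2.0, 2.0]
print(mean_squared_z(set_a), mean_squared_z(set_b))
```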

for small numbers of balls that's not going to work because you're assuming it's gaussian when it's not. it's poisson? there's a correction for that, but i can't remember what it is.

if this is any use (i'm not an expert on stats in general, but this seems to be equivalent to image analysis in astronomy when you have small numbers of photons) i can find out more about the correction later tonight (ie ask my partner :o).

also, is this relevant? i found it here. it addresses comparing numbers of words in different collections.
posted by not sure this is a good idea at 1:02 PM on August 7, 2008

I'm just trying to simplify the description of the problem. The "red balls" are really a pattern of letters, derived from a four-letter alphabet.

Do you mean you're analyzing DNA and trying to find the significance of seeing a particular DNA fragment in different sections of a genome?

Or do you mean that you're analyzing text?

Because the two would have very different null data generating processes.

In generalities, I'd say that:

First, a z-value or another statement of significance is always only relative to some specific null data generating process. A z-value or other statement of significance says "It would be THIS unlikely to see the value that I observed GIVEN THAT the data were actually generated by the null data generating process, so we should think that some other process generated the data."

Second, a null can be anything reasonable. What's reasonable can depend on the specifics of what you're studying. If it's DNA, I don't know whether a null of simple random selection is sensible, or a null that says "every organism of this species has the same string of base pairs here." Likewise, a string of actual text might have yet some different sensible null.

Third, when in doubt, simulate the null.
posted by ROU_Xenophobe at 1:04 PM on August 7, 2008

Also more generally, depending on the assumptions you're willing to make about the data, you might be able to use canned techniques if you're analyzing the red balls as a dependent variable.

If you think of it as "there's a red ball, or there ain't," then you can use any of the dichotomous techniques (logit probit gompit scobit), depending on how you feel about the other assumptions these models make about the data.

If you think of it as "there's generally either no red balls, one red ball, two red balls, etc, but not generally a whole fucking bunch of red balls," then you can use count models (poisson regression, negative binomial regression), again depending on how you feel about the other assumptions these models make about the data.

But both of those kinds of techniques assume that you're trying to analyze variation in something, not just establish that there is variation.

If you just want to say "I think there really are more red balls in this bucket than in that other bucket, and not just from random fluctuation," then I'd just generate a million buckets using my null data generating process and compare my actual buckets to the distribution of simulated buckets, or generate a million pairs of buckets using your null data generating process and look at the difference directly. Or it might be that it's standard in your field to make a couple of heroic assumptions and use some canned distribution.
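For what it's worth, the "generate a pile of buckets from the null and compare" idea might look something like the sketch below. The null here is i.i.d. uniform letters over a four-letter alphabet, which is only a placeholder assumption; swap in whatever null process makes sense for the data. The names `simulate_bucket` and `empirical_p` are invented for the example.

```python
import random
import re

def count_matches(seq, pattern):
    # Count matches, including overlapping ones.
    return len(re.findall(f"(?={pattern})", seq))

def simulate_bucket(length, rng, alphabet="acgt"):
    # Placeholder null process: independent, uniformly chosen letters.
    return "".join(rng.choice(alphabet) for _ in range(length))

def empirical_p(observed_bucket, pattern, n_sims=10000, seed=0):
    rng = random.Random(seed)
    observed = count_matches(observed_bucket, pattern)
    at_least = 0
    for _ in range(n_sims):
        sim = simulate_bucket(len(observed_bucket), rng)
        if count_matches(sim, pattern) >= observed:
            at_least += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least + 1) / (n_sims + 1)

print(empirical_p("acgtacgtacgtacgt", "acgt", n_sims=2000))
```

The result is the share of simulated buckets with at least as many matches as the real one, so it is only as meaningful as the null behind `simulate_bucket`.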
posted by ROU_Xenophobe at 1:13 PM on August 7, 2008

I would like to compare (rank) z-scores for red balls between bucket-sets A and B, however.

Use your sampling procedure to get p-scores instead (i.e. expected proportion of samples yielding a red ball.) Depending on your goals, these may be more comparable.

Be careful about using shuffling as a null hypothesis in a text processing situation. In biological sequence analysis, shuffling is often a very poor null hypothesis, leading to severe overestimation of significance.
posted by Coventry at 1:18 PM on August 7, 2008

Response by poster: Do you mean you're analyzing DNA and trying to find the significance of seeing a particular DNA fragment in different sections of a genome?

Not just a fragment, but a pattern of fragments, e.g. one pattern might be `agccNcct`. Given the alphabet {A, C, G, T} there are four matches: `agccAcct`, `agccCcct`, `agccGcct`, and `agccTcct`.
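To make the pattern concrete, here is a short Python sketch that expands `agccNcct` into its four matches and counts occurrences in a string. It assumes `N` is the only wildcard in use and that it means "any of a, c, g, t"; the helper names are made up for this example.

```python
import re

# Only the wildcard from the example above; other ambiguity codes are not handled.
WILDCARDS = {"a": "a", "c": "c", "g": "g", "t": "t", "n": "[acgt]"}

def pattern_to_regex(pattern):
    return "".join(WILDCARDS[ch] for ch in pattern.lower())

def expand(pattern, alphabet="acgt"):
    # List every concrete string the pattern can match.
    results = [""]
    for ch in pattern.lower():
        choices = alphabet if ch == "n" else ch
        results = [r + c for r in results for c in choices]
    return results

def count_pattern(seq, pattern):
    # Lookahead so that overlapping matches are all counted.
    return len(re.findall(f"(?={pattern_to_regex(pattern)})", seq.lower()))

print(expand("agccNcct"))
print(count_pattern("ttagccacctggagccgcctaa", "agccNcct"))
```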

I am trying to estimate the significance of that pattern, but the frequency of that pattern is complicated by how different "buckets" carry that pattern, not to mention different "buckets" between sets (specifically, different tissue cell types give rise to different sets).

I'd just generate a million buckets using my null data generating process and compare my actual buckets to the distribution of simulated buckets, or generate a million pairs of buckets using your null data generating process and look at the difference directly.

Okay, I would generate random buckets (substrings of similar length to my observed buckets) and derive an empirical p-value from that. If I did this for different bucket sets (buckets being of different size and constitution), are the significance results between sets comparable?

In biological sequence analysis, shuffling is often a very poor null hypothesis, leading to severe overestimation of significance.

Is there a reason for this or literature to describe problems? I was always told that random sampling with replacement was less preferable than permutation, for the same reason.

posted by Blazecock Pileon at 1:40 PM on August 7, 2008

I'm trying to show that this pattern appears more often than we think it would if nothing interesting were happening. And I'm trying to show that it appears even more more-often-than-it-should in Tissue Type A than it does in Tissue Type B.

Then, first, you need to specify how often you would see it if nothing interesting were happening. That is, define a null.

Maybe a uniform distribution of the base pairs is a good null. Maybe it isn't, because if nothing interesting were happening we'd expect all organisms to either have it or not have it apart from low rates of mutation (so the null distribution of base pairs would be nonuniform). What the right null is depends on what an uninteresting, but not ridiculously out of place, data-generating process is in your particular context. This part is not a stats question. This is a bio question.

The second problem seems trickier. You're sort of asking "I get a p-value of .08 for tissue type A, and .04 for tissue type B. Is this a significant difference?" So you're asking what are the confidence bounds around your confidence bounds.

Again, this is a bio question not a stats one, but you might find it easier, if it makes sense in bio, to ask a different question: I get the pattern 5 times in tissue type A and eight times in B. Is this difference significant? This you can easily simulate, once you nail down your null process, by generating a joint null distribution for A and B. That is, each iteration draws an A and a B. Then you have a distribution of A-and-B that you can look at to gauge the significance of your particular A-and-B.
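A hedged sketch of that joint simulation: each iteration draws a whole set A and a whole set B from the same null, and the observed difference in counts is compared with the simulated differences. The i.i.d. uniform null, the bucket lengths, the pattern, and the function names are all assumptions for illustration; the counts 5 and 8 are the ones from the example above.

```python
import random
import re

def count_matches(seq, pattern):
    # Count matches, including overlapping ones.
    return len(re.findall(f"(?={pattern})", seq))

def random_seq(length, rng, alphabet="acgt"):
    # Placeholder null process: independent, uniformly chosen letters.
    return "".join(rng.choice(alphabet) for _ in range(length))

def joint_null_p(lengths_a, lengths_b, pattern, observed_a, observed_b,
                 n_sims=2000, seed=0):
    rng = random.Random(seed)
    observed_diff = abs(observed_a - observed_b)
    extreme = 0
    for _ in range(n_sims):
        # One iteration = one simulated set A and one simulated set B.
        sim_a = sum(count_matches(random_seq(n, rng), pattern) for n in lengths_a)
        sim_b = sum(count_matches(random_seq(n, rng), pattern) for n in lengths_b)
        if abs(sim_a - sim_b) >= observed_diff:
            extreme += 1
    # Two-sided empirical p-value with an add-one correction.
    return (extreme + 1) / (n_sims + 1)

# Hypothetical bucket lengths for the two tissue types; pattern is a stand-in.
print(joint_null_p([200, 200, 200], [200, 200, 200, 200], "acg", 5, 8, n_sims=500))
```

Because the simulated sets use the real bucket counts and lengths, differences in the number or size of buckets between A and B are built into the null distribution rather than corrected for afterwards.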
posted by ROU_Xenophobe at 2:24 PM on August 7, 2008

Best answer: Is there a reason for this or literature to describe problems?

The reason is that shuffling destroys local structure which makes matches more likely. There's not much literature on it. Even though it's clearly a serious problem, most people sweep it under the rug and use a per-site i.i.d. model anyway (which is even worse than shuffling) because no one really knows what to do about it. Significance tests with more sophisticated null models always seem to benchmark as less sensitive.
posted by Coventry at 3:40 PM on August 7, 2008

Best answer: If you are dealing with vertebrate sequence, you might want to sample dinucleotides rather than single nucleotides to deal with the marked difference in CpG frequency to C frequency * G frequency.
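One possible reading of "sample dinucleotides", sketched in Python: fit letter-to-letter transition counts on the real sequence and generate null sequences from that first-order Markov chain, so a rare pair such as CpG stays rare in the simulated data. This is an assumption about what is meant, it reproduces dinucleotide frequencies only approximately, and the function names are invented for the example.

```python
import random
from collections import Counter, defaultdict

def fit_transitions(seq):
    # Count how often each letter is followed by each other letter.
    trans = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1
    return trans

def sample_markov(seq, length, rng):
    trans = fit_transitions(seq)
    current = rng.choice(seq)  # start letter drawn by single-letter frequency
    out = [current]
    while len(out) < length:
        nxt = trans.get(current)
        if not nxt:
            # Dead end (letter seen only at the very end): restart at random.
            current = rng.choice(seq)
        else:
            letters = list(nxt)
            weights = [nxt[x] for x in letters]
            current = rng.choices(letters, weights=weights)[0]
        out.append(current)
    return "".join(out)

rng = random.Random(0)
print(sample_markov("acgtacgtacgta", 30, rng))
```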

As Coventry said, however, a really good answer to this question might not exist yet.
posted by grouse at 10:26 AM on August 8, 2008
