April 22, 2007 5:31 PM

[StatsFilter] Which statistical technique do I need and how do I use it?

Believe it or not, this is not a student's homework problem. I work in a biology lab and have made some observations. Now I need some stats to help me proceed. I'm just not sure which technique to use and how to set up the correct formula. Alas, it's been nearly a decade since my Intro to Statistics class. Online tutorials are short on example applications.

Here's what I got:

Of 635 cells in population A, 39 are "neato" positive. (That makes a 5.67% frequency I wish to assume for the below populations)

Of 109 cells in population B, 0 are "neato" positive.

Of 3 cells in population C, 0 are "neato" positive.

How significant is this zero result in population B? (i.e., how would I set up a formula to determine its p-value? Or is there something else I need to do to determine significance?)

Lastly, the zero result in population C is obviously not very important in light of the extremely small sample size. How many more cells would I need to observe to make it significant, considering the 5.67% frequency of population A?

Thanks MeFi! Answers are great, but so is pointing me in the right direction.
posted by dendrite to Science & Nature (20 answers total)


I'm assuming that each cell is either neato+ or neato-, rather than neato being a graded trait with more than two levels.

Anyway, this looks like a good use of the chi-square test. Here's how to do it in the free R statistical language:


`> chisq.test(matrix(c(39, 635-39, 0, 109), ncol=2))`

Pearson's Chi-squared test with Yates' continuity correction

data: matrix(c(39, 635 - 39, 0, 109), ncol = 2)

X-squared = 5.8825, df = 1, p-value = 0.01529

posted by grouse at 5:52 PM on April 22, 2007

Observed vs. expected frequencies on a qualitative variable "neato-ness" brings the very simple statistic chi-square to mind.

posted by Rumple at 5:53 PM on April 22, 2007


I'm no stats person and don't even know what a p-value is, but it seems that what you are asking is "Given that the probability is 5.67%, what is the chance of getting 0 in a population of 109?" Right?

If so, you can use a binomial probability calculator. n=109, k=0, p=.0567. If so, it looks like there is less than a 1% chance of getting fewer than 1 positive cell in a population of 109 under a binomial distribution.
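For reference, that binomial calculation can be sketched in R. This is only a sketch: it takes the 5.67% frequency quoted in the question as an exact, known rate (note that 39/635 itself works out to about 6.1%), and the 0.05 cutoff and the variable names are assumptions for illustration.

```r
# Assumption: the 5.67% rate from the question is treated as exactly known.
p_pos <- 0.0567

# Probability of seeing 0 positive cells among 109 if the rate were 5.67%.
p_zero_B <- dbinom(0, size = 109, prob = p_pos)  # same as (1 - p_pos)^109
print(p_zero_B)

# Population C: the smallest sample size at which an all-negative count
# would fall below an assumed 0.05 cutoff, i.e. (1 - p_pos)^n < alpha.
alpha <- 0.05
n_needed <- floor(log(alpha) / log(1 - p_pos)) + 1
print(n_needed)
```

Under these assumptions the cutoff lands at around fifty cells in total, so a count of 3 cells cannot settle anything either way.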

posted by vacapinta at 5:54 PM on April 22, 2007


All excellent answers so far.

Yes, grouse...cells are either neato+ or neato- and not something in between.

posted by dendrite at 6:03 PM on April 22, 2007


Seems like binomial might be better than chi-squared, no? Maybe a p-hat confidence test?

posted by jckll at 6:08 PM on April 22, 2007


The problem with an exact binomial test is that it assumes that the 5.67% level is a true value rather than something estimated from a sample of 635. If you use a chi-square test, your confidence increases as you sample more cells from population A or B. With an exact binomial test, it only increases as you sample more cells from population B.
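One way to see how uncertain the population A rate is would be an exact (Clopper-Pearson) confidence interval from `binom.test` in base R. A minimal sketch, assuming the 635 counted cells behave as independent draws, which is itself an assumption the data cannot confirm:

```r
# Exact binomial confidence interval for the neato+ rate in population A.
# Assumption: the 635 cells are independent observations.
ci_A <- binom.test(39, 635)$conf.int
print(ci_A)

# The interval spans a range of plausible rates, so a test against
# population B that treats any single value as exact ignores this spread.
```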

posted by grouse at 6:23 PM on April 22, 2007


Well he did say that we're talking about populations here...in which case you could assume that 5.67% is a true value, if he checked every member of population A, as I read the question.

Or it's possible that OP is misusing "population" in the statistical sense.

posted by jckll at 6:26 PM on April 22, 2007


"Misuse" seems a bit harsh to me, since I never read that the OP was trying to portray these samples as being the whole population. Interpreting it as such doesn't make any sense—if you say that the whole of population A only contains 635 cells, then ipso facto any cells in population B cannot be members of population A.

posted by grouse at 6:33 PM on April 22, 2007


I should say that the binomial answer is correct if you really want to assume that the 5.67% frequency is a true value, as the question is written. But I think that assumption is wrong, and I am interpreting it as an aid to understanding the question rather than the soul of it.

posted by grouse at 6:37 PM on April 22, 2007


Not to get too chatty here, but...you may very well be right. I'm no statistical expert; I just remember a few things here and there. The way I read the question was 3 separate populations: OP wants to assume that the proportion is the same across the 3 populations, and wants to test that hypothesis.

Of course it could be 3 samples from a larger population, that's just not how I read it.

posted by jckll at 6:38 PM on April 22, 2007


For what it's worth: Populations A, B, and C are mutually exclusive. A member of one cannot be a member of another.

The number "635" of Population A is the total number of cells of a particular type in a section of tissue. I counted all members of population A in the entire section on the slide.

So for the purpose of the statistical test, yes I checked every member of population A.

However, a slide of tissue is only a sample from the entire organ in the animal. So if we're speaking biologically, then 635 is a minuscule fraction of all of those cell types in the animal.

Thanks for the answers so far.

posted by dendrite at 7:45 PM on April 22, 2007


It's not entirely clear from the question what your null hypothesis is. Assuming that the null hypothesis is "There is no difference in the proportion of N+ cells between Population A and Population B" the correct test is a Fisher Exact Test. You cannot use a Chi-square for this data because some cells have counts of 0, which violates the assumptions for this test (see here).

Plugging your numbers into the first site linked shows that the test rejects the null hypothesis with p < .004.
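The same test can also be run locally rather than through a web calculator. A sketch using `fisher.test` from base R, with the 2x2 table laid out as in the `chisq.test` call earlier in the thread (population A in the first column, population B in the second). It assumes every cell is an independent observation, and the row and column labels are added here only for readability:

```r
# 2x2 table: rows are neato+ / neato-, columns are population A / B.
counts <- matrix(c(39, 635 - 39, 0, 109), ncol = 2,
                 dimnames = list(c("neato+", "neato-"), c("A", "B")))

# Two-sided Fisher exact test of "same neato+ proportion in A and B".
ft <- fisher.test(counts)
print(ft$p.value)
```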

posted by myeviltwin at 8:53 PM on April 22, 2007


If you're at a university, you probably can get statistical help on campus.

posted by mandymanwasregistered at 9:03 PM on April 22, 2007


So is what you're asking: How confident can I be that there really are fewer cells-of-interest in group B, and that it's not just a quirk of how the tissue samples were drawn?

If that's all you want to know, a chi-square is a fine start.

If this is at all serious, you'll want to take clustering or the hierarchical nature of the data into account, even in something as simple as a chi-square. That is, presumably your first group doesn't have one cell each from 635 different subjects, but rather has some varying number of cells from some number N of subjects. What this means is that you don't really have 635 independent observations, you have N observations. There are ways to deal with this -- multilevel modeling or hierarchical linear modeling -- that have ANOVA / chisquare components as well as multivariate components.

---DANGER WILL ROBINSON---

From your most recent response, it sounds like you had one slide with 39 of 635 cells that were interesting, and one other slide with 0 of 109 cells that were interesting. If this is so, you only have two data points, and you can't do anything useful with that. You need to count cells in a whole damn bunch more slides.

posted by ROU_Xenophobe at 10:19 PM on April 22, 2007


myeviltwin: ISTR that chisquares can deal with zeroes that are a simple result of sampling (there is positive probability of the cell having at least 1, but it happens not to), but not with "structural" zeroes where there is zero probability of the cell being filled. But I spend so much time in the regression-oriented world that I would not trust myself very far.
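For what it's worth, the usual rule of thumb for the chi-square approximation concerns the expected counts under the null hypothesis, not the observed ones. A quick sketch of how to inspect them in R, reusing the table from the earlier `chisq.test` call (the threshold of 5 is the common rule of thumb, not something stated in this thread):

```r
counts <- matrix(c(39, 635 - 39, 0, 109), ncol = 2)

# Expected counts under "same proportion in both populations".
ct <- chisq.test(counts)
print(ct$expected)

# Common rule of thumb (an assumption here): every expected count >= 5.
print(all(ct$expected >= 5))
```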

posted by ROU_Xenophobe at 10:28 PM on April 22, 2007


Wait a second. I've looked at my share of microscope slides, and I see a fundamental problem here. How exactly are you counting every cell on your slide? Even if you're looking at stained nuclei, there's only a few cases where the result is definite enough that you have +/- 1 cell accuracy.

If you counted another slide of population A, how would you calculate the frequency? How would you draw the boundaries to count within? You could have sections of different size, too. You need to come up with a cells/cm² metric, count a whole bunch more slides from each population, and then worry about your stats.

posted by Mr. Gunn at 8:02 AM on April 23, 2007


To clarify, there are 635 cells of a particular type on a slide that has 10,000 total cells on it.

posted by dendrite at 12:20 PM on April 23, 2007


Yeah, but you still don't have 635 or 10,000 pieces of independent information (unless your data-generating process is really weird).

Unless biology is really weird or it's somehow the case that each cell has a probability of being interesting that is utterly independent of whether other cells in the body are interesting, you only have as many pieces of information as you have subjects (or organs or samples or whatever). In a CSV, your data would look like this:

ObservationNumber,NumberCellsOverall,NumberInterestingCells

1,635,39

2,109,0

3,3,0

There's just not a lot you can do with this.
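To make that concrete, here is a small sketch that reads those rows into R and computes one proportion per observation. The column names come from the CSV header above; the `Proportion` column and the variable name `obs` are added here for illustration:

```r
# The three observations from the thread, one row per slide/sample.
obs <- read.csv(text = "ObservationNumber,NumberCellsOverall,NumberInterestingCells
1,635,39
2,109,0
3,3,0")

# One proportion per observation: this is the unit of analysis if cells
# within a sample are not independent of each other.
obs$Proportion <- obs$NumberInterestingCells / obs$NumberCellsOverall
print(obs)
print(nrow(obs))
```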

posted by ROU_Xenophobe at 1:12 PM on April 23, 2007


This thread is closed to new comments.

posted by dendrite at 5:34 PM on April 22, 2007