Rates of success?
November 4, 2009 9:46 AM Subscribe
Statistics question: is it possible to test sets of cumulative data for significant differences in rate?
I have three cumulative percentage graphs, measuring the germination rates of three different seed types. Is there a way to compare them and see if there are any statistically significant differences?
The seed types were planted in triplicate, on three dishes each (nine overall). Every day for the past few weeks I've observed how many seeds on each dish have begun germinating -- so for an individual dish I would have " Day 1: 0 ... Day 7: 14 ... Day 14: 29" etc, with each day's score a cumulative total. (There are 100 seeds on each dish, so it works as a percentage rate as well)
In Excel, I've graphed the average germination rates of the replicates, for a graph that resembles this one. (with three lines plotted, and x-axis = time in days, y-axis = percent germinated).
So is there a way to compare these different rates statistically? I can use Excel, Minitab, SPSS, and R.
I am not a statistician (I'm a biologist), but what I think you should do is fit a Generalised Linear Model which includes "germination rate" as your response variable and "days" and "treatment" (plus the interaction between them) as explanatory variables. The model should be specified to be binomial with a logit link to take care of the fact that your values for germination are bounded (between 0 and 100).
Then you can test the significance of the terms in your model by deletion (if the term is significant, then removing it will bugger up your model in terms of residual variation). If the interaction is significant then the effect of "days" depends on the treatment. Which is the same as saying that treatment has an effect.
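A stripped-down illustration of testing a term by deletion, written in Python rather than R and under loudly stated assumptions: it looks at a single day only, pools the three dishes of each seed type (so it ignores dish-to-dish variation), and leaves out the "days" term and the interaction entirely. The counts are made up and `deletion_test` is a hypothetical helper name. With three seed types the change in deviance is compared to a chi-square distribution on 2 degrees of freedom, whose upper tail has the closed form exp(-x/2).

```python
import math


def binomial_loglik(successes, trials, p):
    """Binomial log-likelihood without the constant binomial coefficient."""
    ll = 0.0
    # Skip terms with a zero count so we never evaluate log(0).
    if successes > 0:
        ll += successes * math.log(p)
    if trials - successes > 0:
        ll += (trials - successes) * math.log(1 - p)
    return ll


def deletion_test(germinated, planted):
    """Compare a model with a seed-type term against one without it.

    Returns (deviance_change, degrees_of_freedom, p_value).
    """
    # Full model: every seed type gets its own germination probability.
    ll_full = sum(binomial_loglik(g, n, g / n)
                  for g, n in zip(germinated, planted))
    # Reduced model: seed-type term deleted, one shared probability.
    p_pooled = sum(germinated) / sum(planted)
    ll_reduced = sum(binomial_loglik(g, n, p_pooled)
                     for g, n in zip(germinated, planted))
    deviance_change = 2 * (ll_full - ll_reduced)
    df = len(germinated) - 1
    if df != 2:
        raise ValueError("the closed-form p-value below assumes 3 groups (df = 2)")
    p_value = math.exp(-deviance_change / 2)  # chi-square upper tail, df = 2
    return deviance_change, df, p_value


# Hypothetical day-14 totals per seed type (3 dishes x 100 seeds, pooled).
germinated = [87, 120, 150]
planted = [300, 300, 300]
dev, df, p = deletion_test(germinated, planted)
print(f"deviance change = {dev:.2f} on {df} df, p = {p:.3g}")
```

A large deviance change (small p) means the model gets noticeably worse when the seed-type term is removed. The full version described above would do the same comparison for the days-by-treatment interaction in a fitted GLM.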
This is easy to do in R. I will memail you some hastily put together code containing a worked example that you should be able to cut and paste into the R console. If you want to read up on this then have a look at Chapter 16 in The R Book by MJ Crawley, which features some similar examples (the relevant sections are on Google Books I think).
I've tried to post it here but the system mangles it...
posted by jonesor at 11:08 AM on November 4, 2009
MeMail also mangles the code - I think it's trying to strip out what it thinks is HTML.
Here's a link to a Google Doc instead.
posted by jonesor at 11:12 AM on November 4, 2009
Much depends on the seriousness of purpose behind this. In general, if this is for an academic or industrial purpose, you should find out how your discipline deals with this problem and do it that way.
Otherwise, anytime you can define a null process sufficiently well, you can test against it.
What are you trying to find out?
I mean, I know that you're trying to compare lines like in the figure. Why? What are you trying to learn? Are you trying to learn whether one seed type has a higher or lower failure rate than another? Then the graph probably doesn't matter, and only the endpoint of "How many seeds didn't germinate?" does. Are you trying to learn whether there's a seed-type difference in how many seeds have germinated at a particular point in time? Then the other time points aren't relevant. Are you trying to learn whether seed type A has a steeper slope up than seed type B? Then you might throw away the graphs and instead model time-to-germination.
How you want to deal with this will depend on what you're actually trying to learn. It might be as simple as a half-assed difference-of-proportions t-test out of 300 seeds per type. Or, if you have individual seed-level information on germination, you could run a survival or duration model.
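A rough sketch of the quick difference-of-proportions comparison for two seed types at the endpoint, done here as a pooled two-sided z-test (the usual large-sample form of this test). The counts are invented and `two_proportion_z_test` is a hypothetical helper name. It treats all 300 seeds of a type as independent, which is exactly the shortcut that makes it half-assed.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for a difference between two proportions.

    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    # Under the null both groups share one proportion, so pool them.
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical final counts: seeds germinated out of 300 per seed type.
z, p = two_proportion_z_test(87, 300, 120, 300)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With three seed types you would be running three such pairwise comparisons, so some adjustment for multiple testing is worth considering.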
The simplest thing to do would be, for each day, to find the 95% or 90% confidence interval around the proportion of seeds that have germinated, and plot those intervals on your graph.
Doing this half-assed is as simple as calculating sqrt(proportiongerminated * proportionnotgerminated/300) for each seed type each day, which would be your standard error. Then for a 95% or 90% CI, your resulting CI is just your sample proportion of germinated seed +/- 1.96 or 1.645 standard errors, respectively. Doing this non-half-assed would mean taking into account that the samples at time T depend on the samples at time T-1, and that the seeds on a given tray aren't fully independent of each other, and so on.
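The quick version of that calculation could look something like this in Python. The daily counts are made up and `proportion_ci` is a hypothetical helper name. It assumes the three dishes of a seed type are pooled into 300 independent seeds.

```python
import math


def proportion_ci(germinated, total=300, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    z = 1.96 gives a 95% interval, z = 1.645 a 90% interval.
    """
    p = germinated / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of the proportion
    return p - z * se, p + z * se


# Hypothetical cumulative counts for one seed type, keyed by day.
cumulative = {1: 0, 7: 42, 14: 87, 21: 150}
for day, count in cumulative.items():
    low, high = proportion_ci(count)
    print(f"day {day}: {count / 300:.3f} ({low:.3f} to {high:.3f})")
```

Note that the interval collapses to zero width when nothing (or everything) has germinated, which is one of the known weaknesses of this approximation near 0% and 100%.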
posted by ROU_Xenophobe at 11:15 AM on November 4, 2009
Kolmogorov-Smirnov test. Scroll down the page to the two-sample K-S test for a description. Alternately, a cursory googling shows that someone once came up with a three-sample K-S test.
This will only tell you whether the distributions are different, not how they are different (i.e., if one germinates faster than another).
posted by logicpunk at 11:39 AM on November 4, 2009 [1 favorite]
It's unfortunate that dish is nested within seed type (each dish holds only one type), since it's possible that any "seed type" effect you observe could actually be a difference in the dishes.
If you can assume independence between seeds in the same dish (what the other seeds in the same dish have done tells you nothing about what this seed is doing other than the average for seeds of that type), then this sounds like a pretty standard survival analysis problem since a seed can only germinate once and is then out of the "risk set" for germinating. Survival data analysis can be a little involved, but something like a 3 sample log-rank test might do what you want, although you probably have many tied event times. I have no idea if the independence assumption is reasonable for seeds in a dish. You can do survival analysis in R. In your data, all seeds that sprout would have "died" and there would be a row corresponding to each germination. Just have multiple rows, 1 for each seed that germinated on a particular day. There may be parametric models (weibull, etc) which would fit your data; it's impossible to know without seeing it.
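To make the data layout concrete, here is a sketch in Python (not R) that turns one dish's cumulative counts into one row per seed and then runs a log-rank test. Several things here are assumptions rather than anything from the experiment: the counts are invented, the function names are hypothetical, seeds that never germinated are added as censored rows at the last observation day, and the test is the two-sample version (the three-sample one needs a bit of matrix algebra). The p-value uses the chi-square distribution on 1 degree of freedom.

```python
import math


def cumulative_to_rows(cumulative, seeds_per_dish=100):
    """Turn daily cumulative counts (index 0 = day 1) into (day, event) rows.

    event = 1 means the seed germinated on that day; event = 0 means it was
    still ungerminated at the last observation (censored).
    """
    rows, previous = [], 0
    for day, total in enumerate(cumulative, start=1):
        newly = total - previous
        if newly < 0:
            raise ValueError("cumulative counts must not decrease")
        rows.extend([(day, 1)] * newly)
        previous = total
    rows.extend([(len(cumulative), 0)] * (seeds_per_dish - previous))
    return rows


def logrank_two_sample(rows_a, rows_b):
    """Two-sample log-rank test. Returns (chi_square, p_value)."""
    event_days = sorted({d for d, e in rows_a + rows_b if e == 1})
    obs_minus_exp, variance = 0.0, 0.0
    for day in event_days:
        # Seeds still ungerminated just before this day are "at risk".
        n_a = sum(1 for d, _ in rows_a if d >= day)
        n_b = sum(1 for d, _ in rows_b if d >= day)
        d_a = sum(1 for d, e in rows_a if d == day and e == 1)
        d_b = sum(1 for d, e in rows_b if d == day and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance, which accounts for tied event times.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2))  # chi-square tail, df = 1
    return chi_square, p_value


# Hypothetical cumulative counts for one dish of each of two seed types.
rows_a = cumulative_to_rows([0, 0, 20, 20, 50, 50, 60])
rows_b = cumulative_to_rows([0, 0, 5, 5, 20, 20, 40])
chi_square, p = logrank_two_sample(rows_a, rows_b)
print(f"chi-square = {chi_square:.2f}, p = {p:.4f}")
```

This still leans on the independence assumption discussed above, since every seed is treated as its own observation regardless of dish.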
ROU is of course right that this only makes sense if questions like "how quickly do they germinate" are what is interesting. If what one seed has done impacts what the other seeds in the dish are doing, the problem is more complicated. (could you model the hazard in terms of "dish density?"). (could you throw in a random effect for dish? There exists a mixed proportional hazards regression package for R, but who knows if it works).
posted by a robot made out of meat at 11:44 AM on November 4, 2009
Kolmogorov-Smirnov, as logicpunk suggested, is exactly the correct test for answering the question 'is the distribution of time-to-germination different for seeds in dish A vs dish B?' Note that if the statistical answer is 'not enough evidence to reject the null hypothesis of identical distributions,' you are OK. However, if you have enough evidence to reject the null, it doesn't tell you in what ways the distributions are different and if those differences are something you care about.
The basic idea behind KS is as follows. Let's say you want to ask the question "After 3 days, is the number of sprouted plants in A significantly different than in B?" This would correspond to testing for the equality of means of binomials. Now, instead of fixing the time at 3 days, also ask the question for 4 days, 5 days and 3.127 days. You are essentially asking a bunch of related questions, so doing many binomial-mean tests would be incorrect -- but the Kolmogorov-Smirnov test essentially corrects for this.
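In terms of the cumulative graphs, the two-sample KS statistic is just the largest vertical gap between two curves over all the days. A minimal Python sketch with invented counts (`ks_statistic` is a hypothetical helper name, and it assumes both seed types were scored on the same days):

```python
def ks_statistic(cum_a, cum_b, n_a=300, n_b=300):
    """Largest absolute gap between two cumulative germination curves.

    cum_a and cum_b are cumulative counts on the same sequence of days;
    n_a and n_b are the numbers of seeds planted for each type.
    """
    if len(cum_a) != len(cum_b):
        raise ValueError("both series must cover the same days")
    return max(abs(a / n_a - b / n_b) for a, b in zip(cum_a, cum_b))


# Hypothetical cumulative counts per day for two seed types.
type_a = [0, 12, 42, 87, 130, 150]
type_b = [0, 3, 20, 60, 110, 148]
print(f"KS statistic D = {ks_statistic(type_a, type_b):.3f}")
```

This only gives the statistic. Turning it into a p-value is a separate step, and the usual tables assume continuous data.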
Model-based tests, as suggested by jonesor (and others, I think) will be less correct if the data does not fit your modeling assumptions (and data almost never fits assumptions). Furthermore, the germination rate at different times is not an independent process, so you have to be very careful in how you fit your model or you'll get complete garbage out.
posted by bsdfish at 3:18 PM on November 4, 2009
Since a couple of people have said KS test, I have an opinion on using generic GOF tests when something more specific might be appropriate:
1) unmodified KS is inappropriate if you have censoring (you stop the experiment before all the seeds have germinated); your package may not deal with that (the generic R doesn't)
2) the p-values need to be evaluated by permutation since the data are discrete; your package again may or may not do that
3) KS provides no useful parameters or interpretation beyond "different / not different". It's really nice to be able to say "the median germination time for group 1 was X (95%CI A-B) and for group 2 Y (C-D), a difference significant at the p=SMALL level." As ROU pointed out, you want to be able to tell what kind of difference you've detected. If group 1 germinates way better on day 1, but by day 2 the difference is gone, do you care? If the difference is a few outliers, do you care? If you have a really precise experiment, you want to know if the difference is important or not, not just if you can detect it, since the EXACT null hypothesis is approximately never true.
4) As a corollary to the above, a KS test has no obvious way of saying if the difference is consistent across plates or not. I suppose that you could evaluate a p-value on (dish 1) vs (all others group 2), but again I have no way of interpreting that.
5) The KS test can be relatively efficient or terrible compared to the specific alternative depending on what you're looking for. You can't beat having an idea of what you're looking for to start with!
I think that non-parametric tests like KS are fine to start with; after all, you'd definitely begin the analysis by looking at the cumulative distributions. They're just not a good end point, mostly because of 3. Somebody's going to ask you what that test statistic means.
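For point 2, a permutation p-value for the KS statistic can be sketched as below. This is an illustration in Python with made-up data, not a recommendation of a particular package. The function names are hypothetical, and coding never-germinated seeds with a sentinel day of 99 is a simplifying assumption that only makes sense when every dish is watched for the same number of days. It also shuffles individual seeds, so it inherits the independence assumption.

```python
import random


def ks_from_days(days_a, days_b, grid):
    """Largest gap between the two empirical CDFs over the days in grid."""
    best = 0.0
    for t in grid:
        f_a = sum(d <= t for d in days_a) / len(days_a)
        f_b = sum(d <= t for d in days_b) / len(days_b)
        best = max(best, abs(f_a - f_b))
    return best


def permutation_p_value(days_a, days_b, n_perm=2000, seed=0):
    """Permutation p-value for the two-sample KS statistic.

    Returns (observed_statistic, p_value).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    grid = sorted(set(days_a) | set(days_b))
    observed = ks_from_days(days_a, days_b, grid)
    pooled = list(days_a) + list(days_b)
    hits = 0
    for _ in range(n_perm):
        # Reassign seeds to the two groups at random.
        rng.shuffle(pooled)
        stat = ks_from_days(pooled[:len(days_a)], pooled[len(days_a):], grid)
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    # Add-one correction keeps the estimated p-value away from exactly 0.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical germination day per seed; 99 = had not germinated by the end.
days_a = [2] * 20 + [5] * 10
days_b = [2] * 5 + [5] * 10 + [99] * 15
d_stat, p = permutation_p_value(days_a, days_b)
print(f"D = {d_stat:.3f}, permutation p = {p:.4f}")
```

Even with a sound p-value, this still only answers "different or not", which is the limitation described in point 3.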
posted by a robot made out of meat at 10:45 AM on November 5, 2009
Response by poster: I'm a little delayed in putting this stuff together, but thanks for your responses!
I think ROU_Xenophobe had the best suggestion re: checking the literature. I don't know why that wasn't my first thought. Most of the info I've found has been pretty relevant.
But I'll also be trying Kolmogorov-Smirnov and jonesor's Generalised Linear Model code -- I have a copy of The R Book bouncing around, so if I need more help on that end I should be set.
Thanks again!
posted by rollick at 1:26 PM on November 9, 2009
Generally speaking, this is the question you want to ask before you begin the experiment.
Is there a control group in the experiment? What should a control group look like? What does a group of 100 RANDOM seeds look like over the same timeframe?
Why did you separate each batch into 3 separate dishes? What information does this tell you?
What is the mix of the 3 separate breeds?
Were they allowed to cross-pollinate? (is this a consideration?)
Is there a variation in soil, temp, exposure to light, fertilizer, water, etc?
I would not feel comfortable making any statements regarding two seeds interacting without knowing how they perform under varying conditions and establishing a baseline performance for each.
Right now though, you have two variables: Time and Rate. You could try to see if there is a correlation between the two. You could see if there is a linear association between the two... you could do some regression modeling...
I think what you wanted to do was some Analysis of Variance (ANOVA), but I'm not sure what you might look at.... maybe snapshots of Week 1, Week 2, Week 3, etc and breed? I'm reaching here...
posted by Nanukthedog at 10:34 AM on November 4, 2009