Calculating a binomial distribution confidence interval for a particular experiment
March 12, 2012 12:28 PM
[StatisticsFilter] I’m trying to understand the extent to which the statistics of a particular experiment can be evaluated, and trying to relate that to a 95% confidence interval. The experiment is considered “a standard”, but I want to keep the discussion as “high level” as I can so that we focus on only the relevant details. Any help or pointers in the right direction are greatly appreciated.
Currently, we test a component design to see if it will fail under certain conditions. Three “test articles” of the component design (or units off of an assembly line) are tested, and that test is repeated 20 times on each test article (which is the requirement of the standard). The results are classified as “pass” or “fail”. A component passes the test standard when all three test articles pass the 20 repeat tests with no observed failure. The test is a standard test, and it is important that we understand the statistical limitations of the test. What I’m asking is this: please help me understand the statistical limitations of the test. I’m not looking for advice on how to improve the test standard, which is what it is.
My understanding, based on my limited knowledge of statistics, is as follows. Since the results can be classified as pass or fail, the distribution model would be a binomial distribution. If a single test article fails 0 times out of the 20 repeat tests, then I can estimate the failure rate as 0/20, or p ≈ 0. Further, I can estimate a 95% confidence interval, and using the Clopper-Pearson “Exact CI” method I have estimated an interval of 0 to 0.168. Other confidence interval methods give about the same value. I have interpreted this 95% confidence interval to mean that with the limited number of repeat tests done on this particular test article I can conclude that the test article has a failure rate < 16.8% the vast majority of the time (19 times out of 20). Do I have the correct understanding so far?
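To make those numbers reproducible, here is a minimal Python sketch (assuming scipy is available); it also covers the ~6% bound for 60 pooled tests discussed below:

```python
# Clopper-Pearson ("exact") two-sided CI for x failures in n trials,
# via the standard Beta-quantile identity.
from scipy.stats import beta

def clopper_pearson(x, n, alpha=0.05):
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lo, hi

print(clopper_pearson(0, 20))  # (0.0, 0.168...) -- the 16.8% above
print(clopper_pearson(0, 60))  # (0.0, 0.0596...) -- the ~6% for 60 tests
```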
Since I then test 3 test articles of the same component design and none of the test articles fail, I can say with 95% confidence that each test article has a failure rate of < 16.8%. Another way that I would like to say this is that my worst-case failure probability of a test article was 16.8% with 95% confidence (can I say it that way?).
The bigger question is what all of this says about my component design. If I assume that my three test articles were identical, then I essentially did the equivalent of testing 1 test article 60 times. The 95% confidence interval upper bound I get for that would be < 6%. But that is not how the experiment is set up, and I don’t think the two cases are the same.
One thought a colleague had was that since each test was independent, the probabilities of each test could be multiplied together (like the probability of rolling three six-sided dice and getting all ones). This results in about 1:250 or 0.5% failure rate. That is a much smaller 95% confidence limit than the 1 test article tested 60 times, and I find it questionable, but I’m not sure.
Thank you for your time.
If you're only ever looking at the cases with zero failures, you're interested in the hypothesis that P[Bin(n; θ) = 0] ≥ 0.05, so I think a one-tailed test would be appropriate. That means your CI is only [0, 13.9%]. But this should only tell you the confidence interval for that article, not your whole production run.
You get more data from testing three articles 20 times each than one article 60 times.
posted by grouse at 1:17 PM on March 12, 2012 [1 favorite]
"I have interpreted this 95% confidence interval to mean that with the limited number of repeat tests done on this particular test article I can conclude that the test article has a failure rate < 16.8% the vast majority of the time (19 times out of 20). Do I have the correct understanding so far?"
Not quite. The proper interpretation is that there is a 95% chance that the CI contains the true failure rate.
"One thought a colleague had was that since each test was independent, then the probabilities of each test could be multiplied together (as like the probability of rolling three six sided dice and receiving all ones could be multiplied together)."
If the assumption of independence is correct, then this is true. This is one way of calculating an exact P-value for a binomial distribution.
"This results in about 1:250 or 0.5% failure rate."
This doesn't sound right to me. How did you get that result?
posted by mikeand1 at 1:25 PM on March 12, 2012 [1 favorite]
"If you're only ever looking at the cases with zero failures, you're interested in the hypothesis that P[Bin(n; θ) ≥ 0] ≥ 0.05, so I think a one-tailed test would be appropriate. That means your CI is only [0, 13.9%]: "
I agree with this. Also, note how this corresponds to the second part of your question: You get the same result simply by multiplying the probabilities together.
E.g., suppose the probability of non-failure is 1 - 13.9% = 0.861. Note that 0.861^20 ≈ 5%, corresponding to the upper bound of the 95% confidence interval.
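For anyone following along, the same arithmetic in a couple of lines of Python (standard library only):

```python
# One-sided 95% upper bound on the failure rate after 0 failures in n tests:
# solve (1 - p)**n = 0.05 for p.
n = 20
p_upper = 1 - 0.05 ** (1 / n)
print(p_upper)             # 0.1391... -- the 13.9% bound
print((1 - p_upper) ** n)  # 0.05 -- sanity check
```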
posted by mikeand1 at 1:29 PM on March 12, 2012 [1 favorite]
I've been thinking about how to model your ultimate problem a little more.
In terms of coin tosses, the binomial distribution describes the probability of k successes from flipping a coin n times, or from flipping n identical coins once each. But you want the probability of k1, k2, k3 successes from flipping 3 supposedly identical coins m times each. I don't know how to model that. Hopefully a statistician who does will pop in.
But in reality, the success rates of items off the assembly line are unlikely to be independent of each other. It is more likely to be some sort of Markov process, where once something about the line changes, all the subsequent articles are likely to be affected in a similar way. I don't know anything about your testing process, but it seems possible to me that the results of one test will affect the next as well, as the article is increasingly stressed during testing.
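As a rough illustration of this clustering concern (all the numbers below are assumed purely for the sake of the sketch, not taken from any real process):

```python
# Sketch: if the per-article failure rate varies from article to article
# (here drawn from a Beta distribution -- an assumed toy model), then
# "0 failures in 3 x 20 tests" happens more often than the plain binomial
# calculation suggests, so it is weaker evidence of a low failure rate.
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000
mean_rate = 0.05  # assumed average failure rate across articles

# IID world: every one of the 60 tests is an independent draw.
iid_pass = (rng.binomial(60, mean_rate, trials) == 0).mean()

# Clustered world: each article gets its own rate, same overall mean.
rates = rng.beta(0.5, 9.5, (trials, 3))  # mean 0.05, widely spread
clustered_pass = (rng.binomial(20, rates) == 0).all(axis=1).mean()

print(iid_pass, clustered_pass)  # clustered passes far more often
```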
So I'm not sure how well any of the assumptions behind a binomial model fit what is going on here. I'd look for a textbook on statistical process control if you want more explanation for this.
posted by grouse at 1:39 PM on March 12, 2012 [1 favorite]
Response by poster: @mikeand1 - I went with a 0.168*0.168*0.168 ~ 0.0047. Is that not right?
Thanks for all the insight. I see how a one-tailed approach would be better. I'm still digesting everyone's posts...
posted by nickerbocker at 1:49 PM on March 12, 2012
Response by poster: "I don't know anything about your testing process, but it seems possible to me that the results of one test will affect the next as well, as the article is increasingly stressed during testing."
@grouse - The results of one test performed on a test article would not affect the testing done on the next test article. I will look up Markov processes and see what they're about.
I guess what I'm after is: if a component passes my testing (no failures, because I had 0 failures in 20 tests for 3 separate test articles), what is the most I can say about the component in a statistical sense? I can't just say it passes, therefore it won't ever fail (obviously). But is there a way to say: this thing passes, and I would expect, based on this limited amount of data, that the failure rate lies somewhere within this interval?
posted by nickerbocker at 1:56 PM on March 12, 2012
Best answer: The results of one test performed on a test article would not affect the testing done on the next test article.
That's a response to a different concern from the one you quoted. But I'm not saying that the testing process itself applied to one article will affect the next article. I'm saying that the results are not independent of each other. For example, a machine on the assembly line might become misaligned halfway through production, and every article produced after that point will fail the tests. That means the success rates of the articles are not independent, and so the assumptions behind a binomial test fail (they also fail because you are sampling test articles without replacement).
posted by grouse at 2:02 PM on March 12, 2012 [2 favorites]
Best answer: "I went with a 0.168*0.168*0.168 ~ 0.0047. Is that not right?"
No - that is a consequence of your misinterpreting the meaning of the confidence interval.
The easiest solution is, again, just to multiply the probabilities together. You have an instance of 60 non-failures. Note that 0.9513^60 ≈ 5%. The upper bound of your CI for the probability of failure is then 1 - 0.9513 = 4.87%.
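The same one-liner as before with n = 60, for anyone checking the arithmetic:

```python
# One-sided 95% upper bound after 0 failures in 60 pooled tests.
n = 60
p_upper = 1 - 0.05 ** (1 / n)
print(round(p_upper, 4))  # 0.0487 -- the 4.87% above
```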
posted by mikeand1 at 2:06 PM on March 12, 2012 [1 favorite]
Response by poster: Hi mikeand1 - So do you interpret the 20 non-failures of 3 test articles the same as 60 non-failures? Does it make no difference then, statistically, how I get the 60 non-failures? I.e., would 5 test articles tested 12 times each be equivalent as well?
Thanks again for your time.
posted by nickerbocker at 2:16 PM on March 12, 2012
Best answer: "So do you interpret the 20 non-failures of 3 test articles the same as 60 non-failures? Does it make no difference then, statistically, how I get the 60 non-failures? I.e., would 5 test articles tested 12 times each be equivalent as well?"
Assuming the result of each test is IID (independent and identically distributed), yes.
posted by mikeand1 at 2:35 PM on March 12, 2012 [1 favorite]
Best answer: Assuming the result of each test is IID (independent and identically distributed)
But that is surely a bad assumption here. And it's the reason why your standard makes you test three separate articles rather than one sixty times.
posted by grouse at 2:37 PM on March 12, 2012 [1 favorite]
The bigger question is what all of this says about my component design.
Okay, what this means to me is that you're interested in how often this design will fail over the population of produced articles. If you produce 100,000 articles, how many will fail over the relevant period of time?
First: grain of salt time. This sort of industrial control application may be one of those particular things that gets complex and introspective really, really fast, so you should probably really talk to someone in that area.
Grain of salt taken... Your goal means two steps, because there are two sources of variability here. One is in the within-article failure rate -- how often does an examined article fail? You've been estimating that. At a guess, you might benefit from testing until failure. At any rate, this lets you say that the failure rate for article 1 is X +/- CI, and that the failure rate for article 2 is Y +/- CI (or, if failure means the article breaks, time to failure is X or Y +/- CI). At another guess, you might find it useful to refer to the raw underlying measurements, if those are good, rather than the pass/fail.
But not all articles will be the same, especially not if they're put into mass production. What you really want to find is what the average failure rate is and how failure rates are distributed. What you need to do that is test more articles, and there is absolutely no substitute for that. If you want to make inferences about the population of articles, you need to have a reasonable sample from that population.
I essentially did the equivalent of testing 1 test article 60 times.
No, you measured three articles twenty times each.
One thought a colleague had was that since each test was independent, the probabilities of each test could be multiplied together (like the probability of rolling three six-sided dice and getting all ones).
The tests aren't independent. They are clustered in three articles.
posted by ROU_Xenophobe at 2:43 PM on March 12, 2012 [3 favorites]
How are we (the readers, that is) supposed to know if the tests are independent or not?
We know nothing about what these devices are, or what the nature of the test is.
For the OP: Since you are the one who knows more about what these devices are, you might think about whether the tests are independent or not. Two events, A and B, are statistically independent if and only if Prob(A given B) = Prob(A) and Prob(B given A) = Prob(B).
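(A toy example: roll one fair die, and let A = "the roll is even", so Prob(A) = 1/2, and B = "the roll is at most 2", so Prob(B) = 1/3. Then Prob(A given B) = Prob(roll = 2, given the roll is at most 2) = 1/2 = Prob(A), so A and B are independent.)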
It's somewhat more reasonable to think that two tests on two different devices are independent. But as someone pointed out, what if repeatedly testing one device makes it more likely to fail by stressing it over time?
posted by mikeand1 at 3:35 PM on March 12, 2012
Best answer: "The tests aren't independent. They are clustered in three articles."
So what? Take three fair coins, and flip each of them 20 times. The results should be no different, statistically, than flipping one coin 60 times.
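A quick numpy sketch of that claim, assuming the coins really are identical and the flips independent:

```python
# Under IID assumptions, 3 coins x 20 flips and 1 coin x 60 flips give the
# same distribution of total successes: both are Binomial(60, p).
import numpy as np

rng = np.random.default_rng(1)
p, trials = 0.5, 100_000
three_coins = rng.binomial(20, p, (trials, 3)).sum(axis=1)
one_coin = rng.binomial(60, p, trials)
print(three_coins.mean(), one_coin.mean())  # both ~30
print(three_coins.var(), one_coin.var())    # both ~15 (= 60 * p * (1 - p))
```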
posted by mikeand1 at 3:37 PM on March 12, 2012
Best answer: With a coin metaphor, the whole point of this would be to figure out whether they are fair or not, with a suspicion that they may not actually be fair coins. So an analogy to three fair coins doesn't work.
posted by grouse at 3:43 PM on March 12, 2012
As long as the three coins are identical, it doesn't matter.
posted by mikeand1 at 3:52 PM on March 12, 2012
"But they aren't."
Well I don't know how you know this, but the OP asked us to assume they are when he said:
"If I assume that my three test articles were identical..."
posted by mikeand1 at 4:00 PM on March 12, 2012
Best answer: You're right, he did ask us to assume that, but I think that is an unreasonable assumption for any environment where this kind of testing is needed.
posted by grouse at 4:08 PM on March 12, 2012
Response by poster: It's somewhat more reasonable to think that two tests on two different devices are independent. But as someone pointed out, what if repeatedly testing one device makes it more likely to fail by stressing it over time?
I completely agree that the assumption that these three test articles are identical is a false one. My interest in stating that they were identical was merely to isolate the discussion to the test methodology and the statistics produced from it (i.e., if the test articles are identical, is there any real difference between testing 1 article 60 times vs. 3 test articles 20 times each?).
It is also highly likely that the test methodology introduces an increasing amount of stress with each cycle.
Now, as far as "what are these components" goes, we don't ever know. We just do the test per the standard.
So if I can get any input on the treatment of non-identical test articles (the extent of which I do not know) and the fact that each successive test probably increases the likelihood of failure (the extent of which I do not know), is there any way to handle this statistically?
posted by nickerbocker at 4:27 PM on March 12, 2012
I don't think you can say much statistically. The repeated tests on the test article change the probability of failure with each successive test, so the binomial (number of successes in n trials) or negative binomial (number of successes until a failure is observed) won't apply because they assume that the probability of failure is constant with each repeated test.
It seems like the proper thing would be to do some kind of accelerated failure time analysis, but to do this you would have to test more articles (20?) and test them until they break (or some number such that most of them break). However, this is not the standard procedure you were given and is well outside the scope of your question.
At this point, I think all you can say is that you tried 3 articles, and none of them failed under 20 repeated tests.
posted by everythings_interrelated at 6:05 PM on March 12, 2012 [1 favorite]
For reference, you might want to check out this chapter from the NIST Engineering Stats handbook on process control and testing strategies: http://www.itl.nist.gov/div898/handbook/pmc/section1/pmc1.htm
posted by scalespace at 6:50 PM on March 12, 2012 [1 favorite]
Best answer: Arrgh: This chapter from the NIST handbook... (apologies for the borked HTML)
posted by scalespace at 6:52 PM on March 12, 2012 [1 favorite]
It may be useful to frame the question in the following way:
What can I say about the probability of any given outcome (or collection of outcomes) if I were to continue measuring different articles? I.e., what is the probability, if I measure a fourth article 20 times, that it will fail at least once? What about a fifth article? Another 500?
At some point you will have to introduce some assumptions--you've already done so with the binomial distribution. Maybe those assumptions are justified through theory or experiment, or maybe it's a fancy Bayesian model with a Jeffreys prior, or maybe your gut just tells you this should work. Just make sure that you've done something reasonable and that your model does a good job describing some independent data. And if it doesn't, figure out why.
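As one illustration of that kind of model (a sketch only: the Jeffreys prior is just one possible choice, and it leans on the shaky IID assumption discussed above, pooling all 60 tests):

```python
# Posterior predictive under a Jeffreys Beta(0.5, 0.5) prior.
# 0 failures in 60 tests -> posterior Beta(0.5, 60.5) on the failure rate.
# P(a fourth article passes all 20 tests) = B(0.5, 60.5 + 20) / B(0.5, 60.5).
import math
from scipy.special import betaln

a, b, m = 0.5, 60.5, 20
p_all_pass = math.exp(betaln(a, b + m) - betaln(a, b))
print(p_all_pass)  # ~0.87, i.e. roughly a 13% chance of at least one failure
```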
Do keep in mind the following, though: with only three samples your predictions are very likely to be wrong no matter what you do.
posted by dsword at 7:01 PM on March 12, 2012 [1 favorite]
Response by poster: Thanks for all the discussion, guys. I'm still wrapping my head around this.
One quick question though: I'm pretty sure that the "test standard" I have been referring to makes the assumption that each test cycle is "identical". We have data that shows the contrary: by following the test standard, each additional cycle adds heat to the test article, which would probably result in a condition that makes it more likely to fail.
That being said, if a test article passes all 20 cycles, would assuming that each cycle was the same result in a "more conservative" estimate of the statistics? Not sure if that is the right way to say that.
Anyway, I've marked the answers that I felt challenged my thinking the most. I appreciate everyone's input into the discussion and thank you for your time.
posted by nickerbocker at 9:00 AM on March 13, 2012
This thread is closed to new comments.