Bimodal or am I biased?
September 23, 2008 7:53 PM
StatisticsFilter: how can I find out whether my data is bimodal?
I have no statistics skills whatsoever, and I would love to have this figured out before I return to campus tomorrow morning. Please help!
Background: I have several sets of 70 numbers each (they represent the lengths of bacterial cells infected with different phages). I want to show that there is a significant difference between the two sets. While my averages look very good, adding error bars negates my findings. For example:
Set 1: Average 6.83, Standard deviation 1.67.
Set 2: Average 4.1, Standard deviation 1.00.
Possible explanation: Let's look at one set at a time. There is a chance that I infected my bacteria with fewer phages than I planned to, so that not all the bacteria were affected--which would effectively split the population sampled in each set into "infected" and "uninfected", and presumably those would have different length distributions. Can I test for this before I repeat my experiment (I plan to do that anyway, but still want to know if my findings are significant at this point)?
Googling taught me that Hartigan's Dip Test is what I need. A stranger kindly posted Matlab functions for the test.
Problem: I have Matlab installed on my computer, but I have never used it and I have no idea what to do at this point. I have a column of values in Excel, and even if I manage to enter it as a set in Matlab (although my attempts so far don't look good), I don't know how to run the test. If you can show me how to run it, how would I interpret the results so that they would be meaningful to me (provided I get Matlab to print them)? Please, please, can you help me with that?
Do you have any other suggestions for what to do to my data (remove outliers? how?) or how to look at it in order to see what's going on? (I can post the dataset somewhere if necessary.) Thanks!
Best answer: Have you actually plotted out and looked at your data? Histogram it in Excel, for instance? Does it look bimodal?
posted by kickingtheground at 8:04 PM on September 23, 2008
Response by poster: Whoa! How didn't I think of that?
posted by halogen at 8:08 PM on September 23, 2008
Significant difference in 2 sets = t-test
t = (mean1 - mean2)/sqrt(var1/n1 + var2/n2)
t = (6.83 - 4.1)/sqrt(1.67^2/70 + 1^2/70) = 2.73/0.23 = 11.73.
11.73 is significant at whatever level you want.
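The arithmetic above can be checked with a short Python sketch. The formula is the unequal-variance (Welch) form of the t statistic. The function name `welch_t` is made up for illustration, and the degrees-of-freedom estimate (the Welch-Satterthwaite approximation) is an addition that is not part of the calculation above.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom from summary stats."""
    v1 = sd1 ** 2 / n1  # squared standard error of mean 1
    v2 = sd2 ** 2 / n2  # squared standard error of mean 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation (an addition, not in the comment above)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary statistics quoted in the question: mean, SD, n for each set
t, df = welch_t(6.83, 1.67, 70, 4.1, 1.00, 70)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Turning t and df into a p-value needs a t-distribution function from a statistics package, which this sketch leaves out.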
posted by milkrate at 8:11 PM on September 23, 2008
Asking whether the data are bimodal is not really the right question, because the data are already separated into two sets. What you want is to see whether there is a significant difference between the two sets, and the way to do that is with a t-test, as milkrate demonstrates.
Do not use Excel for quantitative statistics, although I presume it should be okay for a histogram.
posted by grouse at 8:30 PM on September 23, 2008
Response by poster: Oh, yes, I already ran a t-test and p looks great (3E-23). Actually, while working on the suggested histogram, I think I figured out my error bar problem: instead of entering the standard deviation divided by the square root of the number of samples, I just entered the standard deviation. Suddenly things look a whole lot better!
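The difference between the two kinds of error bar is easy to see with the numbers from the question. This is only a sketch: `standard_error` is a made-up helper name, and the means, standard deviations and n = 70 are the summary values quoted in the question.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return sd / math.sqrt(n)

# (mean, standard deviation, number of cells) as quoted in the question
sets = {"Set 1": (6.83, 1.67, 70), "Set 2": (4.1, 1.00, 70)}

for name, (mean, sd, n) in sets.items():
    sem = standard_error(sd, n)
    print(f"{name}: mean {mean:.2f}, SD bar +/-{sd:.2f}, SEM bar +/-{sem:.2f}")
```

With 70 cells per set the standard-error bars come out roughly eight times narrower than the standard-deviation bars, since sqrt(70) is about 8.4. The standard deviation describes the spread of individual cells; the standard error describes the uncertainty of the mean.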
Thanks for the help--I suspected I wasn't approaching this correctly in the first place.
/me feels embarrassingly stupid, orders a Biostatistics book on Amazon.
posted by halogen at 8:40 PM on September 23, 2008
Response by poster: grouse, I meant to test whether the data are bimodal within each set, since some of the bacteria on each slide may not have been infected at all, and would have a different length distribution (closer to my control set) than the ones that were.
posted by halogen at 8:43 PM on September 23, 2008
grouse - those complaints are about Excel 2000 - but they are still there in Excel 2007.
posted by blahblahblah at 8:49 PM on September 23, 2008 [1 favorite]
don't know how much statistics you do, but if you plan to do this sort of thing on a regular basis, a little R goes a long way.
/me feels embarrassingly stupid, orders a Biostatistics book on Amazon.
No reason to feel stupid. I do this sort of number crunching all the time and still find it hard to wrap my head around some days. Statistics are hard and often non-intuitive, and there's no shame in asking for help.
posted by chrisamiller at 8:50 PM on September 23, 2008
Oh, sorry about that, halogen; I should have read your question more carefully.
In my work, I would use a histogram or density plot to try to find bimodality. Seems like you are already done, but if you really want to do a dip test, there is an R package that allegedly does one.
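For anyone who would rather stay in Python, a density estimate can be reduced to a rough count of peaks. The sketch below is an informal look at the shape of the data, not a dip test and not a significance test. The names `kde` and `count_modes`, the default bandwidth rule and the example lengths are all assumptions made up for illustration.

```python
import math
import statistics

def kde(values, x, bandwidth):
    """Gaussian kernel density estimate at point x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

def count_modes(values, bandwidth=None, grid_points=200):
    """Count local maxima of the density estimate on an evenly spaced grid."""
    if bandwidth is None:
        # Normal-reference rule of thumb; it tends to oversmooth bimodal data.
        bandwidth = 1.06 * statistics.stdev(values) * len(values) ** (-0.2)
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    step = (hi - lo) / (grid_points - 1)
    density = [kde(values, lo + i * step, bandwidth) for i in range(grid_points)]
    return sum(
        1
        for i in range(1, grid_points - 1)
        if density[i - 1] < density[i] >= density[i + 1]
    )

# Made-up cell lengths, NOT the poster's data: a short group and a long group
lengths = [3.8, 3.9, 4.0, 4.0, 4.1, 4.1, 4.2, 4.3,
           6.7, 6.8, 6.9, 7.0, 7.0, 7.1, 7.1, 7.2]
print(count_modes(lengths, bandwidth=0.5))
```

The answer depends heavily on the bandwidth: a small one turns every cell into its own peak and a large one merges everything into one, so it is worth trying a few values and looking at the plot as well.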
posted by grouse at 9:06 PM on September 23, 2008
Response by poster: I have, in the past, written my own simple scripts in python when it came to statistics, but thankfully, I don't have to do this sort of data crunching very often: I am more used to having to answer questions along the lines of "did I get a mutant or not?". I will definitely look into R--anything that keeps me from having to deal with Matlab, which is hideous on Linux anyway. Thanks everybody!
posted by halogen at 9:25 PM on September 23, 2008
You've probably already found it, but there are python bindings for R.
posted by i_am_a_Jedi at 5:38 AM on September 24, 2008
histogram in excel
histogram in matlab
if you're struggling with the computational aspect of it, i can't imagine it would take you longer than 10 minutes to count 70 data points into some bins by hand.
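Counting into bins is also only a few lines of Python. This is a minimal sketch: `bin_counts` is a made-up helper, the bin width of 1.0 is an arbitrary choice, and the lengths are invented example values rather than the poster's data.

```python
import math
from collections import Counter

def bin_counts(values, bin_width, start=None):
    """Count values into equal-width bins; returns {bin_lower_edge: count}."""
    if start is None:
        # Align the first bin to a multiple of the bin width
        start = math.floor(min(values) / bin_width) * bin_width
    counts = Counter(int((v - start) // bin_width) for v in values)
    # Include empty bins so gaps in the data stay visible
    return {start + k * bin_width: counts.get(k, 0) for k in range(max(counts) + 1)}

# Invented example lengths, NOT the poster's data
lengths = [3.2, 3.8, 4.1, 4.4, 4.9, 5.6, 6.7, 7.0, 7.3, 7.8]
bin_width = 1.0

# Text histogram: one '#' per value in the bin
for edge, n in bin_counts(lengths, bin_width).items():
    print(f"{edge:4.1f}-{edge + bin_width:4.1f} | {'#' * n}")
```

Values that sit exactly on a bin edge can land in either neighbouring bin because of floating-point rounding, so with fractional bin widths it is worth spot-checking a few counts by eye.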
posted by sergeant sandwich at 8:04 PM on September 23, 2008