Statistics Filter: Permutations and large data sets
November 12, 2008 2:28 PM
I have 200 bins, with 1 million data points in each. Each data point takes a value from 0 to 10, and we can assume that the values within each bin are roughly normally distributed. If I calculate a sum by drawing one random data point from each bin and adding them, what value does that sum need to reach before I can say that it's higher than 95% of the other possible sums (with reasonable probability)?
The brute-force way to do this is to calculate all possible sums, sort them, then find the value 95% of the way through the list. Obviously, this won't work, since the number of possible combinations is astronomical: (10^6)^200 = 10^1200. So what's the appropriate way to approximate this?
I know basic R, along with several scripting languages, and have access to a fair amount of computing power, so pointers to code or libraries are welcome as well.
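To make the question concrete, here is a minimal sketch of two standard ways to estimate that cutoff without enumerating every sum. It is written in Python with only the standard library, and it is an illustration under assumptions, not a definitive answer. The function names (`make_bins`, `clt_threshold`, `monte_carlo_threshold`) and the synthetic data are hypothetical stand-ins for the real bins. The first approach assumes the draws from different bins are independent, so the sum is approximately normal with mean equal to the sum of the bin means and variance equal to the sum of the bin variances. The second simply samples many random sums and reads off the empirical 95th percentile.

```python
import random
import statistics


def make_bins(n_bins=200, n_points=500, seed=0):
    """Build hypothetical stand-in data: roughly normal values clipped to [0, 10]."""
    rng = random.Random(seed)
    bins = []
    for _ in range(n_bins):
        mu = rng.uniform(3, 7)  # assumed per-bin mean, for illustration only
        bins.append([min(10.0, max(0.0, rng.gauss(mu, 1.5)))
                     for _ in range(n_points)])
    return bins


def clt_threshold(bins, q=0.95):
    """Normal approximation: the sum of independent draws has
    mean = sum of bin means and variance = sum of bin variances."""
    total_mean = sum(statistics.fmean(b) for b in bins)
    total_var = sum(statistics.pvariance(b) for b in bins)
    z = statistics.NormalDist().inv_cdf(q)  # about 1.645 for q = 0.95
    return total_mean + z * total_var ** 0.5


def monte_carlo_threshold(bins, n_draws=5000, q=0.95, seed=1):
    """Empirical quantile: draw one value per bin, sum, repeat, sort."""
    rng = random.Random(seed)
    sums = sorted(sum(rng.choice(b) for b in bins) for _ in range(n_draws))
    return sums[int(q * n_draws)]


if __name__ == "__main__":
    bins = make_bins()
    print("normal approximation:", clt_threshold(bins))
    print("Monte Carlo estimate:", monte_carlo_threshold(bins))
```

With 200 bins the two estimates should land close to each other. The normal approximation needs only one pass over each bin to get its mean and variance, so it scales to 1 million points per bin, while the Monte Carlo version is a useful sanity check if the independence or near-normality assumptions are in doubt.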
posted by chrisamiller to computers & internet (9 comments total)
posted by vytae at 2:39 PM on November 12, 2008