Statistics Filter: Permutations and large data sets
November 12, 2008 2:28 PM

I have 200 bins, with 1 million data points in each. Each data point can have a value from zero to 10, and we can assume that they're normally distributed. If I calculate a sum by drawing one random data point from each bin and adding them, what value does that sum need to be before I can say that it's higher than 95% of the other possible sums (with reasonable probability)?

The brute-force way to do this is to calculate all possible sums, sort them, then find the value 95% of the way through the list. Obviously, this won't work, since the number of combinations is astronomical (a million choices in each of 200 bins gives 10^1200 possible sums). So what's the appropriate way to approximate this?

I know basic R, along with several scripting languages, and have access to a fair amount of computing power, so pointers to code or libraries are welcome as well.
posted by chrisamiller to Computers & Internet (9 answers total)
 
Just checking, are they normally distributed or uniformly distributed? I would expect uniformly from the other stuff in your question, but you specify normal distribution. If it truly is normally distributed, I think we'll need to know the mean and standard deviation for each data point to answer your question. Possibly. I'm only halfway through my first stats course.
posted by vytae at 2:39 PM on November 12, 2008


Response by poster: vytae: Within each sample, the points are normally distributed.
posted by chrisamiller at 2:41 PM on November 12, 2008


You can treat this as the sum of 200 independent, identically distributed random variables. It's pretty easy mathematically, but you need to get your assumptions worked out--if the bins are normally distributed, they can take on any value on the real line, so your [0,10] bounds don't hold. Maybe it's a truncated normal?
posted by xbalto at 2:52 PM on November 12, 2008


Response by poster: Hrmm - I may have been overzealous in assuming normality. What if I can't assume a normal distribution for each bin? How would that alter what I need to do?
posted by chrisamiller at 3:21 PM on November 12, 2008


Response by poster: Mandatory disclaimer - not homework-filter, just trying to wrap my head around some data
posted by chrisamiller at 3:22 PM on November 12, 2008


Best answer: No matter what the distribution of the individual bins, the sum of a large enough number of independent draws is approximately normal. (Central Limit Theorem)

If you have a distribution for an individual bin that has mean mu and standard deviation sigma, you can approximate the sum of 200 bins as a normal with mean 200*mu and standard deviation 200*sigma. Then you can just take the 95% point of the distribution of that normal variable. Does that make sense? You'll still have to make some assumptions about the distribution of the bins, but the nice part is that it'll be approximately normal no matter what.
posted by xbalto at 3:43 PM on November 12, 2008


This is a straightforward sampling distribution application. You want the sampling distribution of the sum of 200 i.i.d. variables. The mean of this sampling distribution will be the sum of the means of the bins, and the variance of this sampling distribution will be the sum of the variances of the bins. Calculate your 95% level accordingly.
posted by shadow vector at 3:47 PM on November 12, 2008 [1 favorite]


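Whoops, my last comment should say variance where it says "standard deviation": the variance of the sum is 200 times the variance of a single bin, so its standard deviation is sqrt(200)*sigma, not 200*sigma.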
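
With that correction applied, the identically-distributed case can be sketched in a few lines. This is only an illustration using Python's standard library; the function name `sum_quantile_iid` and the per-bin values `mu=5.0` and `sigma=1.5` are made up for the example and are not taken from the poster's data.

```python
from math import sqrt
from statistics import NormalDist

def sum_quantile_iid(mu, sigma, n_bins=200, p=0.95):
    """Normal approximation to the p-quantile of a sum of n_bins i.i.d. draws."""
    total_mean = n_bins * mu          # means add
    total_sd = sqrt(n_bins) * sigma   # variances add, so the sd grows like sqrt(n)
    return NormalDist(total_mean, total_sd).inv_cdf(p)

# Hypothetical per-bin mean and standard deviation, for illustration only.
threshold = sum_quantile_iid(mu=5.0, sigma=1.5)
print(round(threshold, 1))
```

A sum above `threshold` would then be larger than roughly 95% of the possible sums, to the extent the normal approximation holds.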
posted by xbalto at 4:09 PM on November 12, 2008


I take back "i.i.d.," they don't need to be identically distributed. Sum the means, sum the variances, you've got your sampling distribution regardless of the shape of the distributions of the bins.
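
A rough sketch of "sum the means, sum the variances" for bins that are not identically distributed, with a small random-sampling check alongside it. The check is a separate Monte Carlo technique added here for comparison, not something proposed in the thread. The data is synthetic (200 bins of 1,000 uniform points each, an assumption standing in for the real 1-million-point bins), and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, pvariance

def sum_quantile(bins, p=0.95):
    """Normal approximation: add the per-bin means and the per-bin variances."""
    total_mean = sum(fmean(b) for b in bins)
    total_var = sum(pvariance(b) for b in bins)
    return NormalDist(total_mean, sqrt(total_var)).inv_cdf(p)

def simulated_quantile(bins, p=0.95, n_draws=5_000, rng=random):
    """Monte Carlo check: draw one point per bin, sum, take the empirical quantile."""
    sums = sorted(sum(rng.choice(b) for b in bins) for _ in range(n_draws))
    return sums[int(p * n_draws)]

# Synthetic stand-in data (an assumption): each bin is uniform on [0, upper],
# with a different upper bound per bin, so the bins are neither normal nor
# identically distributed.
rng = random.Random(0)
bins = [[rng.uniform(0, 2 + 8 * i / 199) for _ in range(1000)] for i in range(200)]

approx = sum_quantile(bins)
check = simulated_quantile(bins, rng=rng)
print(round(approx, 1), round(check, 1))
```

If the two printed numbers are close, that suggests the normal approximation is adequate for data shaped like this; for real data the per-bin means and variances would be computed from the actual bins instead.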
posted by shadow vector at 4:15 PM on November 12, 2008

