January 9, 2010 9:30 PM Subscribe

What do I check if some one-variable data fits a distribution? In this case, I plotted my data in excel and it looks like a gamma distribution. How do I check "goodness of fit" like you can with the linear/polynomial/power/exponential regressions that excel has? I also have mathematica at my disposal.

posted by lpctstr; to Science & Nature (8 answers total) 2 users marked this as a favorite

posted by lpctstr; to Science & Nature (8 answers total) 2 users marked this as a favorite

I should add: there are several ways to estimate the shape and scale parameters α and ß of a gamma distribution based on experimental data. The simplest way is to let αß equal the mean and αß/sqrt(α) equal the standard deviation (this is the method of moments). A more accurate approach involves the method of maximum likelihood, but the calculations are more involved. Johnson's *Continuous Univariate Distributions* has a few useful approximations, some of which can also be found online (e.g., Wikipedia's article on gamma distributions).

posted by Mapes at 9:50 PM on January 9, 2010

posted by Mapes at 9:50 PM on January 9, 2010

As far as I know, Excel doesn't have this functionality on its own. Mathematica may but I can't speak to that... Most statistical packages (R, SAS, STATA, etc...) certainly will.. Not sure how much you care to figure this out, but R is open-source and free (though of course the time to learn it or get help with it is not).

posted by drpynchon at 10:15 PM on January 9, 2010

posted by drpynchon at 10:15 PM on January 9, 2010

Thanks Mapes and drpynchon. I tried the method of moments and somehow got negative parameters -- I'll continue thinking about this. In the meantime, I have access to SPSS at work on Monday so I'll give that a try later.

posted by lpctstr; at 11:53 PM on January 9, 2010

posted by lpctstr; at 11:53 PM on January 9, 2010

I'd make a) a quantile-quantile plot vs the fitted distribution b) overlay the empirical cumulative distribution and the fitted cumulative distribution.

In R, you can use the fitdistr function to get MLEs of the parameters. There's a function of the same same in S-Plus if I recall correctly, but the syntax is different.

posted by a robot made out of meat at 8:11 AM on January 10, 2010

In R, you can use the fitdistr function to get MLEs of the parameters. There's a function of the same same in S-Plus if I recall correctly, but the syntax is different.

posted by a robot made out of meat at 8:11 AM on January 10, 2010

If you did method of moments and got negative parameters something is wrong in your calculation is all. The data must all be positive to be gamma, so the sample average = alpha*beta must be positive, and variances are always positive, so the sample variance = alpha *beta^2 must also be positive; beta = variance / average is therefore positive and alpha = average^2/variance is also positive.

posted by a robot made out of meat at 9:58 AM on January 10, 2010

posted by a robot made out of meat at 9:58 AM on January 10, 2010

robot: You're absolutely right. Though I corrected my error and got an alpha=0.24 and beta=80000 which still doesn't fit right when I plot it. My guess-and-check shows that alpha should be 2.5 and beta about 3.5. Will keep trying

posted by lpctstr; at 1:24 PM on January 10, 2010

posted by lpctstr; at 1:24 PM on January 10, 2010

Here is some Mathematica code to estimate parameters by using the maximum likelihood method, which should give more accurate answers than the method of moments ("Data" is your list of data):

FindRoot[Log[Data/GeometricMean[Data]] == Log[\[Alpha]] - PolyGamma[\[Alpha]], {\[Alpha], 4}]

Beta = Mean[Data]/(Alpha/.%)

R = Log[Mean[Data]/GeometricMean[Data]];

Alpha = (.500876 + .1648852 R - .0544274 R^2)/R

Beta = Mean[Data]/%

s = Log[Total[Data]/Length[Data]] - Total[Log[Data]]/Length[Data];

Alpha = (3 - s + Sqrt[(s - 3)^2 + 24 s])/12/s

Beta = Mean[Data]/%

All three should give roughly the same numbers for the shape and scale parameters. The second and third are just estimates for the first, and can more easily be brought into Excel. All these equations are discussed in Johnson's book that I mentioned above.

The following code produces a probability or quantile-quantile plot that compares expected data to observed data (the statistic medians are discussed here):

Needs["StatisticalPlots`"]

L = Length[Data]

StatisticMedians = Join[{1 - 0.5^(1/L)}, Table[(i - 0.3175)/(L + .365), {i, 2, L - 1}], 0.5^(1/L)}];

QuantilePlot[InverseCDF[GammaDistribution[Alpha, Beta], StatisticMedians], Data]

If your data doesn't approximately follow a straight line with*y* = *x* on the probability plot (where Alpha and Beta are estimated by using one of the methods above), then it doesn't follow a gamma distribution. These plots accentuate the behavior at the tails, where deviations can signal an incorrectly applied distribution.

posted by Mapes at 7:18 AM on January 11, 2010

FindRoot[Log[Data/GeometricMean[Data]] == Log[\[Alpha]] - PolyGamma[\[Alpha]], {\[Alpha], 4}]

Beta = Mean[Data]/(Alpha/.%)

R = Log[Mean[Data]/GeometricMean[Data]];

Alpha = (.500876 + .1648852 R - .0544274 R^2)/R

Beta = Mean[Data]/%

s = Log[Total[Data]/Length[Data]] - Total[Log[Data]]/Length[Data];

Alpha = (3 - s + Sqrt[(s - 3)^2 + 24 s])/12/s

Beta = Mean[Data]/%

All three should give roughly the same numbers for the shape and scale parameters. The second and third are just estimates for the first, and can more easily be brought into Excel. All these equations are discussed in Johnson's book that I mentioned above.

The following code produces a probability or quantile-quantile plot that compares expected data to observed data (the statistic medians are discussed here):

Needs["StatisticalPlots`"]

L = Length[Data]

StatisticMedians = Join[{1 - 0.5^(1/L)}, Table[(i - 0.3175)/(L + .365), {i, 2, L - 1}], 0.5^(1/L)}];

QuantilePlot[InverseCDF[GammaDistribution[Alpha, Beta], StatisticMedians], Data]

If your data doesn't approximately follow a straight line with

posted by Mapes at 7:18 AM on January 11, 2010

This thread is closed to new comments.

posted by Mapes at 9:34 PM on January 9, 2010