What is the expected value function of the Gumbel distribution?
January 30, 2013 6:22 AM

I believe that I need to use the expected value function of the Gumbel distribution to analyze some data. However, I can't find a guide on how to do this that is not written for mathematicians, beyond answers like a(n) + gamma*b(n), which isn't exactly helpful if a(n) and b(n) are not clearly defined. Or do I even need to calculate this?

Here is my situation. I am good at mathematics, but I am not a mathematician. This means that I only half know what I am talking about.

I have 20 different populations of cells. For each population, I have surveyed 3 different numbers of cells (think 10 cells, 1000 cells, and 100000 cells) and recorded their maximum score for a trait. I only have this maximum score for the three different numbers of cells. Nothing else is known about these populations.

I want to see how these 20 populations differ from each other. My plan was to look at the linear regression of the number of cells on these maximum values. I could then compare the slope of the linear regression to see how the populations differ.

However, this function isn't expected to be linear. So, using a linear regression seems... unwise. As I am working with the maximum value of the trait, I believe that I need to use extreme value theory. I am working under the assumption that the distribution of the trait is a normal distribution, so this points me towards the Gumbel distribution to find maximums.

Therefore, by plotting the number of cells and the maximum trait values, I am really looking at the number of cells and the expected value of the Gumbel distribution.

What is the equation that links these two things?

I realize that this equation should rely on the mean and variance of the trait distribution (which I am assuming is normally distributed). So, by fitting my actual data with this equation, I can estimate the mean and standard deviation of the trait distribution. Then, my plan is to use these estimates to compare the 20 different populations.

I realize that '3 different numbers of cells' is an insanely low number of data points to estimate parameters on. But, I am looking for incredibly large effects, so large errors should not be a problem.

Does my logic make sense, and which equation am I looking for?

Thank you!
posted by Peter Petridish to Science & Nature (6 answers total)
 
Do you have Matlab access?

You can make this much easier by using the built-in "gevfit" to fit the data to a Generalized Extreme Value distribution, which encompasses the Gumbel distribution. The built-in algorithm uses Maximum Likelihood Estimation to estimate the distribution's parameters from a dataset.
posted by oceanjesse at 6:57 AM on January 30, 2013


Also, you might want to run your basic idea by some statistically sophisticated colleagues. You might be able to transform your data to be linear. It could be the case that the log of the trait is linear or log(trait/n) is linear. Also, what are you doing this in? If Excel, you have one strategy; in R or Matlab, another.

Here is the function, BTW.
posted by shothotbot at 7:01 AM on January 30, 2013


Response by poster: I have easy access to R. I can ask my co-workers to quickly run things in Matlab, but that's not ideal if I want to play around with the data in the future.

The Wolfram Alpha definition of the function has variables 'a' and 'b', for 'location' and 'scale' respectively... but what are these numbers?

Would using a Maximum Likelihood Estimation allow me to estimate the variance and mean of the underlying trait distribution, or would it just estimate 'a' and 'b'?

I can transform the data to be linear, but would that help? For example, increasing the number of cells from 10 to 100 should lead to a huge increase in the maximum value of the trait, much more than increasing the number of cells from 10^7 to 10^8. A log transformation would make the data linear, but it ignores this idea from extreme value theory (unless, of course, I am not understanding it correctly, which is a definite possibility).
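
To put rough numbers on that intuition, here is a minimal Python sketch of the a(n) + gamma*b(n) approximation from the question, assuming the trait is Normal(mu, sigma). The function name expected_max_normal and the choice of norming constants (a(n) as the 1 - 1/n quantile of the standard normal, b(n) = 1/(n*phi(a(n)))) are assumptions made for illustration; other textbook choices of a(n) and b(n) give slightly different values, and the approximation is rough for small n.

```python
from statistics import NormalDist

# Euler-Mascheroni constant: the mean of a standard Gumbel distribution.
EULER_GAMMA = 0.5772156649015329


def expected_max_normal(n, mu=0.0, sigma=1.0):
    """Approximate E[max of n i.i.d. Normal(mu, sigma) draws].

    Uses the Gumbel approximation mu + sigma * (a_n + gamma * b_n),
    with one common (assumed) choice of norming constants.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    std = NormalDist()                  # standard normal
    a_n = std.inv_cdf(1.0 - 1.0 / n)    # location: the (1 - 1/n) quantile
    b_n = 1.0 / (n * std.pdf(a_n))      # scale: 1 / (n * density at a_n)
    return mu + sigma * (a_n + EULER_GAMMA * b_n)


if __name__ == "__main__":
    # Expected maximum, in units of sigma above the mean, for several n.
    for n in (10, 100, 10**7, 10**8):
        print(n, round(expected_max_normal(n), 3))
```

Under these assumptions the expected maximum grows roughly like sqrt(2*log(n)), so the step from 10 to 100 cells adds more than the step from 10^7 to 10^8, and the curve is not linear in log(n) either.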
posted by Peter Petridish at 7:38 AM on January 30, 2013


Hi,

It sounds like what you really want to do is parameter estimation: given my observations of the maxes for each population, what are my population parameters likely to be?

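Here's how you might approach this if you want to do, say, maximum likelihood estimation. The log-likelihood function for the max m of n observations is, up to an additive constant log(n) that does not depend on mu or sigma:

ll(m,n;mu,sigma) = log(phi(m;mu,sigma)) + (n-1)*log(Phi(m;mu,sigma)),

where phi is the normal density and Phi is the normal cumulative distribution function.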

(Does MeFi do latex, by the way? This is terrible.) Anyway, take partials with respect to mu and sigma, and (optimally) solve for dll/dmu = dll/dsigma = 0. Or (more likely) try to optimize the log likelihood numerically if that doesn't work out.

Dealing with the multiple observations is a little bit trickier. I would start by treating the observations as independent, and getting three separate estimates per population. (Hopefully they will agree!) In reality, they are not strictly independent, and they are not all equally important observations. I have some ideas about that, but it's probably better to start with the simplest thing that could possibly work. Feel free to send me a message if you are stuck.
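
As a rough illustration of the numerical route, here is a minimal Python sketch. It assumes a normal trait distribution and treats the observations as independent. Because a single maximum cannot pin down both mu and sigma, it sums the log-likelihood over all of a population's (n, m) pairs rather than fitting each one separately. The names max_loglik and fit_grid, the coarse grid search, and the example numbers are all hypothetical; a real analysis would use a proper optimizer.

```python
import math
from statistics import NormalDist


def max_loglik(m, n, mu, sigma):
    """Log-likelihood of seeing maximum m among n i.i.d. Normal(mu, sigma) draws.

    The density of the maximum is n * phi(m) * Phi(m)**(n - 1).
    """
    dist = NormalDist(mu, sigma)
    pdf = dist.pdf(m)
    cdf = dist.cdf(m)
    if pdf <= 0.0 or cdf <= 0.0:
        return -math.inf  # numerically impossible under these parameters
    return math.log(n) + math.log(pdf) + (n - 1) * math.log(cdf)


def fit_grid(observations, mus, sigmas):
    """Grid-search the (mu, sigma) pair maximizing the summed log-likelihood.

    observations: list of (n, m) pairs, treated as independent (an assumption).
    Returns (best log-likelihood, mu, sigma).
    """
    best = (-math.inf, None, None)
    for mu in mus:
        for sigma in sigmas:
            ll = sum(max_loglik(m, n, mu, sigma) for n, m in observations)
            if ll > best[0]:
                best = (ll, mu, sigma)
    return best


if __name__ == "__main__":
    # Hypothetical data for one population: (number of cells, maximum score).
    obs = [(10, 6.5), (1000, 11.0), (100000, 14.0)]
    mus = [0.25 * i for i in range(0, 41)]      # candidate means 0 .. 10
    sigmas = [0.25 * i for i in range(1, 41)]   # candidate SDs 0.25 .. 10
    print(fit_grid(obs, mus, sigmas))
```

With only three maxima per population the likelihood surface is quite flat, so the estimates from a sketch like this should be read as rough guides, not precise values.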
posted by lambdaphage at 9:33 AM on January 30, 2013


Maximum Likelihood Estimation will find optimal a and b. Once you have the distribution, you can find the variance and mean of the parameterized distribution. With Matlab, you can use gevstat for this. I am sure that all of this stuff exists in R, I’m just giving you what I know.
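
For the Gumbel case specifically (a maximum-type Gumbel with location a and scale b), the mean and variance have closed forms, so a few lines of Python are enough as a sanity check. The function name gumbel_mean_var is made up for illustration. Note that these are the mean and variance of the fitted distribution of the maximum, not of the underlying trait.

```python
import math

# Euler-Mascheroni constant.
EULER_GAMMA = 0.5772156649015329


def gumbel_mean_var(a, b):
    """Mean and variance of a (maximum) Gumbel distribution.

    a: location, b: scale (must be positive).
    mean = a + gamma * b, variance = (pi**2 / 6) * b**2.
    """
    if b <= 0:
        raise ValueError("scale b must be positive")
    mean = a + EULER_GAMMA * b
    var = (math.pi ** 2 / 6.0) * b ** 2
    return mean, var


if __name__ == "__main__":
    # Standard Gumbel: location 0, scale 1.
    print(gumbel_mean_var(0.0, 1.0))
```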
posted by oceanjesse at 1:12 PM on January 31, 2013


Response by poster: Thanks! This gives me another place to start! My plan is to look up the functions in Matlab, read up on them, use my co-worker's computer for this, but at the same time figure out the equivalents in R. Thank you everyone!
posted by Peter Petridish at 10:29 AM on February 2, 2013

