January 18, 2008 7:20 PM Subscribe

for the statisticians: what probability distribution do I want to fit to my data, and how?

I have a collection of widgets. they age, and as they age some of them get tweaked. it's been suggested to me that given enough time every widget will eventually be tweaked, but I'm not convinced. the data I have covers ~64000 widgets, and for each widget is either the age at which it was tweaked, or alternatively the current age and the fact that it has never (yet) been tweaked.

as I understand it (I'm a database programmer with an interest in maths but very little working stats) I want to fit some probability distribution to my data and see whether the area under the probability mass function is < 1.

the 2 things that make this different from everything I've been able to find on the 'net are

1) I specifically don't expect the cumulative distribution to asymptote to 1, which appears to count out obvious candidates like a Poisson distribution

2) my data contains negative examples - those widgets which have not yet been tweaked, and may never be. I'm assuming these are relevant to the shape of the final distribution, but I can't work out how to get tools like R to take them into account.

the best advice I've had so far is from an engineer friend who said to chuck the data on only the tweaked widgets into R, fit a Poisson distribution and be done with it. unfortunately this fails on both of the above points. (then again, all he normally cares about is failure rates in aircraft parts, and you know that given enough time every lug*will* eventually break).

so, if there are any stats geeks who understand what I'm trying to do and can point me at tools/docs/info on how to do it you'd rock my world.
posted by russm to Science & Nature (17 answers total) 2 users marked this as a favorite

I have a collection of widgets. they age, and as they age some of them get tweaked. it's been suggested to me that given enough time every widget will eventually be tweaked, but I'm not convinced. the data I have covers ~64000 widgets, and for each widget is either the age at which it was tweaked, or alternatively the current age and the fact that it has never (yet) been tweaked.

as I understand it (I'm a database programmer with an interest in maths but very little working stats) I want to fit some probability distribution to my data and see whether the area under the probability mass function is < 1.

the 2 things that make this different from everything I've been able to find on the 'net are

1) I specifically don't expect the cumulative distribution to asymptote to 1, which appears to count out obvious candidates like a Poisson distribution

2) my data contains negative examples - those widgets which have not yet been tweaked, and may never be. I'm assuming these are relevant to the shape of the final distribution, but I can't work out how to get tools like R to take them into account.

the best advice I've had so far is from an engineer friend who said to chuck the data on only the tweaked widgets into R, fit a Poisson distribution and be done with it. unfortunately this fails on both of the above points. (then again, all he normally cares about is failure rates in aircraft parts, and you know that given enough time every lug

so, if there are any stats geeks who understand what I'm trying to do and can point me at tools/docs/info on how to do it you'd rock my world.

This is a very interesting problem. There is a wide literature on right censoring that I am not terribly familiar with, so I will try to describe how I would handle this problem. Lets adopt some notation: we have n widgets, a_n is the age of the widget and b_n is the time at which it had been tweaked.

Lets first assume that every widget will be tweaked, but some just haven't been tweaked yet. IE, every widget has an underlying b_n, but if it's greater than a_n, it's unknown. If you think that some widgets will never be tweaked, just pretend that the corresponding b_n is infinity.

Lets write down the probability of the data we observed. The term for each widget for which b_n <>

Now, we can take any probability model you like and write down the likelihood according to it. For example, for Poissons,

P(tweak at time b_n) = exp(-lambda) lambda^b_n / b_n!

P(tweak later than b_n) is sum_(x>b_n) exp(-lambda) lambda^x / x!

Or, if we assume an exponential distribution

P(tweak at time b_n) = lambda exp(-lambda * b_n)

P(tweak after b_n) = exp (-lambda * b_n)

Or, if we assume that each widget has a probability p of being tweaked, otherwise it's not tweaked, and if it does happen to be tweaked, the time follows the exponential distribution,

P(tweak at time b_n) = p * lambda * exp(-lambda * b_n)

P(tweak after b_n) = 1-p + p * exp(-lambda * b_n)

The likelihood of the observed data is just the product of the per-widget likelihoods, so you can write it down as a function of your parameters, and estimate them in some way, such as Maximum Likelihood Estimation. The specifics of how to do that depend on the model; sometimes it's as easy as taking the derivative and setting it to zero; at other times, more complex methods like EM are required.

Sadly, this doesn't seem like a R one-liner, though you can probably write the likelihood functions pretty simply in terms of R primitives and them call a multidimensional optimization method (optim, I think) to find the minimum. Note that all these approaches are quite parametric and sensitive to the specific choice of model -- but otherwise, you have the philosophical problem of "how do I know if a widget will be tweaked in 20 years or never, if the oldest widget I've seen is only 5 years old."

posted by bsdfish at 8:05 PM on January 18, 2008

Lets first assume that every widget will be tweaked, but some just haven't been tweaked yet. IE, every widget has an underlying b_n, but if it's greater than a_n, it's unknown. If you think that some widgets will never be tweaked, just pretend that the corresponding b_n is infinity.

Lets write down the probability of the data we observed. The term for each widget for which b_n <>

Now, we can take any probability model you like and write down the likelihood according to it. For example, for Poissons,

P(tweak at time b_n) = exp(-lambda) lambda^b_n / b_n!

P(tweak later than b_n) is sum_(x>b_n) exp(-lambda) lambda^x / x!

Or, if we assume an exponential distribution

P(tweak at time b_n) = lambda exp(-lambda * b_n)

P(tweak after b_n) = exp (-lambda * b_n)

Or, if we assume that each widget has a probability p of being tweaked, otherwise it's not tweaked, and if it does happen to be tweaked, the time follows the exponential distribution,

P(tweak at time b_n) = p * lambda * exp(-lambda * b_n)

P(tweak after b_n) = 1-p + p * exp(-lambda * b_n)

The likelihood of the observed data is just the product of the per-widget likelihoods, so you can write it down as a function of your parameters, and estimate them in some way, such as Maximum Likelihood Estimation. The specifics of how to do that depend on the model; sometimes it's as easy as taking the derivative and setting it to zero; at other times, more complex methods like EM are required.

Sadly, this doesn't seem like a R one-liner, though you can probably write the likelihood functions pretty simply in terms of R primitives and them call a multidimensional optimization method (optim, I think) to find the minimum. Note that all these approaches are quite parametric and sensitive to the specific choice of model -- but otherwise, you have the philosophical problem of "how do I know if a widget will be tweaked in 20 years or never, if the oldest widget I've seen is only 5 years old."

posted by bsdfish at 8:05 PM on January 18, 2008

Err, I had a brain fart -- you certainly don't want to use a poisson distribution to model the time until tweaking. I'll try writing some R code and posting it in a bit.

posted by bsdfish at 8:08 PM on January 18, 2008

posted by bsdfish at 8:08 PM on January 18, 2008

What's your real goal and data? What are you trying to figure out?

I mean, do you *really* just want to know whether every widget will eventually be tweaked? If you do, this is probably not a matter for statistics.

If what you really want is to model how long it takes for a widget to get tweaked, this is a standard "duration model," "survival model," or "event history analysis." There are several flavors of this, such as proportional-hazard models. They can generally deal with some degree of right-censoring.

posted by ROU_Xenophobe at 8:12 PM on January 18, 2008

I mean, do you *really* just want to know whether every widget will eventually be tweaked? If you do, this is probably not a matter for statistics.

If what you really want is to model how long it takes for a widget to get tweaked, this is a standard "duration model," "survival model," or "event history analysis." There are several flavors of this, such as proportional-hazard models. They can generally deal with some degree of right-censoring.

posted by ROU_Xenophobe at 8:12 PM on January 18, 2008

I don't think you're going to be able to find out if some of the widgets are never going to be tweaked by fitting the data to a probability distribution. The very definition of a probability distribution requires that the cumulative distribution reach 1 eventually. Sorry this reply is not more useful but I just think you're barking up the wrong tree here.

posted by peacheater at 8:18 PM on January 18, 2008

posted by peacheater at 8:18 PM on January 18, 2008

Here's the R code as promised. Note that it usually recovers the parameters fairly well, but because the log-likelihood function isn't convex, the optimization sometimes finds a crappy minimum.

n = 1000;

p = 0.2; # Probability that it's tweaked

b = rexp(n) # Randomly choose tweaking times.

a = rexp(n) + 1; # Randomly choose observation ages.

d = rbinom(n, 1, p);

b[b > a] = Inf; # Block out the unobserved ones

b[!d] = Inf; # Block out ones that will never be tweaked

# Write down the likelihood function

prob = function(params)

{ lambda = params[1]; p = params[2]

-sum(log(p * dexp(b[b <> }

# Note, this likelihood function isn't convex, so EM is probably

# better, but this is simpler.

optim(c(0.5, 0.5),prob, method="L-BFGS",lower = c(1e-5, 1e-5), upper = c(Inf, 1-1e-5))

posted by bsdfish at 8:32 PM on January 18, 2008

n = 1000;

p = 0.2; # Probability that it's tweaked

b = rexp(n) # Randomly choose tweaking times.

a = rexp(n) + 1; # Randomly choose observation ages.

d = rbinom(n, 1, p);

b[b > a] = Inf; # Block out the unobserved ones

b[!d] = Inf; # Block out ones that will never be tweaked

# Write down the likelihood function

prob = function(params)

{ lambda = params[1]; p = params[2]

-sum(log(p * dexp(b[b <> }

# Note, this likelihood function isn't convex, so EM is probably

# better, but this is simpler.

optim(c(0.5, 0.5),prob, method="L-BFGS",lower = c(1e-5, 1e-5), upper = c(Inf, 1-1e-5))

posted by bsdfish at 8:32 PM on January 18, 2008

ROU_Xenophobe - as far as modelling what's happening, what difference does it make? I believe my approach to the problem is correct, but as I said I have very little working stats knowledge... but anyway my data are MeFi threads, and (sometimes) they get Godwined. I saw this a while ago, thought it would be interesting to do something with slightly more rigour, and MeFi seemed an obvious data set... I've collected threads 2800 to 66599 and extracted the thread comment # at which each unique word first appears in the thread... as a rough model of Godwining I'll take the first comment in which hitler/nazi/gestapo/fascist/whatever appears... the plan is to break the results down by thread tags as well, and perhaps try the same for some non-nazi words that might fill a similar role in MeFi threads... "pancakes" springs to mind as a contender...

langedon - thanks, that's a term I hadn't found but you're right it's what I meant...

bsdfish - much obliged, I'll give that a go...

posted by russm at 9:17 PM on January 18, 2008

langedon - thanks, that's a term I hadn't found but you're right it's what I meant...

bsdfish - much obliged, I'll give that a go...

posted by russm at 9:17 PM on January 18, 2008

bsdfish's example assumes a normal distribution. Do not assume this. Writing it for a weird distribution would be hard if not impossible, I would keep this out of statistics.

posted by geoff. at 9:49 PM on January 18, 2008

posted by geoff. at 9:49 PM on January 18, 2008

How to do something depends on what you want to do. People asking these how-to-do-stats questions have an annoying tendency to obfuscate what they're really trying to do and getting bad advice about it.

If you mean, "Do all mefi threads get godwinned?" then the answer has nothing to do with statistics. The answer is "No." This thread doesn't, and it never will because it's closed. That is your final answer. There is no probability -- that one didn't, so the probability that all threads get eventually godwinned is exactly zero.

If you mean, "What determines how long it takes for someone to invoke Hitler, if they ever do?" then there are a variety of duration models for you to use.

posted by ROU_Xenophobe at 10:18 PM on January 18, 2008

(OP here, on a friend's account as login.metafilter doesn't seem to want to talk to me)

ROU - well of course once a thread has been closed it can't be Godwined, but then again I've never seen the length of an online discussion approach infinity so the whole thing isn't exactly going to be statistically rigorous is it? given the context it's not really anything but a gag... jeez...

posted by pompomtom at 1:59 AM on January 19, 2008

ROU - well of course once a thread has been closed it can't be Godwined, but then again I've never seen the length of an online discussion approach infinity so the whole thing isn't exactly going to be statistically rigorous is it? given the context it's not really anything but a gag... jeez...

posted by pompomtom at 1:59 AM on January 19, 2008

You might take a look at this wiki entry on pearson forms of non normal distribution, I think it might help you formulate your data collection and analysis

posted by ptm at 6:24 AM on January 19, 2008

posted by ptm at 6:24 AM on January 19, 2008

OK. for those who've said "stats isn't the way to model this", do you have any suggestions for what *is* the way to do it? or are you just saying "give up"?

posted by russm at 7:41 AM on January 19, 2008

posted by russm at 7:41 AM on January 19, 2008

If you want to know whether all threads get godwinned, then you don't need statistics. You describe the data-generating process. If it's not a logical necessity that all threads be godwinned, then not all threads will be godwinned given a large enough sample.

For something that is basically a geeky gag and where the goal seems to be to ask whether all threads eventually get godwinned, I would do one of the following, neither of which are actually well-considered:

(1) Create a dependent variable GODWIN that takes 1 if a thread is godwinned and 0 otherwise. Create an exposure variable that's the number of comments in a thread. Do a probit or logit using exposure to explain godwin. Does higher exposure = longer threads correlate to a higher probability of godwinning? Since the answer is almost certainly, you can also find the thread length at which the probability of godwinning is 0.99999 or higher; logit is easier to do this for since the logit equation is far easier than the cumulative normal.

(2) Create a dependent variable that's the number of times that a thread is godwinned, and an exposure variable. Do an event-count model (poisson regression or negative binomial regression) using exposure to explain the count of godwinnings.

Logit, probit, poisson regression, and negative binomial regression are all probably already canned in R.

posted by ROU_Xenophobe at 8:42 AM on January 19, 2008

For something that is basically a geeky gag and where the goal seems to be to ask whether all threads eventually get godwinned, I would do one of the following, neither of which are actually well-considered:

(1) Create a dependent variable GODWIN that takes 1 if a thread is godwinned and 0 otherwise. Create an exposure variable that's the number of comments in a thread. Do a probit or logit using exposure to explain godwin. Does higher exposure = longer threads correlate to a higher probability of godwinning? Since the answer is almost certainly, you can also find the thread length at which the probability of godwinning is 0.99999 or higher; logit is easier to do this for since the logit equation is far easier than the cumulative normal.

(2) Create a dependent variable that's the number of times that a thread is godwinned, and an exposure variable. Do an event-count model (poisson regression or negative binomial regression) using exposure to explain the count of godwinnings.

Logit, probit, poisson regression, and negative binomial regression are all probably already canned in R.

posted by ROU_Xenophobe at 8:42 AM on January 19, 2008

thank you. it seems I have some more study to do, and might even come out of this having expanded my knowledge a little. which was, after all, the actual goal of the exercise - pick a toy problem that looks interesting and use it as a vehicle to do some learning.

posted by russm at 9:27 AM on January 19, 2008

posted by russm at 9:27 AM on January 19, 2008

Err, nothing I have stated assumed anything like the normal distribution. In fact, even in the examples I provided, I used a binomial mixture and an exponential delay time until tweaking. However, it's true -- what I described is basically a family of algorithms for fitting different statistical distributions to your data, and the choice of distribution is very significant for the results you will observe.

I think the advice to keep it out of statistics is misguided as a general principle, though it may be easier and "good enough" for a specific problem.

posted by bsdfish at 3:05 PM on January 19, 2008

I think the advice to keep it out of statistics is misguided as a general principle, though it may be easier and "good enough" for a specific problem.

posted by bsdfish at 3:05 PM on January 19, 2008

Yes you are right, I read through them too quickly and thought you were doing something with a log-normal distribution (I couldn't figure out why you'd do this). But you're right ,I was too hasty.

posted by geoff. at 8:26 PM on January 19, 2008

posted by geoff. at 8:26 PM on January 19, 2008

Don't you have two questions, two classes of data, that have to be unpacked?

1) how many widgets suffer an event, out of the total number of widgets

2) for those that did get the event, how long did it take?

posted by wilful at 6:49 PM on January 22, 2008

1) how many widgets suffer an event, out of the total number of widgets

2) for those that did get the event, how long did it take?

posted by wilful at 6:49 PM on January 22, 2008

This thread is closed to new comments.

posted by langedon at 7:52 PM on January 18, 2008