How to transform random variables from a non-normal distribution to a normal distribution?
March 30, 2011 2:34 PM

How would I find a function that transforms random variables from a non-normal probability distribution to a normal distribution?

I have a set of scores from a random distribution that is skewed left. Maximum likelihood estimation using the tools from show a likely Beta or Weibull distribution. For background, the scores are inflated ratings for games.

What I'd like to do is normalize the scores, so that given an arbitrary score from the real distribution, the result is a new score from the transformed normal distribution.

What keywords and topics should I search for to learn how to do this? I'm familiar with R and Matlab, so code from there works too.
posted by formless to Science & Nature (9 answers total)
Best answer: You could try a Box-Cox transformation. I'm not sure how to do it in R, but you could do it in Stata with bcskew0 newvarname = variablename
posted by trcook at 2:39 PM on March 30, 2011
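
For readers who want to see what the Box-Cox suggestion above amounts to, here is a minimal sketch in plain Python (standard library only). It is not the Stata routine mentioned above: the function names, the grid range for lambda, and the made-up Beta(8, 2) scores are all assumptions chosen for illustration. It picks lambda by a simple grid search on the usual Box-Cox profile log-likelihood and requires strictly positive data.

```python
import math
import random
import statistics

def box_cox(x, lam):
    """Box-Cox transform of a single strictly positive value."""
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def box_cox_loglik(data, lam):
    """Profile log-likelihood of lambda, assuming the transformed data are normal."""
    n = len(data)
    transformed = [box_cox(x, lam) for x in data]
    var = statistics.pvariance(transformed)
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(x) for x in data)

def fit_lambda(data, lo=-2.0, hi=5.0, steps=701):
    """Choose lambda by grid search over [lo, hi] (the range is an arbitrary assumption)."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: box_cox_loglik(data, lam))

def skewness(values):
    """Sample skewness (third standardized moment)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Made-up left-skewed "review scores" on a 0-100 scale, for illustration only.
rng = random.Random(0)
scores = [100.0 * rng.betavariate(8, 2) for _ in range(2000)]

lam = fit_lambda(scores)
transformed = [box_cox(s, lam) for s in scores]

print("lambda:", round(lam, 2))
print("skewness before:", round(skewness(scores), 3))
print("skewness after: ", round(skewness(transformed), 3))
```

For left-skewed data the fitted lambda should come out above 1, which stretches the upper end of the scale. The transformed values are on an arbitrary scale, so they would still need rescaling before being reported as scores.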

Not quite sure I completely get what you're doing, but why not just translate the percentiles? So if a score is at the 73rd percentile of one distribution, it's at the 73rd percentile of the normal one when translated.
posted by edd at 3:06 PM on March 30, 2011
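
The percentile translation described above can be sketched as follows, using only the Python standard library. The function name and the target mean of 50 and standard deviation of 15 are assumptions picked for illustration; the target parameters are a free choice, since the percentile ranks come from the observed scores alone.

```python
from bisect import bisect_left, bisect_right
from statistics import NormalDist

def to_normal_scores(scores, target_mean=50.0, target_sd=15.0):
    """Map each score to the normal quantile at the same percentile rank."""
    ordered = sorted(scores)
    n = len(ordered)
    target = NormalDist(mu=target_mean, sigma=target_sd)
    result = []
    for s in scores:
        below = bisect_left(ordered, s)         # scores strictly lower than s
        at_or_below = bisect_right(ordered, s)  # scores lower than or equal to s
        # Mid-rank percentile: ties share one rank, and the value always
        # stays strictly between 0 and 1, which inv_cdf requires.
        pct = (below + at_or_below) / (2.0 * n)
        result.append(target.inv_cdf(pct))
    return result

# Made-up inflated review scores, for illustration only.
reviews = [70, 85, 90, 92, 95]
adjusted = to_normal_scores(reviews)
for raw, new in zip(reviews, adjusted):
    print(raw, "->", round(new, 1))
```

The ordering of the scores is preserved; only the spacing between them changes.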

I may not be understanding what you want to do correctly, but one option is to take the natural log of the values to pull them toward a normal distribution; this often works. I know this can be done in SPSS under the transform variables tab; however, I'm unfamiliar with the programs you mentioned.
posted by goodnight moon at 3:34 PM on March 30, 2011
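
One caveat worth checking before relying on the log transform suggested above: taking logs pulls in a long right tail, so it helps right-skewed data, but it tends to make left-skewed data (like the scores in the question) more skewed. A minimal sketch in plain Python, with made-up data and an illustrative `skewness` helper, compares skewness before and after:

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness (third standardized moment)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(1)

# Made-up data for illustration: one right-skewed set, one left-skewed set.
right_skewed = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]
left_skewed = [100.0 * rng.betavariate(8, 2) for _ in range(5000)]

log_right = [math.log(x) for x in right_skewed]
log_left = [math.log(x) for x in left_skewed]

print("right-skewed, raw:", round(skewness(right_skewed), 3))
print("right-skewed, log:", round(skewness(log_right), 3))
print("left-skewed,  raw:", round(skewness(left_skewed), 3))
print("left-skewed,  log:", round(skewness(log_left), 3))
```

If the skewness moves away from zero after the transform, as it should for the left-skewed set here, the log is working in the wrong direction for that data.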

Response by poster: Some additional information about what I'm trying to do.

I have a set of review scores from videogame reviews (mostly magazines and websites). Due to the gaming review system, reviews are inflated. Reviewers who write bad reviews get blacklisted, or lose advertising, etc.

So, given a review score from this skewed distribution, I'd like to spit out a new score that is much more in line with what an honest review would look like.

I'm making the assumption that game quality is normally distributed.

Translating the percentiles works, but I assume I need to find the parameters of the target distribution first, which I'm not sure how to do yet. I'll look into the Box-Cox transformation to see if that will help.
posted by formless at 3:44 PM on March 30, 2011

Aren't you also assuming linearity, or something like it, in review scores? That is, if Game A scores higher than Game B on your new normal scale, then you're assuming that it got a higher (mean?) review score. That seems like a very strong assumption.
posted by Nomyte at 4:11 PM on March 30, 2011

Not answering your question, but the assumption that game quality is normally distributed seems a bit tenuous. If you just report the percentile for each game you don't need to make that assumption.

Perhaps you believe that people have an easier time understanding normally distributed stats. In that case, you need to apply the inverse of the c.d.f. (cumulative distribution function) to the percentile score, where the percentile score is between 0.0 and 1.0, i.e. it's equal to (number of reviews with a lower score) / (total number of reviews).
Sorry, don't know Matlab or R, just the theory.

[You can actually use the inverse c.d.f. to transform percentiles into any desired distribution, not just normal. Though you're more likely to find inverse c.d.f. implemented for a normal distribution than for any other one.]
posted by benito.strauss at 4:39 PM on March 30, 2011 [1 favorite]
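
The inverse c.d.f. idea above can be sketched generically: compute a percentile once, then pass it through the inverse c.d.f. of whatever target distribution is wanted. This is a minimal illustration using the Python standard library; the helper names, the tie handling, the clamping of extreme percentiles, and the logistic example are all assumptions, not something taken from the thread.

```python
import math
from statistics import NormalDist

def percentile_rank(score, all_scores):
    """Fraction of reviews scoring below `score`, counting ties as half."""
    below = sum(1 for s in all_scores if s < score)
    ties = sum(1 for s in all_scores if s == score)
    return (below + 0.5 * ties) / len(all_scores)

def logistic_inv_cdf(p, loc=0.0, scale=1.0):
    """Closed-form inverse c.d.f. of the logistic distribution."""
    return loc + scale * math.log(p / (1.0 - p))

def remap(score, all_scores, inv_cdf):
    """Send a score to the target distribution via its percentile."""
    n = len(all_scores)
    p = percentile_rank(score, all_scores)
    # Keep p strictly inside (0, 1); inverse c.d.f.s are undefined at 0 and 1.
    p = min(max(p, 0.5 / n), 1.0 - 0.5 / n)
    return inv_cdf(p)

# Made-up review scores, for illustration only.
reviews = [6.0, 7.5, 8.0, 8.5, 9.0, 9.0, 9.5]

standard_normal = NormalDist()  # mean 0, standard deviation 1
for r in sorted(set(reviews)):
    z = remap(r, reviews, standard_normal.inv_cdf)
    q = remap(r, reviews, logistic_inv_cdf)
    print(r, "normal:", round(z, 2), "logistic:", round(q, 2))
```

Swapping the target distribution only means swapping the `inv_cdf` argument; the percentile step stays the same.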

The National Institutes of Health has the same problem with grant application scores. For the applications that can be easily compared, their solution is to go with percentiles, as benito.strauss suggests. Trying to map this to a normal distribution is pretty goofy.
posted by grouse at 4:43 PM on March 30, 2011

Best answer: I came in here to suggest the Box-Cox transform as well. Box-Cox is kind of the catch-all transformation when you have a data set for which it isn't immediately obvious that you should be doing a log-, square root-, or arcsin-transform (to name a few).

At least, that's my recommendation if you have your heart set on a transform to the normal distribution. I think it would be more straightforward to just report the percentiles, as folks have said above.

Are you actually trying to do statistics on these data, or do you just want to report adjusted scores? If you need to use a statistic that assumes the normal distribution, I'd recommend Box-Cox. (Or, don't transform, and just use a nonparametric stat.) If you just want to report adjusted scores, I think the percentile makes the most sense.
posted by pemberkins at 6:16 PM on March 30, 2011 [1 favorite]

Response by poster: Thanks for the help. After examining some additional data from non-industry sources, I'm not quite as confident the quality of games does follow a normal distribution. I'll probably go with reporting adjusted percentiles.
posted by formless at 5:36 PM on March 31, 2011
