Picking the correct probability distribution
January 13, 2011 1:28 AM
Which probability distribution should I use to model examination results?
At the moment I'm using the Beta distribution, but that's mainly because it looks right and is relatively easy to implement in Excel, which is what I use as a markbook. I don't think that the normal distribution is correct because that'd create a symmetric graph and the results are usually biased around one end of the scale, but I wonder about other distributions.
I'm a scientist, so I can understand the maths, but I haven't done a lot of stats work and so I don't know which distribution is appropriate for which situation.
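For reference, one way a Beta curve like the one described in the question could be fitted is the method of moments, after rescaling marks to the 0-1 range. This is a minimal sketch, not necessarily what the poster does in Excel: the function name `beta_method_of_moments`, the `max_mark` parameter, and the sample marks are all made up for illustration.

```python
from statistics import mean, variance


def beta_method_of_moments(marks, max_mark=100.0):
    """Estimate Beta(a, b) shape parameters from marks rescaled to [0, 1]."""
    x = [m / max_mark for m in marks]
    mu = mean(x)
    var = variance(x)  # sample variance (n - 1 denominator)
    # The moment equations only have a solution when var < mu * (1 - mu).
    if not 0.0 < var < mu * (1.0 - mu):
        raise ValueError("sample variance is incompatible with a Beta fit")
    common = mu * (1.0 - mu) / var - 1.0
    return mu * common, (1.0 - mu) * common


# Hypothetical marks out of 100, invented for this example.
marks = [42, 55, 61, 64, 68, 70, 71, 73, 75, 78, 80, 82, 85, 88, 93]
a, b = beta_method_of_moments(marks)
print(f"a = {a:.2f}, b = {b:.2f}")
```

A fit like this always produces a smooth single curve, whatever the data look like, which is the risk several replies below point out.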
What is your goal in matching your exam results to a specific probability distribution? In other words, what kind of statements about the distribution of exam results do you want to make?
Obviously you don't need to model the data with a specific distribution in order to make purely descriptive or summary statements about the data itself: mean, median, mode, variance, skewness, ...
posted by JumpW at 3:31 AM on January 13, 2011
How about lognormal distribution? It's not symmetric about the mid-point mark and the shape can be controlled by changing the parameters.
posted by coolnik at 3:34 AM on January 13, 2011
I don't think it's possible to say without knowing more about the exam, and like JumpW I'd also be happier knowing what you want to do with the model afterwards. It may be that there are better options than trying to model something that you might not be able to model sufficiently well.
posted by edd at 3:38 AM on January 13, 2011
How many exam takers are you talking about, here? Looking at distributions for classes in the 200-300 student range, I've rarely seen a curve smooth enough for the model to really make any actual difference.
posted by Dr.Enormous at 5:21 AM on January 13, 2011
What is your goal in matching your exam results to a specific probability distribution? In other words, what kind of statements about the distribution of exam results do you want to make?
I'm really just trying to show the general distribution of results. The obvious choice would be a histogram but it always ends up looking very "blocky".
posted by alby at 6:27 AM on January 13, 2011
How about lognormal distribution? It's not symmetric about the mid-point mark and the shape can be controlled by changing the parameters.
That's really the heart of my question. I have no reason to choose lognormal over Poisson over Beta; what I'm looking for is a reason to pick one particular distribution.
posted by alby at 6:30 AM on January 13, 2011
Obviously you don't need to model the data with a specific distribution in order to make purely descriptive or summary statements about the data itself: mean, median, mode, variance, skewness, ...
The problem is that explaining standard deviation or kurtosis is pretty difficult with the audience I have. A simple graph with a line on, with a sharp peak showing "most people got close to this mark" or a semicircle-looking line showing a wide distribution of results is much easier for them to grasp.
posted by alby at 6:33 AM on January 13, 2011
Why are you modeling the scores? What are you trying to infer from them? How deadly-serious is your application?
I mean, at one extreme you probably shouldn't model scores at all but rather run the individual sets of responses to each question through a Rasch model.
On preview, just do a kernel density. Any competent statistical package, including R, can do this easily. It appears there are bolt-ons for Excel if you really don't want to leave that. But if this is to show students, what they probably want to see is a histogram by grade with the bins labeled by point ranges.
posted by ROU_Xenophobe at 6:33 AM on January 13, 2011
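The kernel density suggestion above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, assuming a Gaussian kernel and a common rule-of-thumb bandwidth (1.06 * s * n^(-1/5)); the function name `gaussian_kde_curve` and the sample marks are hypothetical, and a statistics package would normally do this for you.

```python
import math
from statistics import stdev


def gaussian_kde_curve(marks, grid, bandwidth=None):
    """Return a kernel density estimate evaluated at each point in grid."""
    n = len(marks)
    if bandwidth is None:
        # Rule-of-thumb bandwidth; an assumption, worth tuning by eye.
        bandwidth = 1.06 * stdev(marks) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    # Each mark contributes one Gaussian bump; the curve is their average.
    return [
        norm * sum(math.exp(-0.5 * ((x - m) / bandwidth) ** 2) for m in marks)
        for x in grid
    ]


# Hypothetical marks out of 100, invented for this example.
marks = [42, 55, 61, 64, 68, 70, 71, 73, 75, 78, 80, 82, 85, 88, 93]
grid = list(range(0, 101))
density = gaussian_kde_curve(marks, grid)
peak_mark = grid[density.index(max(density))]
print("smoothed curve peaks near mark", peak_mark)
```

Plotting `density` against `grid` gives the smooth line without committing to a named distribution. One caveat: near the ends of the mark scale some of the curve's area can spill past 0 or the maximum mark.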
If the point of this is a plot, the easiest thing to do is use a cumulative probability plot (or cumulative distribution function) or a percentile plot instead of a histogram. Beta is a very flexible 0-1 distribution, and mixtures even more so, so you should be able to produce a density that closely matches your data.
You can make a cumulative distribution plot in Excel by sorting the results, adding an index column i/N (for i = 1 to N), and plotting the mark on the x-axis against i/N on the y-axis. The interpretation is that at mark x, a fraction y of people did that well or worse. Swapping it to "that well or better" is more intuitive for some people.
posted by a robot made out of meat at 6:39 AM on January 13, 2011
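The sort-and-index recipe above translates directly into code. This is a minimal sketch in plain Python; the function names `at_or_below` and `at_or_above` and the sample marks are hypothetical. With tied marks, the last pair listed for a given mark carries the correct "or worse" fraction, and the first pair carries the correct "or better" fraction.

```python
def at_or_below(marks):
    """Pairs of (mark, fraction of people who did that well or worse)."""
    ordered = sorted(marks)
    n = len(ordered)
    return [(mark, (i + 1) / n) for i, mark in enumerate(ordered)]


def at_or_above(marks):
    """Pairs of (mark, fraction of people who did that well or better)."""
    ordered = sorted(marks)
    n = len(ordered)
    return [(mark, (n - i) / n) for i, mark in enumerate(ordered)]


# Hypothetical marks out of 100, invented for this example.
marks = [42, 55, 61, 64, 68, 70, 71, 73, 75, 78, 80, 82, 85, 88, 93]
cdf = at_or_below(marks)
survival = at_or_above(marks)
for (mark, worse), (_, better) in zip(cdf, survival):
    print(f"{mark:3d}  or worse: {worse:.2f}  or better: {better:.2f}")
```

Plotting the first element of each pair on the x-axis and the second on the y-axis gives the cumulative plot, with no bin widths to choose.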
As a student, I always just want to see the histogram. It tells me how many people earned above/below my score, which is what usually mattered. Obviously, tweak the bin widths to improve the appearance.
Are your students failing to conceptualize the data presented in the histogram, so you're looking for something even simpler to show them? Or is this just an issue of aesthetics?
posted by Metasyntactic at 6:57 AM on January 13, 2011
...and yeah, seconding a kernel density or CDF if you just want a nice, smoother plot.
Those don't have a prior over what the distribution should look like, so you won't get bitten in the way you would if you chose a single Gaussian but your data winds up having two peaks.
posted by Metasyntactic at 7:10 AM on January 13, 2011
I am not well-versed in stats, but having worked with data visualization for a short time, I was thinking you could get away with scatter plots and control charts. The scatter plot gives you the "distribution" of the data points around an "average" that you can choose.
Choosing the appropriate control chart may also be helpful.
posted by theobserver at 7:44 AM on January 13, 2011
This thread is closed to new comments.
posted by alby at 1:29 AM on January 13, 2011