Comments on: How to solve a complex statistics problem with a script?
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script/
Comments on Ask MetaFilter post How to solve a complex statistics problem with a script?
Fri, 30 Nov 2012 18:26:42 -0800
Fri, 30 Nov 2012 18:34:00 -0800
Question: How to solve a complex statistics problem with a script?
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script
In this game, you roll a number of six-sided dice to get a <strong>total</strong>. The total is either the highest single die result, or the sum of any multiples rolled, whichever is higher.
For example: If I roll three dice and get a 3, 4, and 6, my total is 6. But if I roll a 4, 4, and 6, my total is 8, the sum of the two 4s.
What I want to find out is the mean, median, mode, and standard deviation of the possible totals given N dice. How might I create a simple script to compute this? <br /><br /> With two or three dice, I can easily figure this out by listing all the possible results, basically by brute force. But that's a pretty labor-intensive method when it comes to four or more dice.<br>
<br>
I have a Macintosh, and I'm comfortable using the Unix command line and several programming languages for simple problems, but I'm not even sure where to start automating something like this. I'd be grateful for any guidance.post:ask.metafilter.com,2012:site.230075Fri, 30 Nov 2012 18:26:42 -0800j0hnpaulmathdicestatisticsprobabilitymeanmedianmodestandarddeviationd6programmingunixbashPHPrubyperljsJavaScriptBy: escabeche
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329557
For large N, it's essentially certain that the score will come from the sum of all the 6's, of which there will be about N/6. So you're going to have a distribution whose median, mean, and mode are all very close to N, and whose standard deviation is on the order of sqrt(N).comment:ask.metafilter.com,2012:site.230075-3329557Fri, 30 Nov 2012 18:34:00 -0800escabecheBy: Gosha_Dog
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329578
Try creating a Monte Carlo simulation that runs a large number of trials. This can be done in a spreadsheet program like Excel, where you can use the rand() function to generate random values to simulate the dice throws. From there, write a script to compare the values of each die and compute a score for each trial. With a large sample of trials and computed scores, you can then determine with confidence the statistical values you're looking for, for a given value of n. If you run the Monte Carlo simulations for a few values of n, then perhaps a general formula for your statistical values could be inferred.comment:ask.metafilter.com,2012:site.230075-3329578Fri, 30 Nov 2012 19:06:32 -0800Gosha_DogBy: fings
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329579
What happens with multiple multiples? What if I roll 1, 2, 2, 2, 3, 4, 4, 5?<br>
<br>
Do I take 2+2+2=6, or 4+4=8? Or does "sum of any multiples" mean I get 2+2+2+4+4=14?comment:ask.metafilter.com,2012:site.230075-3329579Fri, 30 Nov 2012 19:06:57 -0800fingsBy: j0hnpaul
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329586
@fings, that's a shrewd question, and I'm sorry I wasn't clear. I should have said "the highest sum of any dice that show the same number". So if you roll 1, 2, 2, 2, 3, 4, 4, and 5, you would combine the 4s and your total would be 8.<br>
<br>
I also realize that the open-ended way I asked the question might make the solution harder than necessary. I'm most interested in the results for rolling three to nine dice, not fifty or a hundred.comment:ask.metafilter.com,2012:site.230075-3329586Fri, 30 Nov 2012 19:16:47 -0800j0hnpaulBy: Monsieur Caution
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329587
Off the top of my head, I've written a Monte Carlo simulator for this <a href="http://jsbin.com/isupix/1/edit">here</a>.<br>
<br>
I haven't put any effort into debugging it, or handling fings's question both ways, or calculating the stats you want in the end, mostly because I think escabeche's point is very relevant, and I suspect this mechanic would not be very useful in a game.comment:ask.metafilter.com,2012:site.230075-3329587Fri, 30 Nov 2012 19:16:54 -0800Monsieur CautionBy: BungaDunga
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329593
<em> I have a Macintosh, and I'm comfortable using the Unix command line and several programming languages for simple problems, but I'm not even sure where to start automating something like this. I'd be grateful for any guidance.</em><br>
<br>
R is a good language for simulating probabilities. You can do something like<br>
<br>
<pre>mean(replicate(1000, throw(n)))</pre><br>
<br>
And it will give you the mean of 1000 calls of throw(n). Of course you need to write a function that throws n dice and calculates the score...comment:ask.metafilter.com,2012:site.230075-3329593Fri, 30 Nov 2012 19:28:53 -0800BungaDungaBy: pla
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329595
For a small number of dice (12D6 only means 2 billion combinations, piece of cake), ignore the Monte Carlo suggestion and just count every possibility.<br>
<br>
For the remainder, I'll use the notation xDy for x dice with y sides each.<br>
<br>
If you only care about a fixed number of dice and number of sides, just use x nested loops each counting to y, trivial to write.<br>
<br>
If you want those variable, use an array of length x.<br>
Initialize it to all 1s.<br>
until the last member rolls over past y,<br>
score the current value;<br>
increment the first index;<br>
and propagate rollovers as far into the array as necessary.<br>
<br>
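A minimal Python sketch of this count-every-possibility approach, under a few assumptions: it uses `itertools.product` in place of the hand-rolled array with rollovers, it scores a roll as the largest per-face sum (the rule as clarified earlier in the thread), and the names `score`, `exact_distribution`, and `summarize` are made up for illustration.

```python
from collections import Counter
from itertools import product
from statistics import median, pstdev


def score(roll):
    """Largest per-face sum; a lone die counts as a 'sum' of one die."""
    return max(face * count for face, count in Counter(roll).items())


def exact_distribution(n_dice, sides=6):
    """Tally the total for every one of the sides**n_dice equally likely rolls."""
    tally = Counter()
    for roll in product(range(1, sides + 1), repeat=n_dice):
        tally[score(roll)] += 1
    return tally


def summarize(tally):
    """Return (mean, median, mode, population standard deviation)."""
    totals = sorted(tally.elements())  # one entry per enumerated roll
    mean = sum(totals) / len(totals)
    # If several totals tie for most common, this reports only one of them.
    mode = tally.most_common(1)[0][0]
    return mean, median(totals), mode, pstdev(totals)
```

Calling `summarize(exact_distribution(3))` covers the three-dice case; nine dice means about ten million rolls, which is slow in plain Python but still feasible.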
Histogram (or however you want to visualize) your scores. Done.comment:ask.metafilter.com,2012:site.230075-3329595Fri, 30 Nov 2012 19:33:22 -0800plaBy: fings
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329598
So you can simplify your scoring a bit, in that you just want the greatest of sum(all 1s), sum(all 2s), ..., sum(all 6s) for N dice. If there's only 1 six, the sum is 6, and if that's the greatest sum, you want that. You don't need to bother to figure out if you got multiples of something, just sum each value.comment:ask.metafilter.com,2012:site.230075-3329598Fri, 30 Nov 2012 19:37:02 -0800fingsBy: alphanerd
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329599
You can use the perl script below to start you off. I tried it out with 6 dice, and as far as I can tell it calculates the result of each roll accurately. It prints the roll result, which you could dump into an Excel spreadsheet and then finesse with its statistical functions, or you could use perl to work the resultant array.<br>
<br>
I think the key to coding it is to use a variable for each possible result of the roll, to count how many times that result has occurred.<br>
<br>
#/usr/bin/pl<br>
<br>
@results = ();<br>
<br>
for($a=1;$a<> $max;<br>
}<br>
return $max;<br>
}<br>
</>comment:ask.metafilter.com,2012:site.230075-3329599Fri, 30 Nov 2012 19:38:03 -0800alphanerdBy: alphanerd
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329600
Yikes, that didn't come out at all. Check your memail in a sec.comment:ask.metafilter.com,2012:site.230075-3329600Fri, 30 Nov 2012 19:38:38 -0800alphanerdBy: fings
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329602
I started thinking, what's the minimum you can score with 9 dice? You can't score below a 5: 111122345. Any more 1s, you'd have 5. Any more 2s, you'd score 6. And so on. (111112234 scores the same).comment:ask.metafilter.com,2012:site.230075-3329602Fri, 30 Nov 2012 19:44:41 -0800fingsBy: BungaDunga
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329606
In R, something like this:<br>
<br>
<pre><br>
# Score a roll: the largest per-face sum (a lone die counts as a sum of one).<br>
score <- function(l) {<br>
  max(sapply(1:6, function(x) sum(l[l == x])))<br>
}<br>
# Roll n six-sided dice and score the result.<br>
throw <- function(n) {<br>
  score(sample(1:6, n, replace = TRUE))<br>
}<br>
mean(replicate(10000, throw(10)))<br>
</pre><br>
<br>
Throws 10 dice 10,000 times, gives you the mean.comment:ask.metafilter.com,2012:site.230075-3329606Fri, 30 Nov 2012 19:54:31 -0800BungaDungaBy: j0hnpaul
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329684
The comments suggest that this task is above my programming chops. <a href="http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329595">pla's solution</a> sounds like the closest to what I need, but I'm not quite sure how to implement the details.<br>
<br>
My experience with scripting is mostly limited to front-end web design and simple shell scripts, and I thought that would be a firm enough foundation to do this kind of math operation with some guidance, but I've been humbled. I might need a lot more guidance than I thought.comment:ask.metafilter.com,2012:site.230075-3329684Fri, 30 Nov 2012 22:54:37 -0800j0hnpaulBy: kellybird
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329699
this is not complicated if you can write a script. you can do it!! all you need is the following abilities:<br>
<br>
-to make a for loop<br>
-to generate a random number (typically these are uniformly distributed between 0 and 1)<br>
-to compute and store a sum into a variable (addition subtraction etc)<br>
-to make an array (which is like a variable but it has an index, look it up in your programming language guide)<br>
-to write an if-then statement<br>
<br>
that is all you need! no more, no less. i think that anyone with basic programming experience, even super simple, could do this. (including you!) i think it will take you less than an hour.<br>
<br>
---<br>
<br>
here's your pseudocode for the case of 3 dice rolls:<br>
<br>
for i=1:10000 (for loop through the number of random rolls you want to use, 10K is a good number)<br>
<br>
- generate 3 uniform random numbers, call them x1, x2, and x3<br>
<br>
- use those 3 random numbers to get yourself 3 dice rolls as follows:<br>
if the rand number is between 0 and 1/6, you rolled a 1. <br>
if the rand number is between 1/6 and 2/6 you rolled a 2, and so on up to 6. <br>
make these into new variables r1, r2, and r3. for instance:<br>
if x1 >= 0 and x1 < 1 then r1 = 1.<br>
you can do this in a very ugly (but it will work) manner by writing lots of if-then statements.<br>
<br>
- take your dice rolls r1, r2, and r3, and get your total score. you can again just write a whole bunch of if-then statements, nested within each other, to go through all the cases. <br>
for example, if r1=r2 and r3< r1+r2, then my_dice_score=r1+r2.<br>
<br>
- next you will need an array to store all 10,000 of the rolls. almost every programming language has an array structure. this is just a variable with an index, where you can store 10,000 of something without having to create new variable names each time. let's say you've made an array called all_the_sums. then you save the current results into it: <br>
<br>
all_the_sums(i)=my_dice_score; (the index i refers to how far you've gotten into the for loop)<br>
<br>
- end the for loop (which means you do all of the above 10,000 times)<br>
<br>
then you will have a list of all 10,000 scores, stored into the array all_the_sums. you can compute the mean, median, stdev of this. you can maybe also make a histogram. your programming language of choice probably has all those as one-line commands. there should be something like median(all_the_sums) that just prints the answer to the screen. <br>
<br>
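The pseudocode above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the only way to do it: `random.randint(1, 6)` stands in for the "split a uniform number into six bins" step, the score is taken as the largest per-face sum rather than nested if-then statements, and the function names `roll_total` and `simulate` are invented here.

```python
import random
from collections import Counter
from statistics import mean, median, mode, stdev


def roll_total(n_dice, rng=random):
    """Roll n_dice six-sided dice and return the game's total."""
    faces = [rng.randint(1, 6) for _ in range(n_dice)]
    # Sum the dice showing each face and keep the largest of those sums.
    return max(face * count for face, count in Counter(faces).items())


def simulate(n_dice, trials=10_000, seed=None):
    """Run the experiment `trials` times and summarize the totals."""
    rng = random.Random(seed)  # pass a seed to make a run repeatable
    all_the_sums = [roll_total(n_dice, rng) for _ in range(trials)]
    return {
        "mean": mean(all_the_sums),
        "median": median(all_the_sums),
        # On ties, statistics.mode (Python 3.8+) returns the first value seen.
        "mode": mode(all_the_sums),
        "stdev": stdev(all_the_sums),
    }
```

Because the dice are random, `simulate(3)` gives slightly different estimates on each run unless a seed is supplied.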
I feel like this is a very simple programming project that you could do in an hour and learn something! you can do it!comment:ask.metafilter.com,2012:site.230075-3329699Sat, 01 Dec 2012 00:05:26 -0800kellybirdBy: kellybird
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329700
if you are daunted, you might start by testing each of the operations alone, before you try to compose them all into a program. <br>
<br>
for instance, generate a random number and print it to the screen.<br>
<br>
make an array with 3 things in it, take the median.<br>
<br>
write a simple for loop that prints out "hello world" each time.<br>
<br>
write the code that gives you three random numbers and three dice rolls from those, and make sure it works.<br>
<br>
practice making an if-then statement. <br>
<br>
etc. <br>
<br>
once you are comfortable doing each individual thing, you can combine them all into a program.<br>
(ps, remember that you want to enumerate ALL the cases, so check if r1=r2, but also if r2=r3, and r1=r3.)<br>
<br>
sorry if this is so simple as to insult your intelligence. i am basically a programming master (by training, anyway, haha) and i go through this exact process whenever i am encountering some new function or have to work my way through a complicated (to me) task.comment:ask.metafilter.com,2012:site.230075-3329700Sat, 01 Dec 2012 00:10:24 -0800kellybirdBy: kellybird
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329703
oops, i made a typo two posts above... when you are computing the dice roll r1 from the uniform random variable x1 it should have said:<br>
<br>
if x1 >= 0 and x1 (is less than 1/6) then r1 = 1 <br>
<br>
instead of<br>
<br>
if x1 >= 0 and x1 < 1 then r1 = 1 <br>
<br>
(metafilter mixed up the formatting with all the greater than and less than signs turning into html etc)comment:ask.metafilter.com,2012:site.230075-3329703Sat, 01 Dec 2012 00:22:06 -0800kellybirdBy: DevilsAdvocate
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329779
Agreed that pla's solution works out quite nicely for the number of dice you're looking at.<br>
<br>
An approach that works well for somewhat larger values of N—probably up to about 120, but depending on how fast your computer is and how long you're willing to wait—would be for the values which are iterated to be the number of dice showing a given value. This is a bit more complex because, unlike pla's algorithm in which each outcome enumerated is equally likely, you have to determine the number of outcomes corresponding to a particular situation. However, that's not too hard—it's just N!/(n1!*n2!*n3!*n4!*n5!*n6!)—where n1 is the number of dice showing a 1, n2 is the number of dice showing a 2, and so forth. This requires just 5 loops (more generally, for xDy, just y-1 loops rather than x loops). The number of possibilities to be enumerated can be shown to be C(N+5,5) or (N+5)!/(N!*5!). For 100d6 that's about 96 million possibilities. Even for smallish numbers of dice this method may be considerably more efficient: for 12d6, you only have to look at 6188 possibilities rather than the 2 billion required by pla's method.<br>
<br>
The basic idea, as pseudocode, is:<br>
<pre><br>
for n1=0 to N<br>
for n2=0 to N-n1<br>
for n3=0 to N-n1-n2<br>
for n4=0 to N-n1-n2-n3<br>
for n5=0 to N-n1-n2-n3-n4<br>
n6=N-n1-n2-n3-n4-n5<br>
calculate result for that combination of dice<br>
calculate number of outcomes for that combination of dice and add to a tabulation<br>
next n5<br>
next n4<br>
next n3<br>
next n2<br>
next n1<br>
calculate required data (mean, median, mode, etc.)<br>
</pre><br>
<br>
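As a hedged illustration, the nested loops above can be written in Python with a small recursive generator instead of five hand-written loops. The names `compositions`, `exact_distribution_by_counts`, and `mean_total` are made up for this sketch, and the score is assumed to be the largest per-face sum.

```python
from collections import Counter
from math import factorial


def compositions(n, parts):
    """Yield every tuple of `parts` non-negative integers that sums to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest


def exact_distribution_by_counts(n_dice, sides=6):
    """Map each total to the number of ordered rolls that produce it."""
    tally = Counter()
    for counts in compositions(n_dice, sides):
        # counts[k - 1] is how many dice show face k.
        total = max(face * c for face, c in enumerate(counts, start=1))
        # Multinomial coefficient N! / (n1! * n2! * ... * n6!).
        ways = factorial(n_dice)
        for c in counts:
            ways //= factorial(c)
        tally[total] += ways
    return tally


def mean_total(tally):
    """Weighted mean of the totals, weighting each by its number of rolls."""
    rolls = sum(tally.values())
    return sum(total * ways for total, ways in tally.items()) / rolls
```

The tally holds exact counts out of 6^N, so the median, mode, and standard deviation can be computed from it the same way as the mean.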
A further optimization could be made by inverting the order of the loops—making n6 the outermost loop and n2 the innermost—and after each value is determined, check whether the remaining dice could possibly make a difference, and if not, skip the inner loops. E.g., after determining the number of 6's, calculate the sum of the 6's and ask if it would change the result if all remaining dice were 5's (if at least 5/11 of the dice are showing 6's, it would not), and if not, take the result right then and skip the inner loops. If so, then cycle through possible numbers of dice showing 5's, and for each one calculate the result based on 6's and 5's alone, ask if it would change the result if all remaining dice were 4's, and if not, skip the inner loops, and so forth. Of course, you need to alter the calculation for the number of results when you skip some of the inner loops.comment:ask.metafilter.com,2012:site.230075-3329779Sat, 01 Dec 2012 05:59:06 -0800DevilsAdvocateBy: Obscure Reference
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329787
This sounds like a multinomial distribution. Consider the polynomial (a+b+c+d+e+f)^n where n is the number of dice. Each term is a <a href="http://en.wikipedia.org/wiki/Multinomial_theorem">multinomial coefficient</a> times a product of powers of a through f. Each such term represents an outcome of a throw with the coefficient giving the multiplicity of its occurrence. To calculate the mean, what you want to do is replace each term with its score and divide by 6^n (the total number of possible outcomes which is also the sum of the multinomial coefficients.)<br>
<br>
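In symbols, one way to write the recipe above is the following sketch, where n_k (the number of dice showing face k) is notation introduced here rather than taken from the thread, and the score is assumed to be the largest per-face sum:

```latex
\mathbb{E}[\text{total}]
  = \frac{1}{6^{n}}
    \sum_{\substack{n_1,\dots,n_6 \ge 0 \\ n_1+\cdots+n_6 = n}}
    \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!\,n_5!\,n_6!}
    \;\max_{1 \le k \le 6} k\,n_k
```

Each summand is one term of (a+b+c+d+e+f)^n with its product of powers replaced by the score of that outcome.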
I could probably say this more clearly if I weren't defeated by the notational task.comment:ask.metafilter.com,2012:site.230075-3329787Sat, 01 Dec 2012 06:22:34 -0800Obscure ReferenceBy: kellybird
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329862
obscure reference - it's not a multinomial distribution because you are taking maxima, and because of the weird rules for getting the total score of the roll based on the 3 dice. there isn't a simple closed form here. it can probably be derived with a few pages of algebra and looking up results about maximum statistics of integers. however that is probably beyond the scope of what the poster wants to do.comment:ask.metafilter.com,2012:site.230075-3329862Sat, 01 Dec 2012 08:15:28 -0800kellybirdBy: Obscure Reference
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329872
The maximum and the weird rules would be accounted for in replacing each term with its score. E.g. coef * a^3 * b^2 * e * f would be counted as the maximum of 3*a, 2*b, e, and f,<br>
where a, b, e, and f are taken from the numbers 1 through 6 and n = 3 + 2 + 1 + 1 = 7 in this example.comment:ask.metafilter.com,2012:site.230075-3329872Sat, 01 Dec 2012 08:37:34 -0800Obscure ReferenceBy: one_bean
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329904
<a href="http://i.imgur.com/K1Tmy.png">Here you go</a>. BungaDunga's script works perfectly in R. For numbers of dice 1 to 40, the mean can be approximated as 3.54 + 1.08*N (where N is the number of dice), although small numbers of dice will be less than the approximation and large numbers will be slightly higher. Note that the median is somewhat unpredictable, but always pretty close to the mean.comment:ask.metafilter.com,2012:site.230075-3329904Sat, 01 Dec 2012 09:37:21 -0800one_beanBy: escabeche
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329941
Wait, but am I wrong that the ratio mean/N has to go to 1 in the limit?comment:ask.metafilter.com,2012:site.230075-3329941Sat, 01 Dec 2012 10:41:42 -0800escabecheBy: one_bean
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3329951
It does, but not until you get into the 1000's of dice.comment:ask.metafilter.com,2012:site.230075-3329951Sat, 01 Dec 2012 10:53:56 -0800one_beanBy: madcaptenor
http://ask.metafilter.com/230075/How-to-solve-a-complex-statistics-problem-with-a-script#3330158
<i>Wait, but am I wrong that the ratio mean/N has to go to 1 in the limit?</i><br>
<br>
In the limit you'll get N/6 sixes, so yes. For example, simulating throwing 10<sup>3</sup> dice 10<sup>5</sup> times I get a mean of 1001.96 and a standard deviation of 67.90 (therefore a standard error of 0.21).<br>
<br>
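As a rough sanity check on those numbers, here is a small sketch that pretends the total always comes from the sixes, so the total is 6 times a binomial count of sixes. That simplification and the function name `sixes_only_approximation` are assumptions of this example, not part of the simulation quoted above.

```python
from math import sqrt


def sixes_only_approximation(n_dice):
    """Mean and standard deviation of 6 * (number of sixes among n_dice)."""
    p = 1 / 6                              # chance that a single die shows a six
    mean = 6 * n_dice * p                  # works out to n_dice
    sd = 6 * sqrt(n_dice * p * (1 - p))    # works out to sqrt(5 * n_dice)
    return mean, sd
```

For 1000 dice this gives a mean of 1000 and a standard deviation of about 70.7, in the same neighborhood as the simulated 1001.96 and 67.90. The two need not match exactly, since the real total is a maximum over all six faces rather than the sixes alone.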
To get the asymptotics you can probably figure that if the result isn't coming from the sixes, it's coming from the fives, and the limiting distribution of the number of fives and sixes is jointly Gaussian.<br>
<br>
Also, there's probably a better way than the naive one of simulating draws from a multinomial distribution, but I don't feel like thinking about that right now. (I literally just threw out all my notes from when I taught probability, because I don't want to drag those around.)comment:ask.metafilter.com,2012:site.230075-3330158Sat, 01 Dec 2012 14:25:23 -0800madcaptenor