# How to solve a complex statistics problem with a script?November 30, 2012 6:26 PM   Subscribe

In this game, you roll a number of six-sided dice to get a total. The total is either the highest single die result, or the sum of any multiples rolled, whichever is higher. For example: If I roll three dice and get a 3, 4, and 6, my total is 6. But if I roll a 4, 4, and 6, my total is 8, the sum of the two 4s. What I want to find out is the mean, median, mode, and standard deviation of the possible totals given N dice. How might I create a simple script to compute this?

With two or three dice, I can easily figure this out by listing all the possible results, basically by brute force. But that's a pretty labor-intensive method when it comes to four or more dice.

I have a Macintosh, and I'm comfortable using the Unix command line and several programming languages for simple problems, but I'm not even sure where to start automating something like this. I'd be grateful for any guidance.
posted by j0hnpaul to Computers & Internet (24 answers total) 2 users marked this as a favorite

For large N, it's essentially certain that the score will come from the sum of all the 6's, of which there will be about N/6. So you're going to have a distribution whose median, mean, and mode are all very close to N, and whose standard deviation is on order of sqrt(N).
posted by escabeche at 6:34 PM on November 30, 2012 [1 favorite]

Try creating a Monte Carlo simulation that will run a large number of trials. This can be done in a spreadsheet program like Excel where you can use the rand() function to generate random values to simulate the dice throws. from there, write a script to compare the values of each die. compute a score for each trial. with a large sample of trials and computed scores, you can then determine with confidence the statistical values you're looking for a given value of n. maybe if you run the Monte Carlo simulations for a few values of n, then perhaps a general formula for your statistical values could be inferred.
posted by Gosha_Dog at 7:06 PM on November 30, 2012 [1 favorite]

What happens with multiple multiples? What if I roll 1, 2, 2, 2, 3, 4, 4, 5?

Do I take 2+2+2=6, or 4+4=8? Or does "sum of any multiples" mean I get 2+2+2+4+4=14?
posted by fings at 7:06 PM on November 30, 2012 [1 favorite]

@fings, that's a shrewd question, and I'm sorry I wasn't clear. I should have said "the highest sum of any dice that show the same number". So if you roll 1, 2, 2, 2, 3, 4, 4, and 5, you would combine the 4s and your total would be 8.

I also realize that the open-ended way I asked the question might make the solution harder than necessary. I'm most interested in the results for rolling three to nine dice, not fifty or a hundred.
posted by j0hnpaul at 7:16 PM on November 30, 2012

Off the top of my head, I've written a Monte Carlo simulator for this here.

I haven't put any effort into debugging it, or handling fing's question both ways, or calculating the stats you want in the end, mostly because I think escabeche's point is very relevant, and I suspect this mechanic would not be very useful in a game.
posted by Monsieur Caution at 7:16 PM on November 30, 2012

I have a Macintosh, and I'm comfortable using the Unix command line and several programming languages for simple problems, but I'm not even sure where to start automating something like this. I'd be grateful for any guidance.

R is not a bad good language for simulating probabilities. You can do something like
`mean(replicate(1000, throw(n)))`
And it will give you the mean of 1000 calls of throw(n). Of course you need to write a function that throws n dies and calculates the score...
posted by BungaDunga at 7:28 PM on November 30, 2012

For a small number of dice (12D6 only means 2 billion combinations, piece of cake), ignore the Monte Carlo suggestion and just count every possibility.

For the remainder, I'll use the notation xDy for x dice with y sides each.

If you only care about a fixed number of dice and number of sides, just use x nested loops each counting to y, trivial to write.

If you want those variable, use an array of length x.
Initialize it to all 1s.
while the last member remains less than y,
score the current value;
increment the first index;
and propagate propagate rollovers as far into the array as necessary.

Histogram (or however you want to visualize) your scores. Done.
posted by pla at 7:33 PM on November 30, 2012 [6 favorites]

So you can simplify your scoring a bit, in that you just want the greatest of sum(all 1s), sum(all 2s), ..., sum(all 6s) for N dice. If there's only 1 six, the sum is 6, and if that's the greatest sum, you want that. You don't need to bother to figure out if you got multiples of something, just sum each value.
posted by fings at 7:37 PM on November 30, 2012 [2 favorites]

You can use the perl script below to start you off. I tried it out, and it looks like it calculates the result of each roll accurately from what I can tell, using 6 dice. It prints the roll result, which I guess you could dump into an Excel spreadsheet and then finesse with their statistical functions, or you could use perl to work the resultant array.

I think the key to coding it is to use a variable for each possible result of the roll, to count how many times that result has occurred.

#/usr/bin/pl

@results = ();

for(\$a=1;\$a<> \$max;
}
return \$max;
}

posted by alphanerd at 7:38 PM on November 30, 2012

Yikes, that didn't come out at all. Check your memail in a sec.
posted by alphanerd at 7:38 PM on November 30, 2012

I started thinking, what's the minimum you an score with 9 dice? You can't score below a 5: 111122345. Any more 1s, you'd have 5. Any more 2s, you'd score 6. And so on. (111112234 scores the same).
posted by fings at 7:44 PM on November 30, 2012 [1 favorite]

In R, something like this:
```score <- function(l)
{
y<-c();
for(x in 1:6) {
y=c(y,sum(l[l==x]))
};
max(c(y, max(y)))
};
throw <- function(n) {
score(sample(1:6, n, replace=T))
};
mean(replicate(10000,throw(10)))
```
Throws 10 dice 10,000 times, gives you the mean.
posted by BungaDunga at 7:54 PM on November 30, 2012

The comments suggest that this task is above my programming chops. pla's solution sounds like the closest to what I need, but I'm not quite sure how to implement the details.

My experience with scripting is mostly limited to front-end web design and simple shell scripts, and I thought that would be a firm enough foundation to do this kind of math operation with some guidance, but I've been humbled. I might need a lot more guidance than I thought.
posted by j0hnpaul at 10:54 PM on November 30, 2012

this is not complicated if you can write a script. you can do it!! all you need is the following abilities:

-to make a for loop
-to generate a random number (typically these are uniformly distributed between 0 and 1)
-to compute and store a sum into a variable (addition subtraction etc)
-to make an array (which is like a variable but it has an index, look it up in your programming language guide)
-to write an if-then statement

that is all you need! no more, no less. i think that anyone with basic programming experience, even super simple, could do this. (including you!) i think it will take you less than an hour.

---

here's your pseudocode for the case of 3 dice rolls:

for i=1:10000 (for loop through the number of random rolls you want to use, 10K is a good number)

- generate 3 uniform random numbers, call them x1, x2, and x3

- use those 3 random numbers to get yourself 3 dice rolls as follows:
if the rand number is between 0 and 1/6, you rolled a 1.
if the rand number is between 1/6 and 2/6 you rolled a 2, and so on up to 6.
make these into new variables r1, r2, and r3. for instance:
if x1 >= 0 and x1 < 1 then r1 = 1.
you can do this in a very ugly (but it will work) manner by writing lots of if-then statements.

- take your dice rolls r1, r2, and r3, and get your total score. you can again just write a whole bunch of if-then statements, nested within each other, to go through all the cases.
for example, if r1=r2 and r3< r1+r2, then my_dice_score=r1+r2.

- next you will need an array to store all 10,000 of the rolls. almost every programming language has an array structure. this is just a variable with an index, where you can store 10,000 of something without having to create new variable names each time. let's say you've made an array called all_my_sums. then you save the current results into it:

all_the_sums(i)=my_dice_score; (the index i refers to how far you've gotten into the for loop)

- end the for loop (which means you do all of the above 10,000 times)

then you will have a list of all 10,000 scores, stored into into the array all_the_sums. you can compute the mean, median, stdev of this. you can maybe also make a histogram. your programming language of choice probably has all those as one-line commands. there should be something like median(all_the_sums) that just prints the answer to the screen.

I feel like this is a very simple programming project that you could do in an hour and learn something! you can do it!
posted by kellybird at 12:05 AM on December 1, 2012 [3 favorites]

if you are daunted, you might start by testing each of the operations alone, before you try to compose them all into a program.

for instance, generate a random number and print it to the screen.

make an array with 3 things in it, take the median.

write a simple for loop that prints out "hello world" each time.

write the code that gives you three random numbers and three dice rolls from those, and make sure it works.

practice making an if-then statment.

etc.

once you are comfortable doing each individual thing, you can combine them all into a program.
(ps, remember that you want to enumerate ALL the cases, so check if r1=r2, but also if r2=r3, and r1=r3.)

sorry if this is so simple as to insult your intelligence. i am basically a programming master (by training, anyway, haha) and i go through this exact process whenever i am encountering some new function or have to work my way through a complicated (to me) task.
posted by kellybird at 12:10 AM on December 1, 2012 [1 favorite]

oops, i made a typo two posts above... when you are computing the dice roll r1 from the uniform random variable x1 it should have said:

if x1 >= 0 and x1 (is less than 1/6) then r1 = 1

if x1 >= 0 and x1 < 1 then r1 = 1

(metafilter mixed up the formatting with all the greater than and less than signs turning into html etc)
posted by kellybird at 12:22 AM on December 1, 2012

Agreed that pla's solution works out quite nicely for the number of dice you're looking at.

An approach that works well for somewhat larger values of N—probably up to about 120, but depending on how fast your computer is and how long you're willing to wait—would be for the values which are iterated to be the number of dice showing a given value. This is a bit more complex because, unlike pla's algorithm in which each outcome enumerated is equally likely, you have to determine the number of outcomes corresponding to a particular situation. However, that's not too hard—it's just N!/(n1!*n2!*n3!*n4!*n5!*n6!)—where n1 is the number of dice showing a 1, n2 is the number of dice showing a 2, and so forth. This requires just 5 loops (more generally, for xDy, just y-1 loops rather than x loops). The number of possibilities to be enumerated can be shown to be C(N+5,5) or (N+5)!/(N!*5!). For 100d6 that's about 96 million possibilities. Even for smallish numbers of dice this method may be considerably more efficient: for 12d6, you only have to look at 6188 possibilities rather than the 2 billion required by pla's method.

The basic idea, as pseudocode, is:
```for n1=0 to N
for n2=0 to N-n1
for n3=0 to N-n1-n2
for n4=0 to N-n1-n2-n3
for n5=0 to N-n1-n2-n3-n4
n6=N-n1-n2-n3-n4-n5
calculate result for that combination of dice
calculate number of outcomes for that combination of dice and add to a tabulation
next n5
next n4
next n3
next n2
next n1
calculate required data (mean, median, mode, etc.)
```
A further optimization could be made by inverting the order of the loops—making n6 the outermost loop and n2 the innermost—and after each value is determined, check whether the remaining dice could possibly make a difference, and if not, skip the inner loops. E.g., after determining the number of 6's, calculate the sum of the 6's and ask if it would change the result if all remaining dice were 5's (if at least 5/11 of the dice are showing 6's, it would not), and if not, take the result right then and skip the inner loops. If so, then cycle through possible numbers of dice showing 5's, and for each one calculate the result based on 6's and 5's alone, ask if it would change the result if all remaining dice were 4's, and if not, skip the inner loops, and so forth. Of course, you need to alter the calculation for the number of results when you skip some of the inner loops.
posted by DevilsAdvocate at 5:59 AM on December 1, 2012

This sounds like a multinomial distribution. Consider the polynomial (a+b+c+d+e+f)^n where n is the number of dice. Each term is a multinomial coefficient times a product of powers of a through f. Each such term represents an outcome of a throw with the coefficient giving the multiplicity of its occurrence. To calculate the mean, what you want to do is replace each term with its score and divide by 6^n (the total number of possible outcomes which is also the sum of the multinomial coefficients.)

I could probably say this more clearly if I weren't defeated by the notational task.
posted by Obscure Reference at 6:22 AM on December 1, 2012

obscure reference - it's not a multinomial distribution because you are taking maxima, and because of the weird rules for getting the total score of the roll based on the 3 dice. there isn't a simple closed form here. it can probably be derived with a few pages of algebra and looking up results about maximum statistics of integers. however that is probably beyond the scope of what the poster wants to do.
posted by kellybird at 8:15 AM on December 1, 2012

The maximum and the weird rules would be accounted for in replacing each term with its score. E.g. coef * a^3 * b^2 * e * f would be counted as the maximum of 3*a, 2*b, e, and f.
where a, b, e, and f are taken from the numbers 1 through 6 and n = 3 + 2 + 1 + 1 = 7 in this example.
posted by Obscure Reference at 8:37 AM on December 1, 2012

Here you go. BungaDunga's script works perfectly in R. For numbers of dice 1 to 40, the mean can be approximated as 3.54 + 1.08*N (where N is the number of dice), although small numbers of dice will be less than the approximation and large numbers will be slightly higher. Note that the median is somewhat unpredictable, but always pretty close to the mean.
posted by one_bean at 9:37 AM on December 1, 2012 [1 favorite]

Wait, but I am I wrong that the ratio mean/N has to go to 1 in the limit?
posted by escabeche at 10:41 AM on December 1, 2012

It does, but not until you get into the 1000's of dice.
posted by one_bean at 10:53 AM on December 1, 2012

Wait, but I am I wrong that the ratio mean/N has to go to 1 in the limit?

In the limit you'll get N/6 sixes, so yes. For example, simulating throwing 103 dice 105 times I get a mean of 1001.96 and a standard deviation of 67.90 (therefore a standard error of 0.21).

To get the asymptotics you can probably figure that if the result isn't coming from the sixes, it's coming from the fives, and the limiting distribution of the number of fives and sixes is jointly Gaussian.

Also, there's probably a better way than the naive one of simulating draws from a multinomial distribution, but I don't feel like thinking about that right now. (I literally just threw out all my notes from when I taught probability, because I don't want to drag those around.)
posted by madcaptenor at 2:25 PM on December 1, 2012

« Older Raw buttermilk vs. raw whole milk   |   Please Recommend a Case with 8x SATA Hotswap Bays Newer »