A variation on the birthday paradox
July 14, 2010 5:32 PM   Subscribe

I have 232 facebook friends, and six of them share a birthday in common (not 6 different pairs of people on the same birthday, but 6 out of the 232 were born on the 17th of July. Now, I know it only takes 23 random people to get a 50% chance of at least ONE birthday collision, but how the heck do I figure out the odds on this one? Is this a significant anomaly, or reasonably expected? My one stats class was entirely too long ago....
posted by um_maverick to Education (25 answers total) 3 users marked this as a favorite
 
This previous question might be of some help as far as the links go.
posted by ishotjr at 5:38 PM on July 14, 2010


Especially this comment. It seems that birthdays may be clustered around July, August, and September. I have a disproportionate amount of FB friends with late June, early July, and March birthdays.
posted by ishotjr at 5:40 PM on July 14, 2010


My husband put the wrong date in his profile because he "doesn't want Facebook to know the real one" and he put it in July.
posted by Kimberly at 5:42 PM on July 14, 2010


Yup, I have this trend, too, in May/June/July. Six birthdays today (out of 820 friends). Usually it's one or two.
posted by roomthreeseventeen at 5:43 PM on July 14, 2010


Response by poster: Hmm, maybe my question was misleading - I'm not looking for whether or not july birthdays are common, but the formula for "given 232 friends, assuming random distribution (even though we now know this isn't the case) what are the odds of 6 of them sharing one birthday"

Thanks!
posted by um_maverick at 6:08 PM on July 14, 2010


Best answer: Well, if you want a theoretical answer, assuming each of 365 days of the year is equally likely as a birthday (pretending leap years don't exist), it's not a trivial question. But the analysis would go something like this.

Let P(a0, a1, a2, a3...) represent the probablity that, given a1+2*a2+3*a3... people, their birthdays are distribitued such that there are a0 days on which no people have birthdays, a1 days on which exactly one person has a birthday, a2 days on which exactly 2 people have birthdays, etc. We have the constraint that a0+a1+a2... = 365.

For example, P(359, 4, 1) is the probability that given six people, two of them will share a birthday, and the other four all have different birthdays (from those two and from each other).

Start with P(365) = 1. (The trivial case: given 0 people, the probability that there are 365 days on which no one has a birthday is 1.)

Likewise, P(364, 1) = 1. (Given one person, the probability that there is one day which is that person's birthday, and 364 days which are no one's birthday, is 1. This can actually be derived from the previous case and the rules I'm about to lay down, but it may be easier to list separately since it's also a trivial case.)

Now, given N people, the probabilities for each possible case can be derived from the probabilities for N-1 people as follows:

P(a0, a1... ak-1, ak+1+1...) = P(a0, a1... ak, ak+1...)*ak/365.

Except that a given set of probabilities for N people may be "generated" from a set of N-1 people in multiple ways, in which case you have to add up the probabilities of all those possible ways. Also, if ak+1 didn't exist in the previous set, pretend it was there and was zero. (Essentially, any finite string of an's can actually be thought of as infinite, with all the unlisted an's as zero.)

So proceeding, the possibilities for 2 people are:

P(363, 2) [different birthdays] = P(364,1)*364/365 = 364/365

P(364, 0, 1) [same birthday] = P(364,1)*1/365 = 1/365

For 3 people:

P(362,3) [all three different birthdays] = P(363,2)*363/365 = 132132/133225

P(363,1,1) [two of the three share a birthday] = P(363,2)*2/365 + P(364, 0, 1)*364/365 = 1092/133225 [note this is the first case where we had to add multiple terms from the cases with N-1 people]

P(364,0,0,1) [all three share the same birthday] = P(364,0,1)*1/365 = 1/133225

Continue this procedure until you've analyzed the cases for N=232, and add up all the probabilities for which one of a6, a7, a8... is greater than zero, and that will give you the probability that given 232 people, at least six will share the same birthday.

I might try to program this later, but if anyone else wants to take a shot at it first, that's fine with me. Alternately, a Monte Carlo simulation would be a lot easier to program, although it's accuracy will of course be based on how many trials you run.
posted by DevilsAdvocate at 6:19 PM on July 14, 2010 [2 favorites]


Best answer: Figured out a way to do it without that clumsy recursion:

Think of the problem this way: imagine you have a one-dimensional board. The left-most space is 0, and to the right of it, the spaces are numbered 1, 2, 3, etc. The board extends infinitely to the right.

Start with 365 tokens on space 0. Each "move" consists of selecting 1 token at random, and moving it one space to the right. If you make 232 moves, what is the probability that at least one token has reached space 6 or greater?

The approach is to solve the converse problem—what is the probability that no token has moved beyond space 5—then subtract that probability from 1. (Which is akin to the way you solve the single collision problem: figure out what the probability is that no two people share a birthday, then subtract that from 1.)

So, list all possible distributions of tokens such that no token is beyond space five. I.e., list all solutions, in non-negative integers, of a0, a1... a5 given the restrictions:
a0+a1+a2+a3+a4+a5=365
a1+2*a2+3*a3+4*a4+5*a5=232

For each of these, P(a0, a1... a5) = (number of ways of selecting specific tokens to fill those spaces) * (number of "orders" in which the 232 moves could have occurred) * probability of a specific sequence of moves occurring.

Ways of selecting tokens to fill those spaces = 365!/a0!a1!a2!a3!a4!a5!

Number of arrangements of moves, given a specific set of tokens in each space = 232!/(2!a2 * 3!a3 * 4!a4 * 5!a5)

Probability of a specific sequence of moves occurring = 1/365232

Thus, P(a0,a1,a2,a3,a4,a5) = (365!/a0!a1!a2!a3!a4!a5!) * (232!/(2!a2 * 3!a3 * 4!a4 * 5!a5)) / 365232

Add up the P values for each of the possible solutions for a0...a5, and that gives you the probability that no token has moved beyond space 5, i.e., that no more than 5 people share any birthday. Subtract that from 1 to get the probability that at least 6 people share at least one birthday.

I'll see if I can work that out...
posted by DevilsAdvocate at 6:56 PM on July 14, 2010 [1 favorite]


Slight correction to my first comment:

For example, P(359 360, 4, 1) is the probability that given six people, two of them will share a birthday, and the other four all have different birthdays (from those two and from each other).
posted by DevilsAdvocate at 6:59 PM on July 14, 2010


Best answer: Like lots of hard probability problems, it's easier to simulate.

> x<-round(runif(n=232*10000)*364)
> x<-matrix(x,nrow=10000)
> y<-apply(x,1,sort)
> z<-apply(y,2,function(x){max(rle(x)$lengths)})
> table(z)
z

3 4 5 6 7 8
2156 6163 1485 185 10 1


So in 196/10000 cases (about 2%) you had 6 or more friends with the same birthday.
posted by a robot made out of meat at 7:21 PM on July 14, 2010 [3 favorites]


So, list all possible distributions of tokens such that no token is beyond space five. I.e., list all solutions, in non-negative integers, of a0, a1... a5 given the restrictions:
a0+a1+a2+a3+a4+a5=365
a1+2*a2+3*a3+4*a4+5*a5=232


If I've done my work right, there are 1,141,886 solutions to these equations.
posted by DevilsAdvocate at 7:25 PM on July 14, 2010 [1 favorite]


Oh, I forgot, 365! is on the order of 10778, so I need to either work with something that can handle numbers that big, or else use approximations such as Stirling's approximation...
posted by DevilsAdvocate at 7:42 PM on July 14, 2010


Best answer: This program does a simulation of the problem as I understand it.
Written in Python, requires Numpy (took way too long to run without it).
import random, numpy, numpy.random
runs=100000
friends=232
days=numpy.arange(366)
def bday_sim():
    # Array of 232 random numbers
    bdays = numpy.random.randint(1, 366, friends)
    # For each day of the year, # of friends with that birthday
    counts = numpy.histogram(bdays, days)[0]
    if counts.max() >= 6: return 1.0
    else: return 0.0

sum=0
for i in range(runs): sum += bday_sim()

print sum / float(runs)
Someone please check if I made any glaring mistakes in that code. I know it doesn't handle leap years.

Gives me an answer of 0.02 (2% chance of this happening). By contrast, if you increase to 500 friends, there's a 67% chance of this happening.
posted by miyabo at 7:47 PM on July 14, 2010 [1 favorite]


I think you could estimate this with the Poisson distribution (at least, my answers match the simulation answers of 'a robot made out of meat' ! )...

http://en.wikipedia.org/wiki/Poisson_distribution

On any average day, you would expect 232/365 = 0.6356 birthdays (ie lambda = 0.6356).

p(6 on the same day) = 365 * 0.6356^6 * exp( - 0.6356) / 6! = 0.01770461
p(7 on the same day) = 365 * 0.6356^7 * exp( - 0.6356) / 7! = 0.001607620
p(8 on the same day) = 365 * 0.6356^8 * exp( - 0.6356) / 8! = 0.0001277287

etc... if you sum up these probabilities, there is about a 2% chance that, of 232 friends, 6 or more would have the same birthday
posted by JumpW at 7:48 PM on July 14, 2010 [3 favorites]


Oh good, three of us have estimated 2% so far! What could be the probability that we're all wrong?
posted by JumpW at 7:51 PM on July 14, 2010 [1 favorite]


I worked through my "exact" solution as described (well, exact as you can get considering the possibility of roundoff error when adding over a million numbers) and got ~0.018470917, just under 2%.

For the curious, the single most probable distribution was 1 date shared by 4 people, 8 dates shared by 3 people each, 40 dates shared by 2 people each, and 124 people with unique birthdays, with a probability of 0.004728677.
posted by DevilsAdvocate at 8:22 PM on July 14, 2010 [2 favorites]


Silly me. No need to use Stirling's approximation when I'm working in logarithms anyway up to the point where I get ln(P). It's easy enough to calculate ln(N!) directly for N up to 365.

So, a slight revision to my earlier answer: I now get 0.018318139 for the probability, and 0.004729403 for the most likely single distribution.
posted by DevilsAdvocate at 9:03 PM on July 14, 2010


it's actually much easier than all that...

for any given group of 6 people, they have a 1/(365^5) chance of all having the same birthday...

there are 232C6 combinations of 6 people possibe from a pool of 232, which is 232!/227!/6! or alternatively (232*231*...*228*227)/6!

so the odds of there being at least one group of 6 people sharing a birthday from a pool of 232 is

(232*231*...*228*227)/(365^5 * 6!) = 0.031320390532716

so a touch over 3%
posted by russm at 2:48 AM on July 15, 2010


so the odds of there being at least one group of 6 people sharing a birthday from a pool of 232 is

(232*231*...*228*227)/(365^5 * 6!) = 0.031320390532716


This would be true only if the possibilities of any two groups of six people (even overlapping groups) having the same birthday were mutually exclusive.

In other words, yes, it's true that the probability of A, B, C, D, E, and F having the same birthday is 1/3655. It's also true that the probability of G, H, I, J, K, and L having the same birthday is 1/3655. It is not true that the probability of at least one of the two groups (A, B, C, D, E, and F) and (G, H, I, J, K, and L) all having the same birthday is 2/3655.

It is depressingly common in complex AskMes about probability for people to assume P(A or B)=P(A)+P(B) when A and B are not mutually exclusive events; or that P(A and B)=P(A)*P(B), or P(A or B)=1-(1-P(A))(1-P(B)), when A and B are not independent events.
posted by DevilsAdvocate at 5:17 AM on July 15, 2010 [1 favorite]


Response by poster: Wow, thanks everybody - my 11-year-old memory of my stats class had me thinking it was far simpler than this. Apparently I was way, way off! Thanks!
posted by um_maverick at 5:47 AM on July 15, 2010


DevilsAdvocate: What are you talking about? Those groups are completely different, so the events are independent. Did you mean to have the groups overlap?

Also (not directed at DA, just at the general public), the words "odds" and "probability" are not interchangeable. In a casual conversation, say what you will, but when discussing a problem like this, we should be more careful.
posted by King Bee at 7:44 AM on July 15, 2010


What are you talking about? Those groups are completely different, so the events are independent.

Yes, in the example I gave (two disjoint groups of six people each) they are independent. russm was not treating them as independent (i.e., P(A or B) = 1-(1-P(A))(1-P(B)); he was treating them as mutually exclusive (i.e., P(A or B) = P(A)+P(B)).

russm's logic was: There are G=232!/226!6!=202904412172 distinct groups of 6 people. The probability of any single group of six people having the same birthday is P=1/3655. Therefore, the probability of any six people sharing the same birthday is G*P.

russm's conclusion only follows if each of the G groups of six people sharing the same birthday are mutually exclusive (not independent) events. The fallacy is more easily seen with a simpler situation: given 50 people, what is the probability of any two sharing a birthday? If we followed russm's logic, we find G=50!/48!2! = 1225 distinct pairs of people, each having a 1/365 chance of sharing their birthdays, and the probability of any two people sharing a birthday is 1225/365≈3.356.

Had russm assumed the events in question were independent rather than mutually exclusive, he would have used the calculation 1-[(1-1/3655)202904412172] ≈ 0.030834987966224, slightly lower than his earlier answer, but still too high by over 50%.

As you correctly note, two groups of six people having all six birthdays in common are independent events if the two groups are entirely disjoint (or even if they have just one member in common), but they are not independent if the groups have two or more members in common. E.g., the probability that at least one of the groups (A, B, C, D, E, F) and (B, C, D, E, F, G) has all its members' birthdays in common is not 1-(1-1/3655)2. Thus it is incorrect to calculate the probability of at least 6 out of 232 people sharing a birthday as if we were considering 202904412172 independent events (not what russm did), and also wrong to calculate it as if we were considering 202904412172 mutually exclusive events (what russm did).

Good point about "odds." Given that the OP asked for the odds, they are about 53.59:1 against.
posted by DevilsAdvocate at 8:22 AM on July 15, 2010


good point... and having turned away from attempting an analytic solution using my 15 year old engineering stats and instead gone for a simulation I see the answer is ~0.0182, so I'll just shut up now...
posted by russm at 8:54 AM on July 15, 2010


Ah, I knew I had to have misunderstood your example, DA. That's what I get for waking up and immediately trying to do mathematics.
posted by King Bee at 10:09 AM on July 15, 2010


I love these sorts of problems. I think russm was onto something, but didn't quite get all the way there.

There are C(232, 6) ways to choose 6 people from 232. The remaining 226 people must then be distributed across the remaining 364 days, which can be done in 364226 ways. The size of the sample space is 365232. So we get

C(232, 6) * 364226 * 365-232 = 4.61593454425e-5

This is the probability of all six having the same particular birthday (say July 17) in common. If we don't care about the day we multiply by 365 to get 0.01684, or about 1.7%.
posted by lex mercatoria at 11:35 AM on July 15, 2010


There are C(232, 6) ways to choose 6 people from 232. The remaining 226 people must then be distributed across the remaining 364 days, which can be done in 364226 ways.

First, that would give you the ways in which exactly 6 people shared a specific birthdate, e.g., that exactly 6 people were born on July 7. I (and apparently most others) took the OP's question to be about the probability that at least 6 people shared a birthdate. Granted, the OP didn't make explicit which he meant, so if that's the question you're answering, that's fine as far as it goes.

But:

This is the probability of all six having the same particular birthday (say July 17) in common. If we don't care about the day we multiply by 365 to get 0.01684, or about 1.7%.

You have made the same mistake russm did: the events are not mutually exclusive. "Six people have a birthday on July 17" does not exclude the possibility "six people have a birthday on January 3," thus you cannot simply add the probabilities to get the probability of at least one happening, either. (They're also not independent events.)
posted by DevilsAdvocate at 12:10 PM on July 15, 2010


« Older Getchyer Hot Donuts! Hot Donuts Here!   |   Where should I move? Newer »
This thread is closed to new comments.