October 27, 2012 1:17 PM

Statisticsfilter: Given available information about the distribution of self-selected 4-digit passwords (specifically banking PINs), is it possible to calculate the probability of two randomly selected individuals having the same PIN? If so, what're the odds?

What I'd like to know is, if you had the sample set used here (a compelling analysis of leaked 4-digit passwords), could you calculate the probability of two random users in that dataset sharing a PIN? Given the information visualized there, could you? (Turn the infographic back into numbers, do some math magic?)

If not, what data *would* you need in order to approximate a probability? Or if so, what are the odds?

Other perhaps-relevant thing I found while googling around: Probability of the same PIN digits, author assumes uniform distribution (I know that's an actual probability/statistics term, but that's all I know).
posted by myrrh to Computers & Internet (15 answers total) 2 users marked this as a favorite

An interesting question! Pick two people at random; the probability that they both have the password 1234 is (0.10713)(0.10713) or about 0.0115. Compute numbers like this for all 10,000 possible passwords, add them up, and you get the result.

So you get a sum with ten thousand terms:

(0.10713)(0.10713) + (0.06016)(0.06016) + (0.01881)(0.01881) + ... + (0.00000744)(0.00000744)

where I'm working from the "twenty most popular passwords" and "twenty least popular passwords" table. The problem is that from the tables I can only get the first twenty and last twenty terms of this sum!

Given the raw data, then, this would be easy. But the post says he won't give out the raw data. I may give some thought to how to get this out of the infographic.
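To make the "add up the squares" step concrete, here is a minimal Python sketch. The function name `match_probability` is made up for illustration, and since the raw data isn't available the only real inputs are the three frequencies quoted above; the uniform case is included just as a sanity check.

```python
def match_probability(probs):
    """Probability that two independently drawn people share a PIN.

    probs: iterable of per-PIN probabilities (ideally summing to 1).
    """
    return sum(p * p for p in probs)


# Sanity check: if all 10,000 PINs were equally likely, the answer is 1/10,000.
uniform = [1 / 10_000] * 10_000
uniform_match = match_probability(uniform)

# Partial sum using only the three frequencies quoted in the comment above.
# This is a lower bound on the full answer, not the answer itself.
top_three = [0.10713, 0.06016, 0.01881]
partial_match = match_probability(top_three)

print(f"uniform PINs: {uniform_match:.6f}")
print(f"first three terms only: {partial_match:.6f}")
```

With the full table of 10,000 frequencies you would pass the whole list in and be done.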

posted by madcaptenor at 1:27 PM on October 27, 2012 [2 favorites]

The main problem with this analysis is that it neglects a significant fact: people often choose weak passwords for websites they don't care about (which are a big source of leaked passwords), while they would not use 1234 as their actual banking PIN.

So even if you had the full data you'd need a lot more data to know how well your conclusions could be transferred from throw-away website passwords to ATM PINs.

posted by aubilenon at 1:31 PM on October 27, 2012 [1 favorite]

I think a good many PINs are randomly assigned. It's been a while since I last got one, but I'm fairly certain I didn't pick it when I did.

In addition to what aubilenon said, it's worth bearing in mind that the sort of people who think 4-digit numbers are acceptable for website passwords are also likely to be the same sort of people who think 1234 is a good PIN, so the dataset is unlikely to match the entire population.

posted by purplemonkeydishwasher at 2:00 PM on October 27, 2012

Alright, here's a longer version of the analysis:

- the proportion of people using the most commonly used 10000x PINs (that is, the top fraction x of all 10,000 PINs), collectively, appears to be about x^(0.2154). (To get this I just grabbed some points on the curve, played around with various transformations of the variables until they fell on a straight line, and then did a linear regression of log(y) against log(x), fixing the intercept at 0.) For example, if you plug in x = 0.5 you get 0.8613; the most commonly used half of the PINs represent about 86% of all PINs, which agrees with the graph. Similarly, if you plug in x = 0.0426 you get 0.5067, roughly agreeing with the statement from the post: "The 50% cumulative chance threshold is passed at just 426 codes (far less than the 5,000 that a random uniformly distribution would predict). Paranoid yet?"

- the probability that an individual is using one of the n most popular PINs is therefore (about) (n/10000)^(0.2154);

- the probability that an individual is using the *n*th most popular PIN is therefore (about) p(n) = (n/10000)^(0.2154) - ((n-1)/10000)^(0.2154). (For those who know calculus, you can come up with a nice approximation for this...);

- the probability that two randomly chosen individuals are *both* using that PIN is therefore p(n)^2;

- the sum p(21)^2 + p(22)^2 + ... + p(10000)^2 = 0.00027; this is the probability that two randomly chosen people will be using the same PIN and that it's not one of the twenty most common. (Why start at 21? Because we have the frequency of the twenty most frequent PINs, so there's no need for approximation.) (Like calculus? Approximate this by an integral.)

- this is actually a small correction to the main term, which is just (0.1071)(0.1071) + ... + (0.0029)(0.0029) = 0.01593, the sum of the squares of the probabilities of the twenty most frequent PINs. Not surprisingly, if two people are both using the same PIN it's very likely to be one of the most common. (In fact if two people both have the same PIN, there's about a 70% chance it's 1234!)

**My best guess of the answer to your question is therefore 0.01593 + 0.00027 = 0.01620.**
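For anyone who wants to check the arithmetic, here is a rough Python sketch of the steps above. It assumes the fitted exponent 0.2154 and takes the top-twenty figure 0.01593 and the 1234 frequency 0.10713 as given from the comments; the function and variable names are made up for illustration. It also includes the integral approximation hinted at in the list, using p(n) ≈ (a/N)(n/N)^(a-1).

```python
N = 10_000          # number of possible 4-digit PINs
A = 0.2154          # exponent from the eyeballed power-law fit (an estimate)
TOP20_SQUARES = 0.01593   # sum of squared top-twenty frequencies, as quoted
P_1234 = 0.10713          # frequency of 1234, as quoted


def cumulative(n):
    """Fitted share of people using one of the n most popular PINs."""
    return (n / N) ** A


def p(n):
    """Fitted probability of the n-th most popular PIN."""
    return cumulative(n) - cumulative(n - 1)


# Discrete tail: both people share a PIN ranked 21st or lower.
tail_sum = sum(p(n) ** 2 for n in range(21, N + 1))

# Calculus version: integrate (A/N)^2 * (n/N)^(2A-2) dn from 20 to N.
k = 2 * A - 1
tail_integral = (A ** 2) * N ** (-2 * A) * (N ** k - 20 ** k) / k

total = TOP20_SQUARES + tail_sum
share_1234 = P_1234 ** 2 / total   # chance a matching pair is on 1234

print(f"tail (sum):      {tail_sum:.5f}")
print(f"tail (integral): {tail_integral:.5f}")
print(f"total:           {total:.5f}")
print(f"share on 1234:   {share_1234:.2f}")
```

The discrete sum and the integral should agree closely, which is the point of the calculus aside: the tail is small either way, so the fit doesn't need to be very good.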

(I used to teach probability and statistics. I don't teach any more. If I did, I would not have answered this, and instead found a way to get my students to do it, because it's a nice problem.)

posted by madcaptenor at 2:02 PM on October 27, 2012 [6 favorites]

To add to madcaptenor's comment, one crude way of approximating the data is to use the exact data to calculate the first 20 terms, then distribute the remainder of the data into uniform bins of probability based on estimates of the slopes in different segments of the cumulative incidence plot provided.

About 26.83% of the distribution is accounted for by the first 20 terms. Then between 26.83% and 33% are accounted for by the next 41 terms. Then between 33% and 50% is an additional 365 terms. By looking at the graph, between 50-60% is another 574 terms (total 1000). From there, the cumulative frequency looks approximately linear. So for the last 40%, assume an even distribution of 9000 more terms.

Now the sum is:

0.1071^2+0.0602^2+0.0188^2+0.0120^2+...+0.0029^2+41*(0.0015049^2)+365*(0.00046575^2)+574*(0.00017422^2)+9000*(0.00004444^2).

This would be my rough estimate. The answer is approximately 1.613% that two random people would have the same password. I'm guessing this is probably within 0.2% of the exact answer, and almost certainly an underestimate. If the PINs were chosen uniformly at random, the answer would be 0.01%. I suppose if all you had was that link, you could fit their plot better and get a better answer.
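Here is a small Python sketch of the binned estimate, for anyone who wants to tweak the bin edges. The bin boundaries and counts are the ones eyeballed above; the top-twenty term is not recomputed but taken as 0.01593 from madcaptenor's comment, which is an assumption on my part, and the names are made up for illustration.

```python
TOP20_SQUARES = 0.01593  # sum of squared top-twenty frequencies (borrowed figure)

# (cumulative share at start, cumulative share at end, number of PINs in bin)
BINS = [
    (0.2683, 0.33, 41),
    (0.33, 0.50, 365),
    (0.50, 0.60, 574),
    (0.60, 1.00, 9000),
]


def binned_tail(bins):
    """Sum of squared probabilities, treating each bin as uniform inside."""
    total = 0.0
    for start, end, count in bins:
        per_pin = (end - start) / count   # probability of each PIN in the bin
        total += count * per_pin ** 2
    return total


tail = binned_tail(BINS)
estimate = TOP20_SQUARES + tail

print(f"tail contribution: {tail:.6f}")
print(f"estimate:          {estimate:.5f}")
```

Treating a bin as uniform can only lower its sum of squares relative to the true uneven frequencies, which is why the estimate is described as an underestimate.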

Hopefully I didn't screw up that math. Of course as noted, the dataset is probably not generalizable.

posted by drpynchon at 2:03 PM on October 27, 2012 [1 favorite]

Or, you know, what Dr. Pynchon said. But I felt like doing the regression, because I can.

posted by madcaptenor at 2:03 PM on October 27, 2012

Basically, because the distribution is so concentrated, most of the time if a pair of people have the same PIN it's from the part of the distribution that's given in the table. drpynchon and I just had slightly different approaches to figuring out how much comes from all the PINs outside the twenty most common, but we both agree that it's a small correction.

posted by madcaptenor at 2:07 PM on October 27, 2012

Right. Also, I meant 0.02% for my eyeball guess at it, so the answer is probably somewhere between 1.613% and 1.633%.

posted by drpynchon at 2:17 PM on October 27, 2012

If I had the actual data-set, it would be a five-minute job with a bootstrap statistical technique: just write a quick script that reads the PIN/frequency table, then repeatedly picks random PIN pairs weighted according to the table. Do it a million times, keep count of the number of times you get matching PINs, divide by a million, there's your answer. Then I'd probably get the same answer as drpynchon and madcaptenor, but without the inconvenience of having to engage my brain.
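A quick sketch of that kind of script in Python, which is essentially a Monte Carlo simulation. The real PIN/frequency table isn't available, so `toy_table` below is entirely made up to show the mechanics, and `simulate_match_rate` is a hypothetical name.

```python
import random


def simulate_match_rate(freqs, trials=1_000_000, seed=None):
    """Estimate the chance that two randomly drawn people share a PIN.

    freqs: dict mapping PIN string -> relative frequency (weights need
           not sum to 1; random.choices normalises them).
    """
    rng = random.Random(seed)
    pins = list(freqs)
    weights = [freqs[pin] for pin in pins]
    # Draw both members of every pair up front, then count the matches.
    first = rng.choices(pins, weights=weights, k=trials)
    second = rng.choices(pins, weights=weights, k=trials)
    matches = sum(a == b for a, b in zip(first, second))
    return matches / trials


# Made-up three-PIN table, just to exercise the function.
toy_table = {"1234": 0.5, "0000": 0.3, "1111": 0.2}
estimate = simulate_match_rate(toy_table, trials=200_000, seed=1)
exact = sum(w * w for w in toy_table.values())

print(f"simulated: {estimate:.4f}, exact sum of squares: {exact:.4f}")
```

With the real table the simulated value should converge on the same sum of squares discussed earlier in the thread; the simulation just trades exactness for not having to think.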

It looks as though the "heatmap" plot in the article might be sufficient to reconstruct the original data, but this would be a fiddly job, and would take more time and effort than actually running the analysis.

posted by pont at 2:20 PM on October 27, 2012 [1 favorite]

I chose my PIN.

When making the assumption that two randomly selected individuals will have the same PIN (let's call that PIN overlap), you could make some explicit characterizations for what I would call common PINs to increase the probability of overlap.

If you make a set of four digit PINs such that

1. The PIN does not have the same number twice in a row

2. The same number does not appear three times or more in the PIN.

3. The first number of the PIN "touches" the second number of the PIN, the second number of the PIN "touches" the third number of the PIN, and the third number of the PIN "touches" the fourth number of the PIN.

I would imagine that users with numbers in that set would have much more PIN overlap than users with arbitrary numbers, just because people are more likely to choose their PIN that way.
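A rough Python sketch of how one might enumerate that set. The comment doesn't pin down what "touches" means, so this assumes a standard phone-style keypad (1-2-3 / 4-5-6 / 7-8-9 with 0 under the 8) and, by default, counts diagonal neighbours as touching; both are assumptions, and the names are made up for illustration.

```python
from itertools import product

# Assumed keypad layout; spaces mark positions with no digit key.
KEYPAD = ["123", "456", "789", " 0 "]
POSITIONS = {
    ch: (r, c)
    for r, row in enumerate(KEYPAD)
    for c, ch in enumerate(row)
    if ch != " "
}


def touches(a, b, diagonal=True):
    """True if keys a and b are neighbours (a key never touches itself)."""
    (r1, c1), (r2, c2) = POSITIONS[a], POSITIONS[b]
    dr, dc = abs(r1 - r2), abs(c1 - c2)
    if dr == 0 and dc == 0:
        return False
    return max(dr, dc) == 1 if diagonal else dr + dc == 1


def adjacent_pins(diagonal=True):
    """All 4-digit PINs satisfying the three rules listed above."""
    pins = []
    for digits in product("0123456789", repeat=4):
        # Rule 3 (each digit touches the next) also implies rule 1.
        if not all(touches(a, b, diagonal) for a, b in zip(digits, digits[1:])):
            continue
        # Rule 2: no digit appears three or more times.
        if max(digits.count(d) for d in digits) >= 3:
            continue
        pins.append("".join(digits))
    return pins


pins_with_diagonals = set(adjacent_pins(diagonal=True))
pins_orthogonal = set(adjacent_pins(diagonal=False))
print(len(pins_with_diagonals), len(pins_orthogonal))
```

One side effect worth noting: with only four digits, a PIN that never repeats a digit twice in a row cannot contain the same digit three times, so rule 2 never actually excludes anything here. To test the overlap claim you would still need per-PIN frequencies for this set.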

posted by oceanjesse at 2:46 PM on October 27, 2012

Well, I grabbed the image (actually the small grayscale one) from the site and used Matlab's imread on it (with some fiddling), but now I have to run, so I dumped it into a CSV file for you guys to play around with. Note that he says the values are scaled logarithmically; I didn't make it linear before saving.

Here's the CSV.

posted by supercres at 2:52 PM on October 27, 2012 [1 favorite]

(Clarification: the range on these values is 0-255, and the first element of the last line would correspond to 0000, etc.)

posted by supercres at 2:53 PM on October 27, 2012

This thread is closed to new comments.

posted by myrrh at 1:18 PM on October 27, 2012