About Benford's Law
February 13, 2014 9:17 AM Subscribe
So in many data sets, the leading digit has a 30.1% chance of being 1, with the probabilities decreasing on down the line.
Fine. But what about NON-leading digits? Are those also irregularly distributed in naturally-occurring data sets, or are they just 11.11% chance, as a layman would expect?
The naive guess would be a 10% probability that any particular digit would occur in a non-leading position (because there are ten possible digits: 0, 1, ..., 9).
Wikipedia says that for second digits it is not quite an even 10% probability for each digit, but that once you get to the fourth digit it approaches a uniform distribution where each digit has a 10% chance of occurring.
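That convergence is easy to check empirically. Here's a quick Python sketch (the sampling scheme is an illustration of mine, not something from the thread): if U is uniform on [0, 1), then 10^U has a Benford-distributed significand, so we can sample a large batch and tally digit frequencies at each position.

```python
import random
from collections import Counter

random.seed(42)

# If U is uniform on [0, 1), the significand n = 10**U has
# density proportional to 1/n, i.e. it follows Benford's law.
samples = [10 ** random.random() for _ in range(200_000)]

def digit_at(x, k):
    """Return the k-th significant digit of x (k=1 is the leading digit)."""
    s = f"{x:.10e}"           # e.g. '1.2345678901e+00'
    digits = s[0] + s[2:12]   # significant digits, decimal point dropped
    return int(digits[k - 1])

# Frequencies for the 1st, 2nd, and 4th digits: the 1st is heavily
# skewed toward 1, the 4th is already close to flat 10%.
for k in (1, 2, 4):
    counts = Counter(digit_at(x, k) for x in samples)
    freqs = {d: round(counts[d] / len(samples), 3) for d in sorted(counts)}
    print(f"digit {k}: {freqs}")
```

With 200,000 samples the observed first-digit frequency for 1 lands near 30%, while every fourth-digit frequency sits within sampling noise of 10%.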
posted by JumpW at 9:30 AM on February 13, 2014
Best answer: The principle behind Benford's Law is that, for data sets distributed according to it, if you express the values in scientific notation (n × 10^m, where 1 ≤ n < 10, and m is an integer), the probability density of any n appearing is proportional to 1/n.
So the probability of the first digit being 1, i.e., 1 ≤ n < 2, is (∫[1,2] 1/n dn) / (∫[1,10] 1/n dn)
Since 1/n is larger for smaller values of n, the probability of n being between 1 and 2 is much larger than the 1/9 of the range it makes up.
You can use the same principle to derive the probability of the second digit: the probability that the second digit is one is the probability that n is between 1.1 and 1.2, or between 2.1 and 2.2, or between 3.1 and 3.2 ... or between 9.1 and 9.2. This will still show some preference for smaller digits, since 1/n is larger between 1.1 and 1.2 than it is between 1.2 and 2; and larger between 2.1 and 2.2 than between 2.2 and 3, and so forth. But the effect is much less pronounced. (Also note that the most likely second digit is zero, which is not possible for the first digit.) So the theoretical probability that the second digit is one would be:
(∫[1.1,1.2] 1/n dn + ∫[2.1,2.2] 1/n dn + ∫[3.1,3.2] 1/n dn + ... + ∫[9.1,9.2] 1/n dn) / (∫[1,10] 1/n dn)
The effect becomes less pronounced with each additional digit: for the first digit, you are taking the first 1/9 of the range; for the second, you are taking nine slices, each 1/90 out of the entire range, spaced 1/10 of the range apart; for the third digit, you are taking 90 slices, each 1/900 of the entire range, spaced 1/100 of the range apart, and so forth. In each case, the slices containing the desired digit take up a total of 1/9 (for the first digit) or 1/10 (for any digit after the first) of the total range, but as you go further to the right, the slices become more numerous and more evenly spaced throughout the entire range.
posted by DevilsAdvocate at 9:53 AM on February 13, 2014 [4 favorites]
Fun fact illustrating the 10% convergence: Until at least the early 1970s, New York newspapers used to publish, every day, the "U.S. Daily Treasury Balance" which was the cash on hand at the United States Treasury. This was usually at least an 11-digit number. The reason was that the last three digits, excluding the cents (because not every paper published the cents and at some point the Treasury started rounding to the nearest dollar) would be the "daily number" in the local mob's numbers racket, which is what folks used to gamble on before state lotteries came along. You picked a three-digit number, gave it to the numbers runner with your bet, and if your number "came in" — matched the last three of the Treasury balance — he'd bring you back the payoff at 600:1. So, if mob statisticians had determined that those last three digits were sufficiently random, they must have been pretty close to a 10.0% probability.
(Newspapers were clearly colluding with the mob here, since the daily Treasury balance had no particular usefulness otherwise, but probably mainly on the theory that publishing the number would sell more papers. Here's a good rundown on how the game worked in the 50s.)
posted by beagle at 9:54 AM on February 13, 2014 [8 favorites]
Based on the formulas I gave above, here's the (rounded) theoretical probabilities for the first four digits:
First digit:
1: 30.103%
2: 17.609%
3: 12.494%
4: 9.691%
5: 7.918%
6: 6.695%
7: 5.799%
8: 5.115%
9: 4.576%
Second digit:
0: 11.968%
1: 11.389%
2: 10.882%
3: 10.433%
4: 10.031%
5: 9.668%
6: 9.337%
7: 9.035%
8: 8.757%
9: 8.500%
Third digit:
0: 10.178%
1: 10.138%
2: 10.097%
3: 10.057%
4: 10.018%
5: 9.979%
6: 9.940%
7: 9.902%
8: 9.864%
9: 9.827%
Fourth digit:
0: 10.018%
1: 10.014%
2: 10.010%
3: 10.006%
4: 10.002%
5: 9.998%
6: 9.994%
7: 9.990%
8: 9.986%
9: 9.982%
Unless you have a humongous data set (probably on the order of millions of values), you won't be able to see a statistically significant difference in the fourth digit.
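The slice-counting argument generalizes: the probability that the k-th significant digit is d is the sum of log10(1 + 1/(10p + d)) over every possible (k−1)-digit prefix p. A short Python sketch of that general formula (my own restatement of the derivation above), which reproduces the four tables:

```python
import math

def kth_digit_prob(k, d):
    """P(k-th significant digit = d) under Benford's law.

    For k > 1, sum log10(1 + 1/(10*p + d)) over every valid
    (k-1)-digit prefix p (leading digit nonzero).
    """
    if k == 1:
        return math.log10(1 + 1 / d)  # d = 1..9 only
    lo, hi = 10 ** (k - 2), 10 ** (k - 1)  # e.g. prefixes 1..9 when k = 2
    return sum(math.log10(1 + 1 / (10 * p + d)) for p in range(lo, hi))

for k in range(1, 5):
    first = 1 if k == 1 else 0  # zero can't be a leading digit
    row = {d: round(100 * kth_digit_prob(k, d), 3) for d in range(first, 10)}
    print(f"digit position {k}: {row}")
```

Each row matches the corresponding table: 30.103% for a leading 1, 11.968% for a second-digit 0, and the fourth-digit probabilities all within ±0.02 percentage points of 10%.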
posted by DevilsAdvocate at 10:15 AM on February 13, 2014 [1 favorite]
beagle: "So, if mob statisticians had determined that those last three digits were sufficiently random, they must have been pretty close to a 10.0% probability."
That's... an incredibly naive view on mob statisticians. More likely, if mob statisticians had determined those digits were not completely random, they found a way to hedge their bets.
Put another way - why would the mob want the game to be fair?
posted by IAmBroom at 1:15 PM on February 14, 2014
Here's a Source.
posted by grudgebgon at 9:28 AM on February 13, 2014