Distinguishing Contrived Numbers
August 7, 2010 10:10 AM

A few years ago, I read that there is a mathematical method for discovering whether numerical data is contrived or random. There are interesting real-world uses for this, e.g. the IRS was considering it to flag returns where the amounts (e.g. listed deductions such as auto mileage) seemed contrived. I'm not explaining it very well (if I could explain it better, I'd have been able to Google it!), but I'm wondering if anyone can pin this down for me.
posted by Quisp Lover to Science & Nature (29 answers total) 20 users marked this as a favorite
 
Best answer: Benford's Law.
posted by dws at 10:12 AM on August 7, 2010 [2 favorites]


Best answer: Benford's Law.
posted by hjd at 10:13 AM on August 7, 2010


Response by poster: Thanks.

Was that perfect synchronicity a meta joke? ;)
posted by Quisp Lover at 10:15 AM on August 7, 2010


Response by poster: Wonder what's on that wikipedia page that consistently crashes Safari?
posted by Quisp Lover at 10:18 AM on August 7, 2010


By the way, I couldn't remember the name of the law off the top of my head either. In case you're curious about what Google-fu to use, I Googled number fraud random to find it.
posted by hjd at 10:21 AM on August 7, 2010


Quisp Lover -- Nothing on that Safari page crashes my Safari. (5.0.1 on SL)
posted by Brian Puccio at 10:28 AM on August 7, 2010


A bit more esoteric, but Zipf's Law is also used in fraud detection. It relates to Benford's Law.
posted by dws at 10:30 AM on August 7, 2010


I read the page. I understand the concept at a 20,000 ft level. Can anyone explain it in words using an example? This is really interesting.
posted by JohnnyGunn at 10:32 AM on August 7, 2010


JohnnyGunn: Hal Varian's article talks about its application in fraud investigations:
"Benford's law is quite counterintuitive; people do not naturally assume that some digits occur more frequently. None of the check amounts was duplicated; there were no round numbers; and all the amounts included cents. However, subconsciously, the manager repeated some digits and digit combinations."
You could also relate it to, say, the fact that when people are given a 1-10 scale to rate things subjectively (e.g. movie reviews) they'll lean towards the top of the scale.
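
To make that concrete, here is a rough sketch (in Python, with made-up amounts) of the kind of screening the quote describes: look for duplicated amounts, for round figures, and for digit pairs that keep turning up.

    from collections import Counter

    def screen_amounts(amounts):
        """Crude red-flag screen for a list of payment amounts (illustrative only)."""
        dupes = [amt for amt, n in Counter(amounts).items() if n > 1]
        round_figures = [a for a in amounts if a == int(a)]
        pair_counts = Counter()
        for a in amounts:
            digits = "".join(ch for ch in f"{a:.2f}" if ch.isdigit())
            pair_counts.update(digits[i:i + 2] for i in range(len(digits) - 1))
        return {"duplicates": dupes,
                "round_numbers": round_figures,
                "common_digit_pairs": pair_counts.most_common(5)}

    print(screen_amounts([1892.45, 7341.92, 9254.87, 1892.45, 4100.00]))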
posted by holgate at 10:39 AM on August 7, 2010


JohnnyGunn, think of the numbers 1 through 100.

How many of those numbers contain the digit 1?

1,10,11,12,13,14,15,16,17,18,19,21,31,41,51,61,71,81,91: there are 19 numbers that contain the digit 1, and the digit 1 appears 20 times (two times in the case of 11).

Thus in the sequence 1 through 100, the digit 1 appears more than any other digit.
posted by dfriedman at 10:47 AM on August 7, 2010


Compare to the number of 2s which appear in the same sequence: 2, 12, 22, 32, 42, 52, 62, 72, 82, 92: the digit 2 appears 11 times, across 10 separate numbers, or roughly half as many times as the digit 1.
posted by dfriedman at 10:50 AM on August 7, 2010


Be sure to understand where the law doesn't apply. The "Limitations" section in the Wikipedia article gives a brief account, maybe too brief.
posted by Maximian at 10:50 AM on August 7, 2010


dfriedman, maybe I am missing something, but you forgot 20, 21, 23, 24, 25, 26, 27, 28 and 29.
posted by milarepa at 10:53 AM on August 7, 2010


"1,10,11,12,13,14,15,16,17,18,19,21,31,41,51,61,71,81,91: there are 19 numbers that contain the digit 1, and the digit 1 appears 20 times (two times in the case of 11).

Thus in the sequence 1 through 100, the digit 1 appears more than any other digit."


The rule is about the first digit of a number, not any digit anywhere in the number. Your example breaks down if you do the same count for numbers containing 9s:

9,19,29,39,49,59,69,79,89,90,91,92,93,94,95,96,97,98,99
posted by EndsOfInvention at 10:53 AM on August 7, 2010


"Can anyone explain it in words using an example? This is really interesting."

Consider something like interest or population that grows by a fixed percentage. Here's a very simple example: start with 100 and grow at 5% each time. Notice that it stays in the 100s for the first fifteen or so iterations, but by the second fifteen it's already in the 400s. And we should expect this, because 5 percent of 100 is 5, while 5 percent of 400 is 20, so the numbers increase faster through the 400s than through the 100s, which means they spend less time hanging around in the 400s than in the 100s. Now you might think that this ever-increasing rate means no particular range would be favored, but because 1000 is ten times 100, everything starts over again each time you reach a new power of ten. In other words, all that fast growth builds up and then gets cancelled out again when the counter rolls over to the next number of digits.

This kind of exponential growth corresponds to drawing a straight line on a logarithmic scale, which is why the Wikipedia article starts off with an illustration of a log scale and a caption talking about picking any point on that scale.
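
A quick sketch of that series in Python (200 steps is an arbitrary choice, just enough to climb through several powers of ten): tally the leading digits and the 1s come out on top in roughly the Benford proportions.

    from collections import Counter

    value = 100.0
    leading = Counter()
    for _ in range(200):             # enough steps to cross several powers of ten
        leading[str(value)[0]] += 1  # tally the leading digit of the current value
        value *= 1.05                # grow by 5%, as in the example above

    for digit in "123456789":
        print(digit, f"{leading[digit] / 200:.0%}")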
posted by Rhomboid at 10:56 AM on August 7, 2010


Ah, dfriedman, that's not Benford's law. The law applies only to the leading digit, not just any old occurrence of a 1.

This is a hard thing to describe to the layman since it involves logarithms, bases, and probability.
posted by chairface at 11:01 AM on August 7, 2010


Radiolab did a great math-themed episode that covered this.
posted by O9scar at 11:05 AM on August 7, 2010 [2 favorites]


Or to put it another way, while a number is three digits wide it only takes an increase of 100 to get a new first digit, but as soon as it hits four digits it takes a change of 1000 to get a new first digit, and so on.
posted by Rhomboid at 11:07 AM on August 7, 2010


Yeah I totally screwed up that explanation.
posted by dfriedman at 11:11 AM on August 7, 2010


I first heard about this from an NPR Radiolab broadcast last year. It talks first about Benford's discovery and then has an interview with a forensic accountant who uses Benford's law to check for fraud. The amateur auditor inside me found it fascinating.
posted by saffry at 11:15 AM on August 7, 2010


Thank you for the explanations. But, how do I apply it to catch fake random numbers? The wiki article said it was used to accuse Iran of fake election data. How so?
posted by JohnnyGunn at 11:44 AM on August 7, 2010


There are other methods as well - for example, many phenomena follow a normal distribution. If you have a process that should be generating normally distributed data, and it is not, then you again have reason to suspect fabrication. This applies to anything you know "should" happen with the data.
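
For example, here's a minimal sketch in Python using SciPy's built-in normality test; both data sets are invented purely for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    genuine = rng.normal(loc=50, scale=10, size=500)     # measurements that really are normal
    fabricated = rng.uniform(low=20, high=80, size=500)  # "random-looking" values someone typed in

    for label, data in [("genuine", genuine), ("fabricated", fabricated)]:
        stat, p = stats.normaltest(data)                 # D'Agostino-Pearson test of normality
        print(f"{label}: p = {p:.4f} ({'plausibly normal' if p > 0.05 else 'suspicious'})")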

To see some interesting implementations of this sort of thing, check out the OKCupid blog. They often apply analyses of their data to interesting effect. The current article is devoted to finding the lies people tell by examining data!

The politics blog FiveThirtyEight also does this with polling data and such. They have actually exposed some pollsters that were making up their data or doing other unscrupulous manipulations.
posted by Earl the Polliwog at 11:50 AM on August 7, 2010


JohnnyGunn: in a series of figures measuring a real-world quantity (votes, dollars, etc.), the probability of the first digit being a 1 is about 30%. Assuming you have at least a couple dozen quantities to measure, just compare how many first-digit 1s there are to the 7s, 8s and 9s. Under Benford's law the 1s should show up roughly twice as often as the 7s, 8s and 9s combined (about 30% versus 15%); if the 1s don't clearly dominate like that, the data is probably bogus.
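
A rough sketch of that check in Python; the expected shares come straight from Benford's law, and the sample amounts are made up:

    import math
    from collections import Counter

    def benford_check(amounts):
        """Compare the leading-digit distribution of amounts to Benford's law."""
        leading = Counter(str(a).lstrip("0.")[0] for a in amounts)
        total = sum(leading.values())
        for d in range(1, 10):
            expected = math.log10(1 + 1 / d)      # Benford: P(first digit = d)
            observed = leading[str(d)] / total
            print(f"{d}: expected {expected:.1%}, observed {observed:.1%}")

    benford_check([1843.20, 112.50, 1290.00, 74.31, 159.99, 918.00, 1022.75, 36.40])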
posted by seanmpuckett at 11:56 AM on August 7, 2010


Also, when people make up data and try to be random, they tend to pick 5s and 6s much more often than 1s and 2s. Numbers in the middle seem more "random" to people. This is another trick that can be used to catch fakers.
posted by Earl the Polliwog at 12:03 PM on August 7, 2010


They also try to avoid picking the same number two or more times in a row because they think that doesn't look random, whereas genuinely random sequences produce such repeats fairly often.
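
A tiny sketch of that check in Python: in genuinely uniform random digits about 1 in 10 adjacent pairs should match, while a made-up sequence that never repeats stands out.

    import random

    def repeat_rate(digits):
        """Fraction of adjacent positions where the same digit appears twice in a row."""
        pairs = list(zip(digits, digits[1:]))
        return sum(a == b for a, b in pairs) / len(pairs)

    truly_random = [random.randrange(10) for _ in range(10_000)]
    print(f"uniform random digits: {repeat_rate(truly_random):.3f}")   # close to 0.100

    human_looking = "3852961740372594816205731"   # invented "random" digits that never repeat
    print(f"human-looking digits:  {repeat_rate(human_looking):.3f}")  # 0.000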
posted by Rhomboid at 12:17 PM on August 7, 2010


Benford's law is about first digits. The claims of fraud in the Iranian election were made based on LAST digits, which should indeed be distributed uniformly in random data. Here's a blog post by Andrew Gelman about it, which links to the paper by Beber and Scacco which analyzes the Iranian election.
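
For what it's worth, a bare-bones sketch of that last-digit check in Python; the vote totals are invented, and the real Beber and Scacco analysis is considerably more careful (and uses far more figures) than this:

    from collections import Counter
    from scipy.stats import chisquare

    def last_digit_test(counts):
        """Chi-square test that the last digits of counts are uniform on 0-9."""
        last = Counter(str(n)[-1] for n in counts)
        observed = [last[str(d)] for d in range(10)]
        return chisquare(observed)   # expected frequencies default to uniform

    fake_votes = [41235, 38125, 50455, 27615, 33395, 61285, 45505, 29875, 52165, 38445]
    print(last_digit_test(fake_votes))   # every total ends in 5, so the p-value is tiny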
posted by escabeche at 12:43 PM on August 7, 2010


A minor caveat to add to the various Benford's law comments: there's actually some evidence that (in some contexts at least) people do intuitively follow Benford's law (see this paper, for instance). So the story is more complicated than it sounds. However, there are a lot of other systematic violations of randomness that you see in human judgements (this article is a great summary, but unless you've got a subscription to Psych Review you'll probably have to take my word for it). The best explanation for it that I've heard is that people are trying to generate numbers that provide the most evidence for a random generating process, which isn't the same thing as generating random numbers (see here for a mildly technical explanation). Personally, I'd be tempted to use as many of the different checks as possible (Benford's law, too many alternations, etc), because there's no one "silver bullet" that is guaranteed to pick out human-generated randomness from actual randomness.
posted by mixing at 4:45 PM on August 7, 2010


Thank you all for your help. Special thanks to Quisp Lover (I am a Quake fan myself) for asking the question. I think I get it now.
posted by JohnnyGunn at 5:50 PM on August 7, 2010


Response by poster: "Nothing on that Safari page crashes my Safari. (5.0.1 on SL)"

I'm still on Safari 4.x, and it just hates that Wikipedia page! Weird... I opened it in Firefox and see nothing unusual in the source.
posted by Quisp Lover at 9:14 AM on August 8, 2010


This thread is closed to new comments.