Imagine I gave you thousands of invoices and asked you to record the first digit of the amount. Out of say, 10,000 invoices, how many would you expect to start with the number 1, how many with the number 2, and so on? Naively, you might expect 1,111 to start with a 1, 1,111 to start with a 2 and so on. But that’s not what happens in the real world. 1 occurs more often than 2, which occurs more often than 3, and so on.
The Benford’s Law story starts in 1881, when Simon Newcomb, an astronomer, was using some mathematical log tables. For those of you too young to know, these are tables of the logarithms of numbers, very useful in pre-calculator days. Newcomb noticed that the pages for logarithms beginning 1 were more well-thumbed than the other pages, indicating that people were looking for the logarithms of some numbers more than others. Being an academic, he published a paper on it.
In 1938, a physicist called Frank Benford looked at a number of datasets and found the same relationship between the first digits. For example, he looked at the first digit of addresses and found that 1 occurred more frequently than 2, which occurred more frequently than 3 and so on. He didn't just look at addresses, he looked at the first digit of physical constants, the surface area of rivers, and numbers in the Reader's Digest etc. Despite being the second person to discover this relationship, the law is named after him and not Newcomb.
It turns out, we can mathematically describe Benford’s Law as:
P(d) = log(1 + (1/d))
Where d is the numbers 1 to 9 and P(d) is the probability of the number occurring. If we plot it out we get:
This means that for some datasets we expect the first digit to be 1 30.1% of the time, the second digit to be 2 17.6% of the time, 3 to be the first digit 12.5% of the time, etc.The why of Benford’s Law is much too complex for this blog post. It was only recently (1998) proved by Hill [Hill] and involves digging into the central limit theorem and some very fundamental statistical and probability concepts.
Going back to my accounting example, it would seem all we have to do is plot the distribution for our invoice data and compare it to Benford’s Law. If there’s a difference, then there’s fraud. But the reality is, things are more complex than that.
Benford’s Law doesn’t apply everywhere, there are some conditions:
- The data set must vary over several orders of magnitude (e.g. from 1 to 1,000)
- The data set must have dimensions, or units. For example, Euros, or mm.
- The mean is greater than the median and the skew is positive.
Benford’s Law actually applies to all digits, not just the first. We can plot out an expected distribution for two digits as I’ve shown below. This has also been used for fraud detection as you might expect.
To truly understand Benford’s Law, you’ll need to dig deeply into statistics and possibly number theory, but using it is relatively straightforward. You should be aware it exists and know its limitations - especially if you’re looking for fraud.
References[Collins] J. Carlton Collins, “Using Excel and Benford’s Law to detect fraud”, https://www.journalofaccountancy.com/issues/2017/apr/excel-and-benfords-law-to-detect-fraud.html
[Hill] Hill, T. P. "The First Digit Phenomenon." Amer. Sci. 86, 358-363, 1998.
[Linville] “The Problem Of False Negative Results In The Use Of Digit Analysis”, Mark Linville, The Journal of Applied Business Research, Volume 24, Number 1
Further readingWikipedia article https://en.wikipedia.org/wiki/Benford%27s_law
Mathworld article http://mathworld.wolfram.com/BenfordsLaw.html