
Wednesday, March 11, 2020

Benford's Law: finding fraud and data oddities

What links fraud detection, old-fashioned log tables, and error detection in data feeds? Benford’s Law provides the link and I'll show you what it is and how you might use it.

Imagine I gave you thousands of invoices and asked you to record the first digit of the amount on each one. Out of, say, 10,000 invoices, how many would you expect to start with the digit 1, how many with 2, and so on? Naively, you might expect 1,111 to start with a 1; 1,111 to start with a 2; and so on. But that’s not what happens in the real world: 1 occurs more often than 2, which occurs more often than 3, and so on.

The Benford’s Law story starts in 1881, when Simon Newcomb, an astronomer, was using some mathematical log tables. For those of you too young to know, these are tables of the logarithms of numbers, very useful in pre-calculator days. Newcomb noticed that the pages for logarithms beginning 1 were more well-thumbed than the other pages, indicating that people were looking for the logarithms of some numbers more than others. Being an academic, he published a paper on it.

In 1938, a physicist called Frank Benford looked at a number of datasets and found the same relationship between the first digits. For example, he looked at the first digit of addresses and found that 1 occurred more frequently than 2, which occurred more frequently than 3, and so on. He didn't just look at addresses; he also looked at the first digit of physical constants, the surface areas of rivers, and numbers in the Reader's Digest. Despite being the second person to discover this relationship, the law is named after him and not Newcomb.

It turns out, we can mathematically describe Benford’s Law as:

P(d) = log10(1 + 1/d)

Where d is a digit from 1 to 9, P(d) is the probability of d being the first digit, and the logarithm is base 10. If we plot it out, we get:

This means that for some datasets we expect the first digit to be 1 about 30.1% of the time, 2 about 17.6% of the time, 3 about 12.5% of the time, and so on.
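If you want to check these percentages yourself, the formula is easy to evaluate; here's a minimal Python sketch (using base-10 logarithms, as above):

    import math

    # Benford's Law: probability that d is the leading digit.
    for d in range(1, 10):
        print(d, round(math.log10(1 + 1/d), 3))
    # 1 0.301, 2 0.176, 3 0.125, ..., 9 0.046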

The why of Benford’s Law is much too complex for this blog post. It was only recently (1998) proved by Hill [Hill] and involves digging into the central limit theorem and some very fundamental statistical and probability concepts.

Going back to my accounting example, it would seem all we have to do is plot the distribution for our invoice data and compare it to Benford’s Law. If there’s a difference, then there’s fraud. But the reality is, things are more complex than that.

Benford’s Law doesn’t apply everywhere; there are some conditions:

  • The data set must vary over several orders of magnitude (e.g. from 1 to 1,000)
  • The data set must have dimensions, or units. For example, Euros, or mm.
  • The mean is greater than the median and the skew is positive.

Collins provides a nice overview of how it can be used to detect accounting fraud [Collins]. But Linville [Linville] has poked some practical holes in its use. He conducted an experiment using graduate students to create fake test invoices (this was a research exercise, not an attempt at fraud!) that were mixed in with simulated invoice data. He found that if the fake invoices were less than 10% or so of the total dataset, the deviations from Benford’s Law were too small to be reliably detected.

Benford’s Law actually applies to all digits, not just the first. We can plot out an expected distribution for two digits as I’ve shown below. This has also been used for fraud detection as you might expect.
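The two-digit version comes from the same formula, just applied to the leading pair of digits (d from 10 to 99). Here's a short sketch of the expected distribution; this is my own illustration of the calculation, not code from any of the references:

    import math

    # Expected frequencies of the first two digits under Benford's Law.
    two_digit = {d: math.log10(1 + 1/d) for d in range(10, 100)}
    print(round(two_digit[10], 4))   # ~0.0414, the most likely leading pair
    print(round(two_digit[99], 4))   # ~0.0044, the least likely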

You can use Benford's Law to detect errors in incoming data. Let's say you have a data feed of user addresses. You know the house numbers should obey Benford's Law, so you can work out the distribution the data actually has and compare it to the theoretical Benford distribution. If the difference is above some threshold, you can raise an alert. Bear in mind, it's not just addresses that follow the law; other properties of a data feed may too. A deviation from Benford's Law doesn't tell you which particular items are wrong, but it does give you a clue about which category to look at, for example, you might discover items starting with a 2 are too frequent. This is a special case of using the deviation of real data from an expected distribution as an error detection mechanism - a very useful data quality assurance method everyone should be using.
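Here's a minimal sketch of that kind of check. The toy data, the first_digit helper, and the 0.05 threshold are all illustrative assumptions; in practice you'd calibrate the threshold on your own feed, or use a formal goodness-of-fit test:

    import math
    from collections import Counter

    # Expected Benford first-digit frequencies.
    BENFORD = {d: math.log10(1 + 1/d) for d in range(1, 10)}

    def first_digit(value):
        """Return the leading non-zero digit of a positive number (hypothetical helper)."""
        return int(str(value).lstrip("0.")[0])

    def benford_deviation(values):
        """Largest absolute gap between observed and expected first-digit frequencies."""
        digits = [first_digit(v) for v in values if v > 0]
        counts = Counter(digits)
        n = len(digits)
        return max(abs(counts.get(d, 0) / n - p) for d, p in BENFORD.items())

    # Hypothetical usage: alert if the feed drifts too far from Benford's Law.
    house_numbers = [12, 1, 104, 23, 7, 1850, 31, 9, 11, 2]   # toy data
    if benford_deviation(house_numbers) > 0.05:               # threshold is an assumption
        print("Warning: first-digit distribution deviates from Benford's Law")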

To truly understand Benford’s Law, you’ll need to dig deeply into statistics and possibly number theory, but using it is relatively straightforward. You should be aware it exists and know its limitations - especially if you’re looking for fraud.

References

[Collins] J. Carlton Collins, “Using Excel and Benford’s Law to detect fraud”, Journal of Accountancy, https://www.journalofaccountancy.com/issues/2017/apr/excel-and-benfords-law-to-detect-fraud.html
[Hill] T. P. Hill, “The First Digit Phenomenon”, American Scientist, 86, 358-363, 1998.
[Linville] Mark Linville, “The Problem Of False Negative Results In The Use Of Digit Analysis”, The Journal of Applied Business Research, Volume 24, Number 1.

Further reading

Wikipedia article https://en.wikipedia.org/wiki/Benford%27s_law
Mathworld article http://mathworld.wolfram.com/BenfordsLaw.html

Saturday, February 22, 2020

The Monty Hall Problem

Everyone thinks they understand probability, but every so often, something comes along that shows that maybe you don’t actually understand it at all. The Monty Hall problem is a great example of something that seems very counterintuitive and teaches us to be very wary of "common sense".

The problem got its fame from a 1990 column written by Marilyn vos Savant in Parade magazine. She posed the problem and provided the solution, but the solution seemed so counterintuitive that several math professors and many PhDs wrote to her saying she was incorrect. The discussion was so intense, it even reached the pages of the New York Times. But vos Savant was indeed correct.



(Monty Hall left (1976) - image credit: ABC Television - source Wikimedia Commons, no known copyright, Marilyn vos Savant right (2017) - image credit: Nathan Hill via Wikimedia Commons - Creative Commons License.  Note: the reason why the photos are from different years/ages is the availability of open-source images.)

The problem is loosely based on a real person and a real quiz show. In the US, there’s a long-running quiz show called ‘Let’s make a deal’, and its host for many years was Monty Hall, in whose honor the problem is named. Monty Hall was aware of the fame of the problem and had some interesting things to say about it.

Vos Savant posed the Monty Hall problem in this form:

  • A quiz show host shows a contestant three doors. Behind two of them is a goat and behind one of them is a car. The goal is to win the car.
  • The host asks the contestant to choose a door, but not to open it.
  • Once the contestant has chosen a door, the host opens one of the other doors and shows the contestant a goat. The contestant now knows that there’s a goat behind that door, but he or she doesn’t know which of the other two doors the car’s behind.
  • Here’s the key question: the host asks the contestant "do you want to change doors?".
  • Once the contestant has decided whether or not to switch, the host opens the contestant's chosen door and the contestant wins either the car or a goat.
  • Should the contestant change doors when asked by the host? Why?

What do you think the probability of winning is if the contestant does not change doors? What do you think the probability of winning is if they do?

Here are the results.

  • If the contestant sticks with their choice, they have a ⅓ chance of winning.
  • If the contestant changes doors, they have a ⅔ chance of winning.

What?

This is probably not what you expected, so let’s investigate what’s going on.

I’m going to start with a really simple version of the game. The host shows me three doors and asks me to choose one. There’s a ⅓ probability of the car being behind my door and ⅔ probability of the car being behind the other two doors.

Now, let’s add in the host opening one of the other doors I haven’t chosen, showing me a goat, and asking me if I want to change doors. If I don’t change doors, the probability of me winning is ⅓ because I haven’t taken into account the extra information the host has given me.

What happens if I change my strategy? When I made my initial choice of doors, there was a ⅔ probability the car was behind one of the other two doors. That can't change: whatever happens, there are still three doors and the car must be behind one of them. So there's still a ⅔ probability that the car is behind one of the two doors I didn't choose.

Here’s where the magic happens. When the host opens a door and shows me a goat, there’s now a 0 probability that the car’s behind that door. But there was a ⅔ probability the car was behind one of the two doors before, so this must mean there’s a ⅔ probability the car is behind the remaining door!

There are more formal proofs of the correctness of this solution, but I won’t go into them here. For those of you into Bayes' theorem, there’s a really nice formal proof.

I know some of you are probably completely unconvinced. I was at first too. Years ago, I wrote a simulator and did 1,000,000 simulations of the game. Guess what? Sticking gave a ⅓ probability and changing gave a ⅔ probability. You don’t even have to write a simulator anymore, there are many websites offering simulations of the game so you can try different strategies.
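If you'd rather see for yourself, here's a minimal simulator of the kind I'm describing (a sketch I'd write today, not my original code):

    import random

    def play(switch, trials=100_000):
        """Simulate the game; return the fraction of wins for the given strategy."""
        wins = 0
        for _ in range(trials):
            car = random.randrange(3)        # door hiding the car
            choice = random.randrange(3)     # contestant's first pick
            # Host opens a door that is neither the pick nor the car.
            opened = next(d for d in range(3) if d != choice and d != car)
            if switch:
                # Switch to the one remaining closed door.
                choice = next(d for d in range(3) if d != choice and d != opened)
            wins += (choice == car)
        return wins / trials

    print("Stick: ", play(switch=False))   # ~0.333
    print("Switch:", play(switch=True))    # ~0.667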

If you want to investigate the problem in-depth, read Rosenhouse's book. It's 174 pages on this problem alone, covering the media furor, basic probability theory, Bayes theory, and various variations of the game. It pretty much beats the problem to death.

The Monty Hall problem is a fun problem, but it does serve to illustrate a more serious point. Probability theory is often much more complex than it first appears and the truth can be counter-intuitive. The problem teaches us humility. If you’re making business decisions on multiple probabilities, are you sure you’ve correctly worked out the odds?

References

  • The Wikipedia article on the Monty Hall problem is a great place to start.
  • New York Times article about the 1990 furor with some background on the problem.
  • Washington Post article on the problem.
  • 'The Monty Hall Problem', Jason Rosenhouse - is an entire book on various aspects of the problem. It's 174 pages long but still doesn't go into some aspects of it (e.g. the quantum variation).

Sunday, February 16, 2020

Coin tossing: more interesting than you thought

Are the books right about coin tossing?

Almost every probability book and course starts with simple coin-tossing examples, but how do we know that the books are right? Has anyone tossed coins several thousand times to see what happens? Does coin-tossing actually have any relevance to business? (Spoiler alert: yes it does.) Coin tossing is boring, time-consuming, and badly paid, so there are two groups of people ideally suited to do it: prisoners and students.


(A Janus coin. Image credit: Wikimedia Commons. Public domain.)

Prisoner of war

John Kerrich was an English/South African mathematician who went to visit in-laws in Copenhagen, Denmark. Unfortunately, he was there in April 1940 when the Nazis invaded. He was promptly rounded up as an enemy national and spent the next five years in an internment camp in Jutland. Being a mathematician, he used the time well and conducted a series of probability experiments that he published after the War [Kerrich]. One of these experiments was tossing a coin 10,000 times. The results of the first 2,000 coin tosses are easily available on Stack Overflow and elsewhere, but I've not been able to find all 10,000, except in outline form.

We’re going to look at the cumulative mean of Kerrich’s data. To get this, we’ll score a head as 1 and a tail as 0. The cumulative mean is the mean of all the scores we’ve seen so far; if after 100 tosses there are 55 heads, the cumulative mean is 0.55, and so on. Of course, we expect it to go to 0.5 ‘in the long run’, but how long is the long run? Here’s a plot of Kerrich’s data for the first 2,000 tosses.

(Kerrich's coin-flipping data up to 2,000 tosses.)
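If you want to reproduce this kind of plot, the calculation is just a running average. Here's a minimal sketch using simulated tosses as a stand-in, since (as I note below) the full individual-toss data isn't easily available:

    import random

    # Simulated tosses standing in for the real data: 1 = head, 0 = tail.
    tosses = [random.randint(0, 1) for _ in range(2000)]

    heads = 0
    cumulative_mean = []
    for i, toss in enumerate(tosses, start=1):
        heads += toss
        cumulative_mean.append(heads / i)

    # cumulative_mean[99] is the proportion of heads after 100 tosses,
    # cumulative_mean[-1] is the proportion after all 2,000 tosses.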

I don’t have all of Kerrich’s tossing data for individual tosses, but I do have his cumulative mean results at different numbers of tosses, which I’ve reproduced below.

Number of tosses    Mean     Confidence interval (±)
10                  0.4      0.303
20                  0.5      0.219
30                  0.566    0.177
40                  0.525    0.155
50                  0.5      0.139
60                  0.483    0.126
70                  0.457    0.117
80                  0.437    0.109
90                  0.444    0.103
100                 0.44     0.097
200                 0.49     0.069
300                 0.486    0.057
400                 0.498    0.049
500                 0.51     0.044
600                 0.52     0.040
700                 0.526    0.037
800                 0.516    0.035
900                 0.509    0.033
1090                0.461    0.030
2000                0.507    0.022
3000                0.503    0.018
4000                0.507    0.015
5000                0.507    0.014
6000                0.502    0.013
7000                0.502    0.012
8000                0.504    0.011
9000                0.504    0.010
10000               0.5067   0.009

Do you find anything surprising in these results? There are at least two things I constantly need to remind myself of when I’m analyzing A/B test results, and simple coin tossing serves as a good wake-up call.

The first piece is how many tosses you need to do to get reliable results. I won’t go into probability theory too much here, but suffice to say, we usually quote a range, called the confidence interval, to describe our level of certainty in a result. So a statistician won’t say 0.5, they’d say 0.5 +/- 0.04. You can unpack this to mean “I don’t know the number exactly, but I’m 95% sure it lies in the range 0.46 to 0.54”. It’s quite easy to calculate a confidence interval for an unbiased coin for different numbers of tosses. I've put the confidence interval in the table above.
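For an unbiased coin, one standard way to do this is the normal approximation, 1.96 × √(0.25/n) for a 95% interval. Here's a sketch of that calculation; it gives values in line with the column in the table above (small rounding differences aside):

    import math

    def ci_half_width(n, p=0.5, z=1.96):
        """Approximate 95% confidence interval half-width for a proportion after n tosses."""
        return z * math.sqrt(p * (1 - p) / n)

    for n in (100, 1000, 10000):
        print(n, round(ci_half_width(n), 3))
    # 100 0.098, 1000 0.031, 10000 0.01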

The second piece is the structure of the results. Naively, you might have thought the cumulative mean would smoothly approach 0.5, but it doesn’t. The chart above shows a ‘blip’ around 100 where the results seem to change, and this kind of ‘blip’ happens very often in simulation results.

There’s a huge implication for both of these pieces. A/B tests are similar in some ways to coin tosses. The ‘blip’ reminds us we could call a result too soon and the number of tosses needed reminds us that we need to carefully calculate the expected duration of a test. In other words, we need to know what we're doing and we need to interpret results correctly.
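The same approximation, run backwards, gives a rough feel for how long a test has to run. This is a sketch only, and the ±0.01 target is just an example:

    import math

    def samples_needed(half_width, p=0.5, z=1.96):
        """Tosses (or samples) needed before the 95% interval shrinks to +/- half_width."""
        return math.ceil((z / half_width) ** 2 * p * (1 - p))

    print(samples_needed(0.01))   # 9604 tosses for a +/- 0.01 interval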

Students

In 2009, two Berkeley undergraduates, Priscilla Ku and Janet Larwood, tossed a coin 20,000 times each and recorded the results. It took them about one hour a day for a semester. You can read about their experiment here. I've plotted their results on the chart below.

The results show a similar pattern to Kerrich’s. There’s a ‘blip’ in Priscilla's results, but the cumulative mean does tend to 0.5 in the ‘long run’ for both Janet and Priscilla.

These two are the most-quoted coin-tossing results you'll see on the internet, but in textbooks Kerrich’s story is told more often because it’s so colorful. However, others have spent serious time tossing coins and recording the results; they’re less famous because they only quoted the final numbers and didn’t publish the entire dataset. In 1900, Karl Pearson reported the results of tossing a coin 24,000 times (12,012 heads), which followed on from the results of Count Buffon, who tossed a coin 4,040 times (2,048 heads).

Derren Brown

I can’t leave the subject of coin tossing without mentioning Derren Brown, the English mentalist. Have a look at this YouTube video, in which he flips an unbiased coin and gets ten heads in a row. It’s all one take and there’s no trickery. Have a think about how he might have done it.

Got your ideas? Here’s how he did it: the old-fashioned way. He recorded himself flipping coins until he got ten heads in a row. It took hours.

But what if?

So far, all the experimental results match theory exactly and I expect they always will. I had a flight of fancy one day that there’s something new waiting for us out past 100,000 or 1,000,000 tosses - perhaps theory breaks down as we toss more and more. To find out if there is something there, all I need is a coin and some students or prisoners.

More technical details

I’ve put some coin tossing resources on my Github page under the coin-tossing section.

  • Kerrich is the Kerrich data set out to 2,000 tosses in detail and out to 10,000 tosses in summary. The Python code kerrich.py displays the data in a friendly form.
  • Berkeley is the Berkeley dataset. The Python code berkeley.py reads in the data and displays it in a friendly form. The file 40000tosses.xlsx is the Excel file containing the Berkeley data.
  • coin-simulator is some Python code that shows multiple coin-tossing simulations. It's built as a Bokeh app, so you'll need to install the Bokeh module to use it.

References

[Kerrich] “An Experimental Introduction to the Theory of Probability” - Kerrich’s 1946 monograph on his wartime exploration of probability in practice.

Thursday, January 16, 2020

Correlation does not imply causation

Correlation is not causation

Because they’ve misunderstood one of the main rules of statistical evidence, I’ve seen people make serious business mistakes and damage their careers. The rule is simple but subtle: correlation is not causation. I’m going to explain what this means and show you cases where it’s obviously true, and some cases where it’s less obvious. Let’s start with some definitions.

Clearly, causation means one thing causes another. For example, prolonged exposure to ultraviolet light causes sunburn, the Vibrio cholerae bacteria causes cholera, and recessions cause bankruptcies. 

What is correlation?

Correlation occurs when two things vary in the same way. For example, lung cancer rates vary with the level of smoking, commuting times vary with the state of the economy, and health and longevity are correlated with income and wealth. The relationship usually becomes clear when we plot the data out, but it’s very rarely perfect. To give you a sense of what I mean, I’ve taken the relationship between brain mass and body mass in mammals and plotted the data below; each dot is a different type of mammal [Rogel-Salazar].

The straight line on the chart is a fit to the data. As you can see, there’s a relationship between brain and body mass but the dots are spread. 

We measure how well two things are correlated with something called the correlation coefficient, r. The closer r is to 1 (or -1), the better the correlation (this is a gross simplification). I typically look for r to be above 0.8 (or below -0.8). For the brain and body data above, r is 0.89, so the correlation is ‘good’.
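If you want to compute r yourself, NumPy (or scipy.stats.pearsonr) will do it in a line. Here's a toy sketch; the numbers are made up for illustration and are not the mammal dataset:

    import numpy as np

    x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0])   # made-up data
    y = np.array([2.3, 4.0, 6.2, 8.1, 9.8, 12.5])  # made-up data

    # Pearson correlation coefficient between x and y.
    r = np.corrcoef(x, y)[0, 1]
    print(round(r, 3))   # close to 1: x and y are strongly (positively) correlated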

For causation to exist, to say that A causes B, we must be able to observe the correlation between A and B. If sunscreen is effective at reducing sunburn we should observe increased sunscreen use leading to reduced sunburn. However, we need more than correlation to prove causation (I’m skipping over details to keep it simple). 

Correlation does not imply causation

Here’s the important bit: correlation does not imply causation. Just because two things are correlated does not imply that one causes the other. Two things could be very well correlated and there could be no causal relationship between them at all. There could be a confounding factor that causes both variables to move in the same way. In my view, misunderstanding this is the single biggest problem in data analysis. 

The excellent website Spurious Correlations shows the problem in a fun way; I’ve adapted an example from the website to illustrate my point. Here are two variables plotted against time.



(Image credit: Spurious Correlations)

Imagine one of the variables was sales revenue and the other was the number of hours of sales effort. The correlation between them is very high (r=0.998). Would you say the amount of sales effort causes the sales revenue? If sales revenue was important to you, would you invest in more sales hours? If I presented this evidence to you in an executive meeting, what would you say?

Actually, I lied to you. The red line is US spending on science, space, and technology and the black line is suicides by hanging, strangulation, and suffocation. How can these things be related to each other? Because there’s some other variable or variables both of them depend on, or, frankly, just by chance. Think for a minute about what happens as an economy grows: all kinds of expenditure go up; sales of expensive wine go up, and people spend more on their houses. Does that mean sales of expensive wine cause people to spend more on houses?

(On the Spurious Correlations website there are a whole bunch of other examples, including: divorce rates in Maine correlated with per capita consumption of margarine, total revenue generated by arcades correlated with the age of Miss America, and the number of letters in the winning word of the Scripps National Spelling Bee correlated with the number of people killed by venomous spiders.)

The chart below shows the relationship between stork pairs and human births for several European locations 1980-1990 [Matthews]. Note r is high at 0.85.

Is this evidence that storks deliver babies? No. Remember correlation is not causation. There could well be many confounding variables here, for example, economic growth leading to more leisure time. Just because we don’t know what the confounding factors are doesn’t mean they don’t exist.

My other (possibly apocryphal) example concerns lice. In Europe in the middle ages, lice were considered beneficial (especially for children) because sick people didn’t have as many lice [Zinsser]. Technically, this type of causation mistake is known as the post hoc ergo propter hoc fallacy if you want to look it up.

Correlation/causation offenders

The causation/correlation problem often rears its ugly head in sales and marketing. Here are two examples I’ve seen, with the details disguised to protect the guilty.

I’ve seen a business analyst present the results of detailed sales data modeling and make recommendations for change based on the correlation/causation confusion. The sales data set was huge and they’d found a large number of correlations in the data (with good r values). They concluded that these correlations were causation, for example, in area X sales scaled with the number of sales reps and they concluded that more reps = more sales. They made a series of recommendations based on their findings. Unfortunately, most of the relationships they found were spurious and most of their recommendations and forecasts were later found to be wrong. The problem was, there were other factors at play that they hadn’t accounted for. It doesn’t matter how complicated the model or how many hours someone has put in, the same rule applies; correlation does not imply causation.

The biggest career blunder I saw was a marketing person claiming that visits to the company website were driving all company revenue; I remember them talking about the correlation and making the causation claim to get more resources for their group. Unfortunately, later on, revenue went down for reasons (genuinely) unrelated to the website. The website wasn’t driving all revenue - it was just one of a number of factors, including the economy and the product. However, their claim to be driving all revenue wasn’t forgotten by the executive team, and the marketing person paid the career price.

Here’s what I think you should take away from all this. Just because two things appear to be correlated doesn’t mean there’s causation. In business, we have to make decisions on the basis of limited evidence and that’s OK. What’s not OK is to believe there’s evidence when there isn’t - specifically, to infer causation from correlation. Statistics and experience teach us humility. The UK Highway Code has some good advice here: a green light doesn’t mean go, it means ‘proceed with caution'.

References

[Matthews] ‘Storks Deliver Babies (p=0.008)’, Robert Matthews, Teaching Statistics. Volume 22, Number 2, Summer 2000 
[Rogel-Salazar] Rogel-Salazar, Jesus (2015): Mammals Dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.1565651.v1 
[Zinsser] ‘Rats, lice, and history’, Hans Zinsser, Transaction Publishers, London, 2008