Showing posts with label probability. Show all posts

# Why look back at basic probability?

Bayes' theorem lies at the heart of much of modern machine learning. Although it's relatively simple to understand, you do need some grounding in probability theory. This blog post is all about getting you up close and personal with probability theory so I can tell you all about Bayes in a later post.

(You can work out the probability aliens are on earth given that Elvis lives. Image source: Pixabay Author: Pete Linforth License: Pixabay.)

# The very basics

Think of some event that might occur in the future, say winning the lottery, buying a new car, or England winning the World Cup. We can estimate the probability of these events happening; we can call the event A and the probability of the event occurring P(A). If the event is certain to occur, then P(A) = 1; if it's certain not to occur, then P(A) = 0; and in all cases $$0 \leq P(A) \leq 1$$.

We'll consider the probability of several events I'm going to call A, B, C, etc. These can be any events at all, including aliens landing, Elvis making a comeback, or getting a pay raise at the end of the year.

# The complementary rule

If the probability of an event A occurring is P(A), the probability of it not occurring is $$1 - P(A)$$. This is called the complement and different authors use different notation for it:

$1 - P(A) = P(A^c) = P(A-) = P( \bar A) = P( \raise.25ex\hbox{\scriptstyle\sim} A)$

Let me give you an example using one notation. Imagine 1% of the population has a disease and 99% don't, then:

$1 = P(D+) + P(D-) = 0.01 + 0.99$

# Independence

Independence is a huge issue in probability modeling and it can lead to big errors if not handled correctly. On the face of it, it's a simple idea, but there are subtleties.

Two events are independent if one does not affect or influence the other in any way (alternatively, one event does not give any information about the other). For example, the odds of Joe Biden winning the 2020 Presidential election do not depend on the odds of New Zealand opening its borders to international travelers. Looking at things the other way, the odds of me winning the lottery are dependent on my purchasing a ticket (I have to buy a ticket to stand any chance of winning) - these are dependent events. I'm sure you can think of many other examples.

Independent and dependent events are treated very differently mathematically, and the big mistake comes when events that are not independent are treated as if they were. For example, an organization might run many opinion polls in an election. The errors in the polls will not be independent of one another because the organization may well have a systemic bias that affects all their polls. There are similar problems in epidemiology; if you and I live together, my probability of catching an infectious disease is not independent of your probability of catching an infectious disease. The most famous example of confusing independent and dependent events was the subprime mortgage scandals of 2008 onwards. The analysts who developed the subprime mortgage default models assumed that mortgage defaults were independent of one another. Unfortunately for all of us, that wasn't the case in 2008. Economic conditions led to many defaults, which in turn led to broader financial problems, which in turn led to more defaults. In 2008 and onwards, subprime mortgage defaults were dependent on one another.

# Disjoint (mutually exclusive) events

Two events are disjoint if they're mutually exclusive, in other words, if both can't happen. For example, only one of Joe Biden or Donald Trump can win the election - they both can't be President. In notation I'll explain later: $$P(A \ and \ B) = P(A \cap B) = 0$$.

# Probability A and B occurring (intersection) - the multiplication rule

What's the probability of A and B occurring (also known as their joint or conjoint probability)? Here's where we run into some notation issues. Some sources write 'and' and some use the symbol '$$\cap$$' - both mean the same thing.

Here's the general rule, which is the one you need for dependent events:

$P(A \ and \ B) = P(A \cap B) = P(A) P(B | A)$

Here's the rule for independent events:

$P(A \ and \ B) = P(A \cap B) = P(A) P(B)$

Here's the rule for disjoint events:

$P(A \ and \ B) = P(A \cap B) = 0$

The and relationship is commutative:

$P(A \cap B) = P(B \cap A)$
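To make the dependent-events rule concrete, here's a minimal Python sketch. The scenario (drawing two aces in a row, without replacement, from a standard 52-card deck) is my own illustrative assumption. It compares the formula with a brute-force count of all possible two-card draws.

```python
from fractions import Fraction
from itertools import permutations

# Illustrative assumption: a standard 52-card deck in which 4 cards are aces.
deck = ["ace"] * 4 + ["other"] * 48

# Multiplication rule for dependent events: P(A and B) = P(A) * P(B | A)
p_a = Fraction(4, 52)          # A: the first card is an ace
p_b_given_a = Fraction(3, 51)  # B given A: the second card is an ace, with one ace gone
p_rule = p_a * p_b_given_a

# Brute-force check: count ordered pairs of distinct cards that are both aces.
pairs = list(permutations(range(len(deck)), 2))
both_aces = sum(1 for i, j in pairs if deck[i] == "ace" and deck[j] == "ace")
p_count = Fraction(both_aces, len(pairs))

print(p_rule, p_count)  # both are 1/221
```

The two draws are dependent because removing the first ace changes what's left in the deck, which is why P(B | A) is 3/51 rather than 4/52.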

# Probability of A or B occurring (union) - the addition rule

What's the probability of A or B occurring? Some sources write 'or' and some write '$$\cup$$'. Here's the rule:

$P(A \ or \ B) = P(A \cup B) = P(A) + P(B) - P(A \cap B)$

$= P(A) + P(B) - P(A)P(B | A)$

The or relationship is commutative:

$P(A \cup B) = P(B \cup A)$

For disjoint events, the addition rule simplifies to:

$P(A \ or \ B) = P(A \cup B) = P(A) + P(B)$

because from before we have:

$P(A \cap B) = 0$
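Here's a small Python sketch of the addition rule. The two-dice events are my own choice for illustration; the code enumerates all 36 outcomes and checks both the general rule and the simpler disjoint case.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a true/false test on one outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def first_is_six(o):   # event A
    return o[0] == 6


def second_is_six(o):  # event B
    return o[1] == 6


p_a = prob(first_is_six)
p_b = prob(second_is_six)
p_and = prob(lambda o: first_is_six(o) and second_is_six(o))
p_or = prob(lambda o: first_is_six(o) or second_is_six(o))

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(p_or, p_a + p_b - p_and)  # both are 11/36

# Disjoint events: a total of 2 and a total of 12 can't both happen.
p_total_2 = prob(lambda o: sum(o) == 2)
p_total_12 = prob(lambda o: sum(o) == 12)
p_either = prob(lambda o: sum(o) in (2, 12))
print(p_either, p_total_2 + p_total_12)  # both are 1/18
```

Without subtracting the intersection, the double six would be counted twice, once for each die.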

# Conditional probability - the conditional rule

What's the probability I have a disease given I've tested positive for the disease? We use the | symbol to mean "given that", so P(A | B) means the probability of A happening given that B has occurred.  Here are some examples from everyday life:

• What's the probability I win the lottery given that I've bought a ticket?
• What's the probability I will get a degree if I go to college?
• What's the probability I will have an accident if I'm driving and if it's snowing and if it's dark?

The interesting thing about conditional probability is that it can be quite different from the 'raw' probability. For example, let's say you're from a poor family, you might only have a 10% chance of getting a degree, but if you get accepted to a college, the probability might shoot up to 50%, and if you actually go to college, the probability may get to 95%. The probability can change quite substantially depending on new information (as we'll see with Bayes' theorem).

The general rule is:

$P(A | B) = \frac{P(B \cap A)}{P(B)}$

If A and B are independent (A does not depend on B), then P(A | B) = P(A).

# The law of total probability

There's a general form of this law and a more specific form. Because the specific form will be useful for Bayesian work later, we'll start with that.

$P(A+) = P(A+ \cap \ B+) + P(A+ \cap \ B-)$

In words, the probability of an event A+ occurring is the probability of A+ and B+ both occurring, plus the probability of A+ occurring and B+ not occurring (B-). This might be clearer if we remember $$1 = P(B+) + P(B-)$$ and we think of probabilities using a Venn diagram.

The more general form of this law, where the events $$B_i$$ are mutually exclusive and between them cover every possibility, is:

$P(A) = \sum_i{P(A \cap B_i)} = \sum_i{P(A | B_i)P(B_i)}$
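Here's a rough Python sketch of the law of total probability using the disease example from earlier (1% of the population has the disease). The test characteristics (a 95% chance of a positive result if you have the disease, a 5% chance if you don't) are numbers I've made up purely for illustration.

```python
from fractions import Fraction

p_disease = Fraction(1, 100)   # P(D+), the 1% prevalence used earlier
p_no_disease = 1 - p_disease   # P(D-), from the complementary rule

# Assumed test characteristics, for illustration only.
p_pos_given_disease = Fraction(95, 100)    # P(T+ | D+)
p_pos_given_no_disease = Fraction(5, 100)  # P(T+ | D-)

# Law of total probability:
# P(T+) = P(T+ | D+) P(D+) + P(T+ | D-) P(D-)
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_no_disease * p_no_disease)

print(p_positive, float(p_positive))  # 59/1000 0.059
```

With these assumed numbers, most positive results come from the large group of people who don't have the disease, simply because that group is so much bigger.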

# The law of total probability and conditional probabilities

One of the most useful forms of Bayes' theorem relies on the combination of the law of total probability and conditional probability. Here's the key relationship:

$1 = P(A | B) + P(\bar A | B)$

Let me put this into words. If event B happens, then either A or not A happens, there are no other options, so the two probabilities must sum to 1.

# What use is probability theory?

I grew up hearing about the value of 'common sense', but probability theory often gives results that seem very counterintuitive and 'common sense' can lead you wildly astray. A fun example is the Monty Hall problem, but there are lots of other examples in the real world where the probability of something happening is not what it appears to be at first - and they're not so fun. The counter-intuitive example you find most often on the internet is the probability that you have a disease given a positive test result; it's mostly not what you think.

Bayes' theorem takes us into the world of the counter-intuitive and I'll talk about Bayes in a future blog post.

# An offer you can't refuse?

Imagine you're in a casino playing craps, a game where you bet on the outcome of two dice thrown at the same time. The probability of a double six coming up is 1/36, but no one has thrown a double six for over 110 throws. The table is starting to get crowded and noisy with people betting on a double six. It's due to come up, and it must come up soon.

(Still no double six. Source: Wikimedia Commons. License: Creative Commons. Author: Gaz.)

A new player rolls the dice; snake-eyes (double ones) - still no double six.

You feel a tap on your elbow. A lady in a cocktail dress whispers to you that she'll give you odds of 20 to 1 for a double six.

Another player rolls the dice; easy-four (one and three) - the expectation for a double six mounts.

Your new friend whispers that she'll reduce the odds soon; she asks if you want to take the bet.

It's now 130 throws since a double six has occurred and it should have occurred 3 or 4 times by now.

Do you take the bet?

# The gambler's fallacy

The gambler's fallacy is the belief that the outcome of a random event is somehow influenced by previous random events. In our craps case, some examples might be:

• double six hasn't come up in 130 throws, so it's much more likely to come up now (the probability is higher than 1/36)
• double one has just come up, therefore it's not likely to come up again soon (the probability is less than 1/36).

It's a fallacy because each roll of the dice is completely independent; it doesn't matter what the previous throws were. There could have been 1,000 throws without a double six, but the probability of a double six will always be 1/36. The same logic applies to the snake-eyes example: if a snake-eyes has been thrown, the probability of throwing another snake-eyes immediately after is still 1/36.

Let me lay this out even more starkly, in craps:

• At the very first roll of the dice, the probability of a double six is 1/36.
• After ten rolls of the dice, the probability of the next roll being a double six is 1/36.
• After 100 rolls without a double six, the probability of the next roll being a double six is 1/36.
• After 200 rolls without a double six, the probability of the next roll being a double six is 1/36.
• After 1,000 rolls without a double six, the probability of the next roll being a double six is 1/36.
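If you don't believe the list above, you can check it with a simulation. Here's a minimal Python sketch: it rolls two dice many times and measures how often a double six turns up on rolls that follow a long drought. The drought length of 50, the number of rolls, and the seed are arbitrary choices of mine. It also works out how likely a 130-throw drought is in the first place.

```python
import random


def double_six_rate_after_drought(n_rolls, drought, seed=0):
    """Fraction of rolls that are double six, counting only rolls that
    come after at least `drought` consecutive rolls with no double six."""
    rng = random.Random(seed)
    since_last = 0  # rolls since the last double six
    trials = hits = 0
    for _ in range(n_rolls):
        die1, die2 = rng.randint(1, 6), rng.randint(1, 6)
        is_double_six = die1 == 6 and die2 == 6
        if since_last >= drought:
            trials += 1
            hits += is_double_six
        since_last = 0 if is_double_six else since_last + 1
    return hits / trials


rate = double_six_rate_after_drought(n_rolls=400_000, drought=50)
print(f"after a drought: {rate:.4f}, any roll: {1 / 36:.4f}")

# A long drought is unusual, but that's a statement about the past.
p_drought = (35 / 36) ** 130
print(f"P(no double six in 130 throws) = {p_drought:.4f}")
```

The rate after a drought should come out close to 1/36, the same as for any other roll. The drought itself is unlikely, but once it has happened, it tells you nothing about the next throw.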

Otherwise rational people are fooled by the gambler's fallacy all the time. As the money increases and the emotion heightens, the gambler's fallacy becomes easier and easier to fall for, as we'll see.

# The Italian lottery

The story starts in Venice, Italy in May 2003. The Venice lottery was a game where 6 numbered balls (plus a bonus ball) were selected from a set of 90 numbered balls. The lottery was run twice a week. Each number should come out on average once every 7-8 weeks. As with all government-sponsored lotteries, the results were well-publicized.

In May 2003, the number 53 came up. Then it didn't come up again.

By October, people realized the number 53 was overdue. They started to gamble on 53 occurring - it was overdue, so it must come up. But 53 just didn't come up.

News of the 53 drought started to spread, and more and more Italians started to bet that 53 would occur, but it didn't. It didn't come up in November or December either.

In January of 2004, a woman from Carrara committed suicide because she'd spent her family's life savings gambling that 53 would come up. It didn't.

Still, 53 didn't come up.

People went crazy betting money that 53 would come up; they became known as '53 addicts'. They were sure it must come up. Sadly, it didn't. A man from Signa shot his wife, his son, and himself after losing money gambling on 53.

Still, 53 didn't come up.

Italians gambled and lost a huge amount of money on 53, an estimated 4 billion Euros. They had fallen for the gambler's fallacy and believed that 53 must come up soon.

Eventually, 53 did come up - in February 2005, after 182 draws (remember, each draw was seven balls).

The Venice lottery made a lot of money, but the Italian gamblers did not.

# How the cocktail dress lady (and casinos) makes money

To understand if the cocktail dress lady was offering a good deal, we need to relate probability to odds.

The probability of a double six is 1/36.

The odds are the ratio of the probability the event will occur to the probability the event will not occur:

$odds = \frac{P}{1-P}$

The odds of a double six are:

$odds_{66} = \frac{\frac{1}{36}}{\frac{35}{36}} = \frac {1}{35}$

which a bookie might quote as 35 to 1.

Generally speaking, casinos and bookies make money in one of two ways:

• The probabilities don't add up to 1.
• They rely on the gambler's fallacy and offer worse odds than a fair analysis would suggest.

Let's imagine there are ten horses in a race. Each horse has a 10% chance of winning, which are odds of 9 to 1. If you win, you get your stake money back, so a winning bet of \$1 gives you \$10. If ten punters each bet \$1, one on each horse, the bookie takes \$10, but one of the horses must win, so the bookie pays out \$10.

(Bookmakers make money. You don't. Image source: Wikimedia Commons. License: Creative Commons. Author: Grand Island Tourism.)

To make money, the bookie reduces the odds. Instead of offering 9 to 1 on each of the horses, the bookie offers 8 to 1. The bookie still takes in \$10, but this time only pays out \$9. In the real world, it's more complicated, but you get the idea.

The other way to make money is to underprice probabilities. A double six should be offered at 35 to 1, but you could offer it at 20 to 1. This is a horrible deal, but if gamblers have a bad case of the gambler's fallacy, they may be convinced the probability is much higher than 1/36 and they may even view a horrible deal as the deal of a lifetime. The casino, or the lady in the cocktail dress, makes money by knowing the odds and knowing when to offer a deal that seems attractive, but isn't.

Not only should you not accept the 20-to-1 offer, but you should also offer it to other players.
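To see just how bad the 20-to-1 offer is, here's a short Python sketch that works out the expected profit per unit staked. The function name and the unit stake are my own choices for illustration; the probabilities and odds are the ones from the text.

```python
from fractions import Fraction


def expected_profit(p_win, odds_to_one, stake=1):
    """Expected profit of a bet paying odds_to_one : 1 (stake returned on a win)."""
    return p_win * odds_to_one * stake - (1 - p_win) * stake


p_double_six = Fraction(1, 36)

fair = expected_profit(p_double_six, 35)     # the fair price, 35 to 1
offered = expected_profit(p_double_six, 20)  # the cocktail dress lady's price
print(fair, offered)  # 0 -5/12

# The ten-horse race at 8 to 1: each price implies a probability of 1/9,
# and the ten implied probabilities add up to more than 1.
implied_total = sum(Fraction(1, 8 + 1) for _ in range(10))
print(implied_total)  # 10/9
```

At 20 to 1, you lose 5/12 of your stake on average every time you bet. The amount by which the implied probabilities exceed 1 is the bookie's margin.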

# Gambler's fallacy in Reno, Nevada and Monte Carlo

Obviously, there are naive gamblers in Las Vegas, but do people really fall for the gambler's fallacy at the roulette table? After all, you have to have some level of sophistication to understand and play the game, so surely gamblers are savvy and know how to price bets appropriately? It seems that they don't always.

Using videotape data supplied by a casino in Reno, Nevada, two researchers tracked the pattern of gambling on roulette. If gamblers have fallen for the gambler's fallacy, you might expect to see certain patterns of betting; for example, if red hasn't come up as often as expected, they might bet more on red. The researchers found small, but significant, examples of the gambler's fallacy. The reality is, then, that there are people who fall for the fallacy, even those playing a sophisticated game like roulette.

(Image source: Wikimedia Commons. License: Creative Commons. Author: Ken Lund.)

Another object lesson in the gambler's fallacy occurred at a roulette table, this time at a casino in Monte Carlo. In 1943, the ball landed on red 32 times in a row. The people who thought black must come up were cleaned out.

# The gambler's fallacy elsewhere

The gambler's fallacy has been an active area of research for some time. Variations of it have been found in different places:

Let's imagine you're an asylum judge. You're aware of the average 'success' rate for applicants and you don't want to be too far from the average. Let's assume that cases are randomly assigned (deserving and undeserving). By random chance, you might get a long string of deserving or undeserving cases, maybe as many as twenty in a row. The gambler's fallacy may kick in after a series of similar cases. For example, the first ten cases were deserving, so the eleventh 'must' be undeserving; as a result, you judge it more harshly based on expectation.

# The gambler's fallacy in business

If you listen closely enough, you hear business people commit the gambler's fallacy all the time. How often have you heard these kinds of phrases?

• We've won the last 8 contracts, so we must win the next one.
• We just failed to land the last 6 deals, so the odds of us landing the next deal are high.

Despite what people say, business can be strongly driven by belief and not rationality. If everyone needs a deal to be landed, then the collective view might become that a deal will be landed, regardless of what a realistic measure of the probabilities is.

# How to guard against the gambler's fallacy

There's something about humanity and our (mis)understanding of statistics that makes us vulnerable to the gambler's fallacy. The best teacher might be experience. How many Italians who bet on 53 would do so again? There's some evidence that the gambler's fallacy is particularly strong when the data evolves over time, which ties in with the Italian lottery and casino examples. Perhaps the best defense is to take a step back and view the data as a whole, then make a decision away from the influence of others.

The existence of opulent casinos should be a lesson that those who understand probability can make money from those who do not.

# Polls to probabilities

How likely is it that your favorite candidate will win the election? If your candidate is ahead of their opponent by 5%, are they certain to win? What about 10%? Or if they're down by 2%, are they out of the race? Victory probabilities are related to how far ahead or behind a candidate is in the polls, but the relationship isn't a simple one and has some surprising consequences as we'll see.

# Opinion poll example

Let's imagine there's a hard-fought election between candidates A and B. A newspaper publishes an opinion poll a few days before the election:

• Candidate A: 52%
• Candidate B: 48%
• Sample size: 1,000

Should candidate A's supporters pop the champagne and candidate B's supporters start crying?

# The spread and standard error

Let's use some standard notation. From the theory of proportions, the mean and standard error for the proportion of respondents who chose A is:

$p_a = {n_a \over n}$ $\sigma_a = { \sqrt {{p_a(1-p_a)} \over n}}$

where $$n_a$$ is the number of respondents who chose A and $$n$$ is the total number of respondents. If the proportion of people who answered candidate B is $$p_b$$, then obviously, $$p_a + p_b = 1$$.

Election probability theory usually uses the spread, $$d$$, which is the difference between the candidates:

$d = p_a - p_b = 2p_a - 1$

From statistics theory, the standard error of $$d$$ is:

$\sigma_d = 2\sigma_a$

(These relationships are easy to prove, but a bit tedious. If anyone asks, I'll show the proof.)

Obviously, for a candidate to win, their spread, $$d$$, must be > 0.

# Everything is normal

From the central limit theorem (CLT), we know $$p_a$$ and $$p_b$$ are approximately normally distributed for a reasonably large sample, and also from the CLT, we know $$d$$ is approximately normally distributed. The next step to probability is viewing the normal distribution for candidate A's spread. The chart below shows the normal distribution with mean $$d$$ and standard error $$\sigma_d$$.

As with most things with the normal distribution, it's easier if we transform everything to the standard normal using the transformation:

$z = {(x - d) \over \sigma_d}$

The chart below is the standard normal representation of the same data.

The standard normal form of this distribution is a probability density function. We want the probability that $$d>0$$ which is the light green shaded area, so it's time to turn to the cumulative distribution function (CDF), and its complement, the complementary cumulative distribution function (CCDF).

# CDF and CCDF

The CDF gives us the probability that we will get a result less than or equal to some value I'll label $$z_c$$. We can write this as:

$P(z \leq z_c) = CDF(z_c) = \phi(z_c)$

The CCDF is defined so that:

$1 = P(z \leq z_c) + P(z > z_c)= CDF(z_c) + CCDF(z_c) = \phi(z_c) + \phi_c(z_c)$

Which is a long-winded way of saying the CCDF is defined as:

$CCDF(z_c) = P(z \gt z_c) = \phi_c(z_c)$

The CDF is the integral of the PDF, and from standard textbooks:

$\phi(z_c) = {1 \over 2} \left( 1 + erf\left( {z_c \over \sqrt2} \right) \right)$

We want the CCDF, $$P(z > z_c)$$, which is simply 1 - CDF.

Our critical value occurs when the spread is zero. The transformation to the standard normal in this case is:

$z_c = {(x - d) \over \sigma_d} = {-d \over \sigma_d}$

We can write the CCDF as:

$\phi_c(z_c) = 1 - \phi(z_c) = 1- {1 \over 2} \left( 1 + erf\left( {z_c \over \sqrt2} \right) \right)$

$= 1 - {1 \over 2} \left( 1 + erf\left( {-d \over {\sigma_d\sqrt2}} \right) \right)$

We can easily show that:

$erf(x) = -erf(-x)$

Using this relationship, we can rewrite the above equation as:

$P(d > 0) = {1 \over 2} \left( 1 + erf\left( {d \over {\sigma_d\sqrt2}} \right) \right)$

What we have is an equation that takes data we've derived from an opinion poll and gives us a probability of a candidate winning.
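Here's a minimal Python sketch of that equation, using only the standard library's `math.erf`. The function name and signature are my own. It assumes a two-candidate poll with no undecided voters, so that the spread is $$2p_a - 1$$.

```python
import math


def win_probability(p_a, n):
    """P(d > 0) for a two-candidate poll where a proportion p_a of the
    n respondents chose candidate A (assumes 0 < p_a < 1)."""
    d = 2 * p_a - 1                           # the spread
    sigma_a = math.sqrt(p_a * (1 - p_a) / n)  # standard error of p_a
    sigma_d = 2 * sigma_a                     # standard error of the spread
    return 0.5 * (1 + math.erf(d / (sigma_d * math.sqrt(2))))


# The opinion poll example: candidate A on 52% with 1,000 respondents.
print(f"{win_probability(0.52, 1000):.2f}")  # roughly 0.90

# A dead heat gives an even chance.
print(win_probability(0.50, 1000))  # 0.5
```

The function will divide by zero if `p_a` is exactly 0 or 1, which is fine for a sketch but something to guard against in real code.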

# Probabilities for our example

For candidate A:

• $$n=1000$$
• $$p_a = {520 \over 1000} = 0.52$$
• $$\sigma_a = 0.016$$
• $$d = {{520 - 480} \over 1000} = 0.04$$
• $$\sigma_d = 0.032$$
• $$P(d > 0) = 90\%$$

For candidate B:

• $$n=1000$$
• $$p_b = {480 \over 1000} = 0.48$$
• $$\sigma_b = 0.016$$
• $$d = {{480 - 520} \over 1000} = -0.04$$
• $$\sigma_d = 0.032$$
• $$P(d > 0) = 10\%$$

Obviously, the two probabilities add up to 1. But note the probability for candidate A. Did you expect a number like this? A 4 percentage point lead in the polls giving a 90% chance of victory?

# Some consequences

Because the probability is based on $$erf$$, you can quite quickly get to highly probable events as I'm going to show in an example. I've plotted the probability for candidate A for various leads (spreads) in the polls. Most polls nowadays tend to have about 800 or so respondents (some are more and some are a lot less), so I've taken 800 as my poll size. Obviously, if the spread is zero, the election is 50%:50%. Note how quickly the probability of victory increases as the spread increases.

What about the size of the poll, how does that change things? Let's fix the spread to 2% and vary the size of the poll from 200 to 2,000 (the usual upper and lower bounds on poll sizes). Here's how the probability varies with poll size for a spread of 2%.
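If you want to reproduce these numbers yourself, here's a rough Python sketch that tabulates the victory probability, first against the spread for a poll of 800, then against poll size for a spread of 2%. The function name and the particular spreads and sizes I print are my own choices. It assumes a two-candidate race with no undecided voters and no systemic bias.

```python
import math


def victory_probability(spread, n):
    """P(d > 0) for a given poll spread (as a fraction) and poll size n."""
    p_a = (1 + spread) / 2                        # from d = 2 * p_a - 1
    sigma_d = 2 * math.sqrt(p_a * (1 - p_a) / n)  # standard error of the spread
    return 0.5 * (1 + math.erf(spread / (sigma_d * math.sqrt(2))))


print("Poll size 800, varying the spread")
for spread in (0.0, 0.01, 0.02, 0.04, 0.06, 0.08, 0.10):
    print(f"  spread {spread:>4.0%}: {victory_probability(spread, 800):.3f}")

print("Spread 2%, varying the poll size")
for n in (200, 500, 800, 1000, 2000):
    print(f"  n = {n:>5}: {victory_probability(0.02, n):.3f}")
```

The probability climbs with both the spread and the poll size, because a bigger poll shrinks the standard error and so makes the same lead more convincing.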

Now imagine you're a cynical and seasoned poll analyst working on candidate A's campaign. The young and excitable intern comes rushing in, shouting to everyone that A is ahead in the polls! You ask the intern two questions, and then, like the Oracle at Delphi, you predict happiness or not. What two questions do you ask?

• What's the spread?
• What's the size of the poll?

# What's missing

There are two elephants in the room, and I've been avoiding talking about them. Can you guess what they are?

All of this analysis assumes the only source of error is random noise. In other words, there's no systemic bias. In the real world, that's not true. Polls aren't wholly based on random sampling, and the sampling method can introduce bias. I haven't modeled it at all in this analysis. There are at least two systemic biases:

• Pollster house effects arising from house sampling methods
• Election effects arising from different population groups voting in different ways compared to previous elections.

Understanding and allowing for bias is key to making a successful election forecast. This is an advanced topic for another blog post.

The other missing item is more subtle. It's undecided voters. Imagine there are two elections and two opinion polls. Both polls have 1,000 respondents.

Election 1:

• Candidate A chosen by 20%
• Candidate B chosen by 10%
• Undecided voters are 70%

Election 2:

• Candidate A chosen by 55%
• Candidate B chosen by 45%
• Undecided voters are 0%

In both elections, the spread from the polls is 10%, so candidate A has the same high chance of winning in both elections, but this doesn't seem right. Intuitively, we should be less certain about an election with a high number of undecided voters. Modeling undecided voters is a topic for another blog post!

The best source of election analysis I've read is in the book "Introduction to data science" and the associated edX course "Inference and modeling", both by Rafael Irizarry. The analysis in this blog post was culled from multiple books and websites, each of which only gave part of the story.

• Forecasting the 2020 election: a retrospective
• What do presidential approval polls really tell us?
• Fundamentally wrong? Using economic data as an election predictor - why I distrust forecasting models built on economic and other data
• Can you believe the polls? - fake polls, leading questions, and other sins of opinion polling.
• President Hilary Clinton: what the polls got wrong in 2016 and why they got it wrong - why the polls said Clinton would win and why Trump did.
• Poll-axed: disastrously wrong opinion polls - a brief romp through some disastrously wrong opinion poll results.
• Who will win the election? Election victory probabilities from opinion polls
• Sampling the goods: how opinion polls are made - my experiences working for an opinion polling company as a street interviewer.
• The electoral college for beginners - how the electoral college works

## Wednesday, March 11, 2020

### Benford's Law: finding fraud and data oddities

What links fraud detection, old-fashioned log tables, and error detection in data feeds? Benford’s Law provides the link and I'll show you what it is and how you might use it.

Imagine I gave you thousands of invoices and asked you to record the first digit of the amount. Out of say, 10,000 invoices, how many would you expect to start with the number 1, how many with the number 2, and so on? Naively, you might expect 1,111 to start with a 1; 1,111 to start with a 2 and so on. But that’s not what happens in the real world. 1 occurs more often than 2, which occurs more often than 3, and so on.

The Benford’s Law story starts in 1881, when Simon Newcomb, an astronomer, was using some mathematical log tables. For those of you too young to know, these are tables of the logarithms of numbers, very useful in pre-calculator days. Newcomb noticed that the pages for logarithms beginning 1 were more well-thumbed than the other pages, indicating that people were looking for the logarithms of some numbers more than others. Being an academic, he published a paper on it.

In 1938, a physicist called Frank Benford looked at a number of datasets and found the same relationship between the first digits. For example, he looked at the first digit of addresses and found that 1 occurred more frequently than 2, which occurred more frequently than 3 and so on. He didn't just look at addresses, he looked at the first digit of physical constants, the surface area of rivers, and numbers in the Reader's Digest etc. Despite being the second person to discover this relationship, the law is named after him and not Newcomb.

It turns out, we can mathematically describe Benford’s Law as:

$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$

where d is a digit from 1 to 9 and P(d) is the probability of d being the first digit. If we plot it out we get:

This means that for some datasets we expect the first digit to be one 30.1% of the time, two 17.6% of the time, three 12.5% of the time, etc.
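These percentages are easy to reproduce. Here's a minimal Python sketch that evaluates the formula (with a base-10 logarithm) for each first digit; the variable name is my own.

```python
import math

# Benford's Law: P(d) = log10(1 + 1/d) for first digits d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, p in benford.items():
    print(f"{digit}: {p:.1%}")

# The nine probabilities cover every possible first digit, so they sum to 1.
print(round(sum(benford.values()), 10))  # 1.0
```

The first three lines of output are 30.1%, 17.6%, and 12.5%, matching the figures quoted above.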

The why of Benford’s Law is much too complex for this blog post. It was only recently (1998) proved by Hill [Hill] and involves digging into the central limit theorem and some very fundamental statistical and probability concepts.

Going back to my accounting example, it would seem all we have to do is plot the distribution for our invoice data and compare it to Benford’s Law. If there’s a difference, then there’s fraud. But the reality is, things are more complex than that.

Benford’s Law doesn’t apply everywhere; there are some conditions:

• The data set must vary over several orders of magnitude (e.g. from 1 to 1,000)
• The data set must have dimensions, or units. For example, Euros, or mm.
• The mean is greater than the median and the skew is positive.

Collins provides a nice overview of how it can be used to detect accounting fraud [Collins]. But Linville [Linville] has poked some practical holes in its use. He conducted an experiment using graduate students to create fake test invoices (this was a research exercise, not an attempt at fraud!) that were mixed in with simulated invoice data. He found that if the fake invoices were less than 10% or so of the total dataset, the deviations from Benford’s Law were too small to be reliably detected.

Benford’s Law actually applies to all digits, not just the first. We can plot out an expected distribution for two digits as I’ve shown below. This has also been used for fraud detection as you might expect.

You can use Benford's Law to detect errors in incoming data. Let's say you have a datafeed of user addresses. You know the house numbers should obey Benford's Law, so you can work out the distribution the data actually has and compare it to the theoretical Benford's Law distribution. If the difference is above some threshold, you can set an alert. Bear in mind, it's not just addresses that follow the law; other properties of a data feed may too. A deviation from Benford's Law doesn't tell you which particular items are wrong, but you do get a clue about which category; for example, you might discover items starting with a 2 are too frequent. This is a special case of using the deviation of real data from an expected distribution as an error detection mechanism - a very useful data quality assurance method everyone should be using.
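Here's a rough Python sketch of what such a data feed check might look like. Everything specific here is an assumption of mine: the function names, the two synthetic feeds, the choice of "largest gap in any digit's frequency" as the measure of deviation, and the alert threshold of 0.05. In practice you'd tune the measure and the threshold to your own data.

```python
import math
import random
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    return int(f"{abs(x):.15e}"[0])  # scientific notation starts with that digit


def benford_deviation(values):
    """Largest gap between observed and Benford first-digit frequencies."""
    digits = [first_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return max(abs(counts[d] / len(digits) - BENFORD[d]) for d in range(1, 10))


# Synthetic feeds, for illustration only.
rng = random.Random(42)
good_feed = [10 ** rng.uniform(0, 4) for _ in range(20_000)]  # spans four orders of magnitude
bad_feed = [rng.randint(100, 999) for _ in range(20_000)]     # first digits equally likely

THRESHOLD = 0.05  # hypothetical alert level
for name, feed in (("good", good_feed), ("bad", bad_feed)):
    deviation = benford_deviation(feed)
    status = "ALERT" if deviation > THRESHOLD else "ok"
    print(f"{name} feed: deviation {deviation:.3f} -> {status}")
```

To find out which category is off, you could report the digit with the largest gap rather than just the size of the gap.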

To truly understand Benford’s Law, you’ll need to dig deeply into statistics and possibly number theory, but using it is relatively straightforward. You should be aware it exists and know its limitations - especially if you’re looking for fraud.

# References

[Collins] J. Carlton Collins, “Using Excel and Benford’s Law to detect fraud”, https://www.journalofaccountancy.com/issues/2017/apr/excel-and-benfords-law-to-detect-fraud.html
[Hill] T. P. Hill, “The First Digit Phenomenon”, American Scientist, Volume 86, 358-363, 1998.
[Linville] Mark Linville, “The Problem Of False Negative Results In The Use Of Digit Analysis”, The Journal of Applied Business Research, Volume 24, Number 1

Wikipedia article https://en.wikipedia.org/wiki/Benford%27s_law
Mathworld article http://mathworld.wolfram.com/BenfordsLaw.html

## Saturday, February 22, 2020

### The Monty Hall Problem

Everyone thinks they understand probability, but every so often, something comes along that shows that maybe you don’t actually understand it at all. The Monty Hall problem is a great example of something that seems very counterintuitive and teaches us to be very wary of "common sense".

The problem got its fame from a 1990 column written by Marilyn vos Savant in Parade magazine. She posed the problem and provided the solution, but the solution seemed so counterintuitive that several math professors and many PhDs wrote to her saying she was incorrect. The discussion was so intense, it even reached the pages of the New York Times. But vos Savant was indeed correct.

(Monty Hall left (1976) - image credit: ABC Television - source Wikimedia Commons, no known copyright, Marilyn vos Savant right (2017) - image credit: Nathan Hill via Wikimedia Commons - Creative Commons License.  Note: the reason why the photos are from different years/ages is the availability of open-source images.)

The problem is loosely based on a real person and a real quiz show. In the US, there’s a long-running quiz show called ‘Let’s make a deal’, and its host for many years was Monty Hall, in whose honor the problem is named. Monty Hall was aware of the fame of the problem and had some interesting things to say about it.

Vos Savant posed the Monty Hall problem in this form:

• A quiz show host shows a contestant three doors. Behind two of them is a goat and behind one of them is a car. The goal is to win the car.
• The host asks the contestant to choose a door, but not open it.
• Once the contestant has chosen a door, the host opens one of the other doors and shows the contestant a goat. The contestant now knows that there’s a goat behind that door, but he or she doesn’t know which of the other two doors the car’s behind.
• Here’s the key question: the host asks the contestant "do you want to change doors?".
• Once the contestant has decided whether to switch or not, the host opens the contestant's chosen door and the contestant wins the car or a goat.
• Should the contestant change doors when asked by the host? Why?

What do you think the probability of winning is if the contestant does not change doors? What do you think the probability of winning is if they do?

Here are the results.

• If the contestant sticks with their choice, they have a ⅓ chance of winning.
• If the contestant changes doors, they have a ⅔ chance of winning.

What?

This is probably not what you expected, so let’s investigate what’s going on.

I’m going to start with a really simple version of the game. The host shows me three doors and asks me to choose one. There’s a ⅓ probability of the car being behind my door and ⅔ probability of the car being behind the other two doors.

Now, let’s add in the host opening one of the other doors I haven’t chosen, showing me a goat, and asking me if I want to change doors. If I don’t change doors, the probability of me winning is ⅓ because I haven’t taken into account the extra information the host has given me.

What happens if I change my strategy? When I made my initial choice of doors, there was a ⅔ probability the car was behind one of the other two doors. That can't change. Whatever happens, there are still three doors and the car must be behind one of them. There’s a ⅔ probability that the car is behind one of the other two doors.

Here’s where the magic happens. The host knows where the car is and always opens a door with a goat behind it. When the host opens a door and shows me a goat, there’s now a 0 probability that the car’s behind that door. But there was a ⅔ probability the car was behind one of the other two doors before, so this must mean there’s a ⅔ probability the car is behind the remaining door!

There are more formal proofs of the correctness of this solution, but I won’t go into them here. For those of you into Bayes' theorem, there’s a really nice formal proof.
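For the curious, here is a short sketch of how the Bayes calculation goes. The notation is my own for this sketch: I pick door 1, $$C_i$$ means the car is behind door i (each with prior probability ⅓), and $$H_3$$ means the host opens door 3. I'm assuming the host always opens a goat door and picks at random when there are two goat doors to choose from.

$P(H_3 | C_1) = \frac{1}{2}, \quad P(H_3 | C_2) = 1, \quad P(H_3 | C_3) = 0$

$P(H_3) = \frac{1}{3} \cdot \frac{1}{2} + \frac{1}{3} \cdot 1 + \frac{1}{3} \cdot 0 = \frac{1}{2}$

$P(C_2 | H_3) = \frac{P(H_3 | C_2) P(C_2)}{P(H_3)} = \frac{1 \cdot \frac{1}{3}}{\frac{1}{2}} = \frac{2}{3}$

$P(C_1 | H_3) = \frac{P(H_3 | C_1) P(C_1)}{P(H_3)} = \frac{\frac{1}{2} \cdot \frac{1}{3}}{\frac{1}{2}} = \frac{1}{3}$

Under these assumptions, sticking with door 1 wins with probability ⅓ and switching to door 2 wins with probability ⅔, which matches the results above.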

I know some of you are probably completely unconvinced. I was at first too. Years ago, I wrote a simulator and did 1,000,000 simulations of the game. Guess what? Sticking gave a ⅓ probability and changing gave a ⅔ probability. You don’t even have to write a simulator anymore; there are many websites offering simulations of the game so you can try different strategies.
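If you do want to try it yourself, a simulator along these lines only takes a few lines of Python. This is a minimal sketch, not the simulator mentioned above: the function names `play` and `win_rate`, the seed, and the number of trials are all my own choices for illustration, and it assumes the host picks at random when both unopened doors hide goats.

```python
import random


def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)    # the car is placed at random
    pick = rng.choice(doors)   # the contestant's first choice
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the first pick nor the opened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning for a stick or switch strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


print("stick :", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

With enough trials, the stick rate should settle near ⅓ and the switch rate near ⅔.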

If you want to investigate the problem in-depth, read Rosenhouse's book. It's 174 pages on this problem alone, covering the media furor, basic probability theory, Bayes' theorem, and several variations of the game. It pretty much beats the problem to death.

The Monty Hall problem is a fun problem, but it does serve to illustrate a more serious point. Probability theory is often much more complex than it first appears and the truth can be counter-intuitive. The problem teaches us humility. If you’re making business decisions based on multiple probabilities, are you sure you’ve correctly worked out the odds?

# References

• The Wikipedia article on the Monty Hall problem is a great place to start.
• New York Times article about the 1990 furor with some background on the problem.
• Washington Post article on the problem.
• 'The Monty Hall Problem', Jason Rosenhouse - is an entire book on various aspects of the problem. It's 174 pages long but still doesn't go into some aspects of it (e.g. the quantum variation).

# Are the books right about coin tossing?

Almost every probability book and course starts with simple coin-tossing examples, but how do we know that the books are right? Has anyone tossed coins several thousand times to see what happens? Does coin-tossing actually have any relevance to business? (Spoiler alert: yes it does.) Coin tossing is boring, time-consuming, and badly paid, so there are two groups of people ideally suited to do it: prisoners and students.

(A Janus coin. Image credit: Wikimedia Commons. Public domain.)

# Prisoner of war

John Kerrich was an English/South African mathematician who went to visit in-laws in Copenhagen, Denmark. Unfortunately, he was there in April 1940 when the Nazis invaded. He was promptly rounded up as an enemy national and spent the next five years in an internment camp in Jutland. Being a mathematician, he used the time well and conducted a series of probability experiments that he published after the War [Kerrich]. One of these experiments was tossing a coin 10,000 times. The results of the first 2,000 coin tosses are easily available on Stack Overflow and elsewhere, but I've not been able to find all 10,000, except in outline form.

We’re going to look at the cumulative mean of Kerrich’s data. To get this, we’ll score a head as 1 and a tail as 0. The cumulative mean is the mean of all the scores we’ve seen so far; if after 100 tosses there are 55 heads then it’s 0.55 and so on. Of course, we expect it to go to 0.5 ‘in the long run’, but how long is the long run? Here’s a plot of Kerrich’s data for the first 2,000 tosses.

(Kerrich's coin-flipping data up to 2,000 tosses.)
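To make the cumulative mean concrete, here is a small sketch of the calculation. It uses simulated tosses rather than Kerrich's actual data, and the function name `cumulative_mean` and the seed are my own choices for illustration.

```python
import random
from itertools import accumulate


def cumulative_mean(scores):
    """Return the running mean after each toss (head = 1, tail = 0)."""
    running_totals = accumulate(scores)
    return [total / n for n, total in enumerate(running_totals, start=1)]


# Simulated fair-coin tosses, standing in for real data.
rng = random.Random(1940)
tosses = [rng.randint(0, 1) for _ in range(2000)]
means = cumulative_mean(tosses)

# Print the cumulative mean at a few points along the way.
for n in (10, 100, 1000, 2000):
    print(n, round(means[n - 1], 3))
```

Plotting `means` against the toss number gives the same kind of chart as the one above.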

I don’t have all of Kerrich’s tossing data for individual tosses, but I do have his cumulative mean results at different numbers of tosses, which I’ve reproduced below.

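| Number of tosses | Mean | Confidence interval (±) |
|---|---|---|
| 10 | 0.4 | 0.303 |
| 20 | 0.5 | 0.219 |
| 30 | 0.566 | 0.177 |
| 40 | 0.525 | 0.155 |
| 50 | 0.5 | 0.139 |
| 60 | 0.483 | 0.126 |
| 70 | 0.457 | 0.117 |
| 80 | 0.437 | 0.109 |
| 90 | 0.444 | 0.103 |
| 100 | 0.44 | 0.097 |
| 200 | 0.49 | 0.069 |
| 300 | 0.486 | 0.057 |
| 400 | 0.498 | 0.049 |
| 500 | 0.51 | 0.044 |
| 600 | 0.52 | 0.040 |
| 700 | 0.526 | 0.037 |
| 800 | 0.516 | 0.035 |
| 900 | 0.509 | 0.033 |
| 1090 | 0.461 | 0.030 |
| 2000 | 0.507 | 0.022 |
| 3000 | 0.503 | 0.018 |
| 4000 | 0.507 | 0.015 |
| 5000 | 0.507 | 0.014 |
| 6000 | 0.502 | 0.013 |
| 7000 | 0.502 | 0.012 |
| 8000 | 0.504 | 0.011 |
| 9000 | 0.504 | 0.010 |
| 10000 | 0.5067 | 0.009 |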

Do you find something surprising in these results? There are at least two things I constantly need to remind myself of when I’m analyzing A/B test results, and simple coin-tossing serves as a good wake-up call.

The first piece is how many tosses you need to do to get reliable results. I won’t go into probability theory too much here, but suffice to say, we usually quote a range, called the confidence interval, to describe our level of certainty in a result. So a statistician won’t say 0.5, they’d say 0.5 +/- 0.04. You can unpack this to mean “I don’t know the number exactly, but I’m 95% sure it lies in the range 0.46 to 0.54”. It’s quite easy to calculate a confidence interval for an unbiased coin for different numbers of tosses. I've put the confidence interval in the table above.
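Here is a rough sketch of one common way to calculate a 95% confidence interval for a proportion. It uses the normal approximation with z = 1.96, which is my assumption; I haven't stated which formula produced the table, so treat the match as approximate. The function name `ci_half_width` is made up for this example.

```python
import math


def ci_half_width(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval.

    n is the number of tosses, p is the proportion of heads
    (0.5 for an unbiased coin), and z = 1.96 corresponds to 95%.
    """
    return z * math.sqrt(p * (1 - p) / n)


# How the +/- range shrinks as the number of tosses grows.
for n in (10, 100, 600, 10_000):
    print(n, round(ci_half_width(n), 3))
```

For an unbiased coin, it takes about 600 tosses to get down to the ± 0.04 used in the example above, and the range only shrinks with the square root of the number of tosses, so halving it takes four times as many tosses.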

The second piece is the structure of the results. Naively, you might have thought the cumulative mean would smoothly approach 0.5, but it doesn’t. The chart above shows a ‘blip’ around 100 where the results seem to change, and this kind of ‘blip’ happens very often in simulation results.

Both of these pieces have huge implications. A/B tests are similar in some ways to coin tosses. The ‘blip’ reminds us we could call a result too soon, and the number of tosses needed reminds us that we need to carefully calculate the expected duration of a test. In other words, we need to know what we're doing and we need to interpret results correctly.

# Students

In 2009, two Berkeley undergraduates, Priscilla Ku and Janet Larwood, tossed a coin 20,000 times each and recorded the results. It took them about one hour a day for a semester. You can read about their experiment here. I've plotted their results on the chart below.

The results show a similar pattern to Kerrich’s. There’s a ‘blip’ in Priscilla's results, but the cumulative mean does tend to 0.5 in the ‘long run’ for both Janet and Priscilla.

These two are the most quoted coin-tossing results you see on the internet, but in textbooks, Kerrich’s story is told more because it’s so colorful. However, others have spent serious time tossing coins and recording the results; they’re less famous because they only quoted the final number and didn’t give the entire dataset. In 1900, Karl Pearson reported the results of tossing a coin 24,000 times (12,012 heads), which followed on from the results of Count Buffon who tossed a coin 4,040 times (2,048 heads).

# Derren Brown

I can’t leave the subject of coin tossing without mentioning Derren Brown, the English mentalist. Have a look at this YouTube video where he flips an unbiased coin heads ten times in a row. It’s all one take and there’s no trickery. Have a think about how he might have done it.

Got your ideas? Here’s how he did it; the old-fashioned way. He recorded himself flipping coins until he got ten heads in a row. It took hours.
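To get a feel for why it took so long, here's a sketch of the arithmetic. For a fair coin with independent flips, the expected number of flips before you see n heads in a row is a standard result: 2^(n+1) - 2, which is 2,046 flips for ten heads. The function names and the small simulation (using a run of three heads to keep it quick) are my own additions for illustration.

```python
import random


def expected_flips(run_length):
    """Expected flips of a fair coin to first see run_length heads in a row."""
    return 2 ** (run_length + 1) - 2


def flips_until_run(run_length, rng):
    """Simulate flipping until run_length consecutive heads appear."""
    flips = 0
    streak = 0
    while streak < run_length:
        flips += 1
        if rng.random() < 0.5:   # heads
            streak += 1
        else:                    # tails resets the streak
            streak = 0
    return flips


# Check the formula against a simulation for a short run of three heads.
rng = random.Random(10)
trials = 20_000
average = sum(flips_until_run(3, rng) for _ in range(trials)) / trials
print("three heads: formula", expected_flips(3), "simulated", round(average, 2))

# The ten-heads case from the video.
print("ten heads: formula", expected_flips(10))
```

That's the average; any individual attempt could finish much sooner or take much longer.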

# But what if?

So far, all the experimental results are consistent with theory and I expect they always will be. I had a flight of fancy one day that there’s something new waiting for us out past 100,000 or 1,000,000 tosses - perhaps theory breaks down as we toss more and more. To find out if there is something there, all I need is a coin and some students or prisoners.

# More technical details

I’ve put some coin-tossing resources on my GitHub page under the coin-tossing section.

• Kerrich is the Kerrich data set out to 2,000 tosses in detail and out to 10,000 tosses in summary. The Python code kerrich.py displays the data in a friendly form.
• Berkeley is the Berkeley dataset. The Python code berkeley.py reads in the data and displays it in a friendly form. The file 40000tosses.xlsx is the Excel file containing the Berkeley data.
• coin-simulator is some Python code that shows multiple coin-tossing simulations. It's built as a Bokeh app, so you'll need to install the Bokeh module to use it.

# References

[Kerrich] “An Experimental Introduction to the Theory of Probability” - Kerrich’s 1946 monograph on his wartime exploration of probability in practice.

# Correlation is not causation

I’ve seen people make serious business mistakes and damage their careers because they’ve misunderstood one of the main rules of statistical evidence. The rule is simple but subtle: correlation is not causation. I’m going to explain what this means and show you cases where it’s obviously true, and some cases where it’s less obvious. Let’s start with some definitions.

Clearly, causation means one thing causes another. For example, prolonged exposure to ultraviolet light causes sunburn, the Vibrio cholerae bacterium causes cholera, and recessions cause bankruptcies.

# What is correlation?

Correlation occurs when two things vary in the same way. For example, lung cancer rates vary with the level of smoking, commuting times vary with the state of the economy, and health and longevity are correlated with income and wealth. The relationship usually becomes clear when we plot the data out, but it’s very rarely perfect. To give you a sense of what I mean, I’ve taken the relationship between brain mass and body mass in mammals and plotted the data below; each dot is a different type of mammal [Rogel-Salazar].

The straight line on the chart is a fit to the data. As you can see, there’s a relationship between brain and body mass but the dots are spread.

We measure how well two things are correlated with something called the correlation coefficient, r. The closer r is to 1 (or -1), the better the correlation (this is a gross simplification). I typically look for r to be 0.8 or higher (or -0.8 or lower). For the brain and body data above, r is 0.89, so the correlation is ‘good’.
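If you want to see what's behind r, here's a minimal sketch of the usual (Pearson) correlation coefficient in plain Python. The function name `pearson_r` and the little data set are invented for this example; they are not the brain and body data from the chart.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means...
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ...scaled by how much each one moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical data, made up for illustration only.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 60, 68, 71]
print(round(pearson_r(hours, score), 3))
```

A perfectly straight upward line gives r = 1, a perfectly straight downward line gives r = -1, and scattered data lands somewhere in between.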

For causation to exist, to say that A causes B, we must be able to observe the correlation between A and B. If sunscreen is effective at reducing sunburn we should observe increased sunscreen use leading to reduced sunburn. However, we need more than correlation to prove causation (I’m skipping over details to keep it simple).

# Correlation does not imply causation

Here’s the important bit: correlation does not imply causation. Just because two things are correlated does not imply that one causes the other. Two things could be very well correlated and there could be no causal relationship between them at all. There could be a confounding factor that causes both variables to move in the same way. In my view, misunderstanding this is the single biggest problem in data analysis.

The excellent website Spurious Correlations shows the problem in a fun way, and I’ve adapted an example from the website to illustrate my point. Here are two variables I've shown varying with time.

(Image credit: Spurious Correlations)

Imagine one of the variables was sales revenue and the other was the number of hours of sales effort. The correlation between them is very high (r=0.998). Would you say the amount of sales effort causes the sales revenue? If sales revenue was important to you, would you invest in more sales hours? If I presented this evidence to you in an executive meeting, what would you say?

Actually, I lied to you. The red line is US spending on science, space, and technology and the black line is suicides by hanging, strangulation, and suffocation. How can these things be related to each other? Because there’s some other variable or variables both of them depend on, or frankly, just by chance. Think for a minute about what happens as an economy grows: all kinds of expenditure go up, sales of expensive wine go up, and people spend more on their houses. Does that mean sales of expensive wine cause people to spend more on houses?

(The Spurious Correlations website has a whole bunch of other examples, including: the divorce rate in Maine correlated with per capita consumption of margarine, total revenue generated by arcades correlated with computer science doctorates awarded in the US, and letters in the winning word of the Scripps National Spelling Bee correlated with the number of people killed by venomous spiders.)
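You can see the confounding effect in a toy simulation. In this sketch, a growing 'economy' drives both wine sales and housing spend, and neither has any effect on the other. All the numbers, names, and the seed are invented for illustration; this is not real economic data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


rng = random.Random(42)
years = range(50)

# The confounder: a hypothetical economy index that grows steadily.
economy = [100 + 3 * t for t in years]

# Each series depends on the economy plus its own independent noise.
wine_noise = [rng.gauss(0, 3) for _ in years]
housing_noise = [rng.gauss(0, 10) for _ in years]
wine = [0.5 * e + n for e, n in zip(economy, wine_noise)]
housing = [2.0 * e + n for e, n in zip(economy, housing_noise)]

# The two series look strongly related...
r_series = pearson_r(wine, housing)
# ...but the parts that are not driven by the economy are unrelated.
r_noise = pearson_r(wine_noise, housing_noise)

print("wine vs housing:", round(r_series, 3))
print("noise vs noise :", round(r_noise, 3))
```

The two series come out very highly correlated even though, by construction, wine sales have no influence on housing spend. Once the shared dependence on the economy is taken out, what's left is just unrelated noise.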

The chart below shows the relationship between stork pairs and human births for several European locations 1980-1990 [Matthews]. Note r is high at 0.85.

Is this evidence that storks deliver babies? No. Remember correlation is not causation. There could well be many confounding variables here, for example, economic growth leading to more leisure time. Just because we don’t know what the confounding factors are doesn’t mean they don’t exist.

My other (possibly apocryphal) example concerns lice. In Europe in the Middle Ages, lice were considered beneficial (especially for children) because sick people didn’t have as many lice [Zinsser]. Technically, this type of causation mistake is known as the post hoc ergo propter hoc fallacy if you want to look it up.

# Correlation/causation offenders

The causation/correlation problem often rears its ugly head in sales and marketing. Here are two examples I’ve seen, with the details disguised to protect the guilty.

I’ve seen a business analyst present the results of detailed sales data modeling and make recommendations for change based on the correlation/causation confusion. The sales data set was huge and they’d found a large number of correlations in the data (with good r values). They concluded that these correlations were causation. For example, in area X, sales scaled with the number of sales reps, so they concluded that more reps = more sales. They made a series of recommendations based on their findings. Unfortunately, most of the relationships they found were spurious and most of their recommendations and forecasts were later found to be wrong. The problem was, there were other factors at play that they hadn’t accounted for. It doesn’t matter how complicated the model or how many hours someone has put in, the same rule applies: correlation does not imply causation.

The biggest career blunder I saw was a marketing person claiming that visits to the company website were driving all company revenue. I remember them talking about the correlation and making the causation claim to get more resources for their group. Unfortunately, later on, revenue went down for reasons (genuinely) unrelated to the website. The website wasn’t driving all revenue - it was just one of a number of factors, including the economy and the product. However, their claim to be driving all revenue wasn’t forgotten by the executive team and the marketing person paid the career price.

Here’s what I think you should take away from all this. Just because two things appear to be correlated doesn’t mean there’s causation. In business, we have to make decisions on the basis of limited evidence and that’s OK. What’s not OK is to believe there’s evidence when there isn’t - specifically to infer causation from correlation. Statistics and experience teach us humility. The UK Highway Code has some good advice here: a green light doesn’t mean go, it means ‘proceed with caution’.