Showing posts with label bayes. Show all posts

# What is Bayes' theorem and why is it so important?

Bayes' theorem is one of the key ideas of modern data science; it's enabling more accurate forecasting, it's leading to shorter A/B tests, and it's fundamentally changing statistical practices. In the last twenty years, Bayes' theorem has gone from being a cute probability idea to becoming central to many disciplines. Despite its huge impact, it's a simple statement of probabilities: what is the probability of an event occurring given some other event has occurred? How can something almost trivial be so revolutionary? Why all this change now? In this blog post, I'm going to give you a brief introduction to Bayes' theorem and show you why it's so powerful.

(Bayes theorem. Source: Wikimedia Commons. Author: Matt Buck. License: Creative Commons.)

# A disease example without explicitly using Bayes' theorem

To get going, I want to give you a motivating example that shows you the need for Bayes' theorem. I'm using this problem to introduce the language we'll need. I'll be using basic probability theory to solve this problem and you can find all the theory you need in my previous blog post on probability. This example is adapted from Wayne W. LaMorte's page at BU; he has some great material on probability and it's well worth your time browsing his pages.

Imagine there's a town of 10,000 people. 1% of the town's population has a disease. Fortunately, there's a very good test for the disease:

• If you have the disease, the test will give a positive result 99% of the time (sensitivity).
• If you don't have the disease, the test will give a negative result 99% of the time (specificity).

You go into the clinic one day and take the test. You get a positive result. What's the probability you have the disease? Before you go on, think about your answer and the why behind it.

• D+ and D- represent having the disease and not having the disease
• T+ and T- represent testing positive and testing negative
• P(D+) represents the probability of having the disease (with similar meanings for P(D-), P(T+), P(T-))
• P(T+ | D+) is the probability of testing positive given that you have the disease.

We can write out what we know so far:

• P(D+) = 0.01
• P(T+ | D+) = 0.99
• P(T- | D-) = 0.99

We want to know P(D+ | T+). I'm going to build a decision tree to calculate what I need.

There are 10,000 people in the town, and 1% of them have the disease. We can draw this in a tree diagram like so.

For each of the branches, D+ and D-, we can draw branches that show the test results T+ and T-:

For example, we know 100 people have the disease, of whom 99% will test positive, which means 1% will test negative. Similarly, for those who do not have the disease, (9,900), 99% will test negative (9,801), and 1% will test positive (99).

Out of the 198 people who tested positive for the disease (P(T+) = P(T+ | D+)P(D+) + P(T+ | D-)P(D-) = 0.0198, or 198 people in 10,000), 99 people have it, so P(D+ | T+) = 99/198. In other words, if I test positive for the disease, I have a 50% chance of actually having it.
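The head counts in the tree can be reproduced in a few lines of Python. This is only a sketch of the arithmetic described above; the variable names are my own, and the numbers are the ones from the example.

```python
# Head-count version of the decision tree (numbers taken from the text).
population = 10_000
prevalence = 0.01    # P(D+)
sensitivity = 0.99   # P(T+ | D+)
specificity = 0.99   # P(T- | D-)

diseased = round(population * prevalence)         # people with the disease
healthy = population - diseased                   # people without it

true_positives = round(diseased * sensitivity)    # D+ and T+
false_negatives = diseased - true_positives       # D+ and T-
true_negatives = round(healthy * specificity)     # D- and T-
false_positives = healthy - true_negatives        # D- and T+

all_positives = true_positives + false_positives  # everyone who tests positive
p_disease_given_positive = true_positives / all_positives

print(all_positives, p_disease_given_positive)
```

The false positives come from the large healthy group, which is why they match the true positives even though the test is 99% accurate.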

There are two takeaways from all of this:

• Wow! Really, only a 50% probability! I thought it would be much higher! (This is called the base rate fallacy).
• This is a really tedious process and probably doesn't scale. Can we do better? (Yes: Bayes' theorem.)

# Who was Bayes?

Thomas Bayes (1702-1761) was an English non-conformist minister (meaning a Protestant minister not part of the established Church of England). His religious duties left him time for mathematical exploration, which he did for his own pleasure and amusement; he never published in his lifetime in his own name. After his death, his friend and executor, Richard Price, went through his papers and found an interesting result, which we now call Bayes' theorem. Price presented it at the Royal Society and the result was shared with the mathematical community.

(Plaque commemorating Thomas Bayes. Source: Wikimedia Commons Author:Simon Harriyott License: Creative Commons.)

For those of you who live in London, or visit London, you can visit the Thomas Bayes memorial in the historic Bunhill Fields burial ground where Bayes is buried. For the true probability pilgrim, it might also be worth visiting Richard Price's grave, which is only a short distance away.

# Bayes' theorem

The derivation of Bayes' theorem is almost trivial. From basic probability theory:

$P(A \cap B) = P(A) P(B | A)$
$P(A \cap B) = P(B \cap A)$

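Applying the first rule with A and B swapped gives $$P(B \cap A) = P(B) P(A | B)$$. Equating the two expressions and re-arranging, we get the famous theorem: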

$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$

Although this is the most compact version of the theorem, it's more usefully written as:

$P(A | B) = \frac{P(B | A) P(A)}{P(B \cap A) + P(B \cap \bar A)} = \frac{P(B | A)P(A)}{P(B | A)P(A) + P(B | \bar A) P( \bar A)}$

where $$\bar A$$ means not A (remember $$1 = P(A) + P(\bar A)$$). You can get this second form of Bayes using the law of total probability and the multiplication rule (see my previous blog post).

So what does it all mean and why is there so much excitement over something so trivial?

# What does Bayes' theorem mean?

The core idea of Bayesian statistics is that we update our prior beliefs as new data becomes available - we go from the prior to the posterior. This process is often iterative and is called the diachronic interpretation of Bayes theorem. It usually requires some computation; something that's reasonable to do given today's computing power and the free availability of numeric computing languages. This form of Bayes is often written:

$P(H | D) = \frac{P(D | H) P(H)}{P(D)}$

with these definitions:

• P(H) - the probability of the hypothesis before the new data - often called the prior
• P(H | D) - the probability of the hypothesis after the data - the posterior
• P(D | H) - the probability of the data under the hypothesis, the likelihood
• P(D) - the probability of the data, it's called the normalizing constant

A good example of the use of Bayes' theorem is to better quantify the health risk an individual faces from a disease. Let's say the risk of suffering a heart attack in any year is P(HA); this is the risk for the population as a whole (the prior). If someone smokes, the probability becomes P(HA | S), the posterior, which may be considerably different from P(HA).

Let's use some examples to figure out how Bayes works in practice.

# The disease example using Bayes

Let's start from this version of Bayes:

$P(A | B) = \frac{P(B | A)P(A)}{P(B | A)P(A) + P(B | \bar A) P( \bar A)}$

and use the notation from our disease example:

$P(D+ | T+) = \frac{P(T+ | D+)P(D+)}{P(T+ | D+)P(D+) + P(T+ | D-) P( D-)}$

Here's what we know from our previous disease example:

• P(D+) = 0.01 and by implication P(D-) = 0.99
• P(T+ | D+) = 0.99
• P(T- | D-) = 0.99 and by implication P(T+ | D-) = 0.01

Plugging in the numbers:

$P(D+ | T+) = \frac{0.99\times0.01}{0.99\times0.01 + 0.01\times0.99} = 0.5$

The decision tree is easier for a human to understand, but if there are a large number of conditions, it becomes much harder to use. For a computer on the other hand, the Bayes solution is straightforward to code and it's expandable for a large number of conditions.
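To show how little code the Bayes version needs, here's a minimal sketch. The function name `posterior` and its parameter names are my own choices for illustration; the formula is the two-hypothesis form of Bayes' theorem used above.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Two-hypothesis Bayes' theorem: returns P(H | D).

    prior               -- P(H)
    p_data_given_h      -- P(D | H)
    p_data_given_not_h  -- P(D | not H)
    """
    numerator = p_data_given_h * prior
    evidence = numerator + p_data_given_not_h * (1 - prior)  # P(D)
    return numerator / evidence


# Disease example: H is D+, D is T+.
p_disease_given_positive = posterior(
    prior=0.01,               # P(D+)
    p_data_given_h=0.99,      # P(T+ | D+), the sensitivity
    p_data_given_not_h=0.01,  # P(T+ | D-), one minus the specificity
)
print(p_disease_given_positive)
```

Changing the prior is a quick way to see the base rate at work: the same test applied to a population with a higher prevalence gives a much higher posterior.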

# Predicting US presidential election results

• To predict a winner, you need to model the electoral college, which implies a state-by-state forecast.
• For each state, you know who won last time, so you have a prior in the Bayesian sense.
• In competitive states, there are a number of opinion polls that provide evidence of voter intention; this is the data in Bayes-speak.

In practice, you start with a state-by-state prior based on previous elections, fundamentals, or something else. As opinion polls are published, you calculate a posterior probability for each of the parties to win the state election. Of course, you do this with Bayes' theorem. As more polls come in, you update your model and the influence of your prior becomes less and less. In some versions of this type of modeling work, models take into account national polling trends too.
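To make the prior-to-posterior loop concrete, here's a toy sketch for a single state with two candidates. It is not the model from any published forecast: the grid approximation, the bell-shaped prior centered on a 48% share, and the three polls are all invented for illustration, and each poll is treated as a simple random sample.

```python
import math

# Possible two-party vote shares for candidate A in one state (0.1% steps).
grid = [i / 1000 for i in range(1, 1000)]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def update(prior, votes_for_a, sample_size):
    """One Bayes update: posterior is proportional to likelihood x prior."""
    log_post = [
        math.log(p)
        + votes_for_a * math.log(theta)
        + (sample_size - votes_for_a) * math.log(1 - theta)
        if p > 0 else -math.inf
        for theta, p in zip(grid, prior)
    ]
    peak = max(log_post)  # subtract the peak to avoid numerical underflow
    return normalize([math.exp(lp - peak) for lp in log_post])


def win_probability(dist):
    """Probability that candidate A's share is above 50%."""
    return sum(p for theta, p in zip(grid, dist) if theta > 0.5)


# Hypothetical prior: A got about 48% last time, with some uncertainty.
prior = normalize([math.exp(-0.5 * ((t - 0.48) / 0.05) ** 2) for t in grid])

# Hypothetical polls: (respondents backing A, sample size).
polls = [(420, 800), (530, 1000), (390, 750)]

belief = prior
history = [win_probability(belief)]
for votes_for_a, sample_size in polls:
    belief = update(belief, votes_for_a, sample_size)
    history.append(win_probability(belief))

print([round(p, 3) for p in history])
```

With these made-up numbers, the prior leans against candidate A, but each poll pulls the win probability upward and the prior matters less with every update.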

The landmark paper describing this type of modeling is by Linzer.

# Using Bayes' theorem to prove the existence of God

Over history, there have been many attempts to prove the existence of God using scientific or mathematical methods. All of them have foundered for one reason or another. Interestingly, one of the first uses of Bayes' theorem was to try and prove the existence of God by proving miracles can happen. The argument was put forward by Richard Price himself. I'm going to repeat his analysis using modern notation, based on an explanation from Cornell University.

Price's argument is based on tides. We expect tides to happen every day, but if a tide doesn't happen, that would be a miracle. If T is the consistency of tides, and M is a miracle (no tide), then we can use Bayes' theorem as:

$P(M | T) = \frac{P(T | M) P(M)}{P(T | M) P(M) + P(T | \bar M) P(\bar M)}$

Price assumed the probability of miracles existing was the same as the probability of miracles not existing (!), so $$P(M) = P(\bar M)$$. If we plug this into the equation above and simplify, we get:

$P(M | T) = \frac{P(T | M)}{P(T | M) + P(T | \bar M)}$

He further assumed that if miracles exist, they would be very rare (or we would see them all the time), so:

$P(T | \bar M) \gg P(T | M)$

He further assumed that $$P(T | M) = 10^{-6}$$ - in other words, if a miracle exists, it would happen 1 time in 1 million. He also assumed that if there were no miracles, tides would always happen, so $$P(T | \bar M) = 1$$. The upshot of all this is that:

$P(M | T) = \frac{10^{-6}}{10^{-6} + 1} \approx 0.000001$

or, there's a 1 in a million chance of a miracle happening.

There are more holes in this argument than in a teabag, but it is an interesting use of Bayes' theorem and does give you some indication of how it might be used to solve other problems.

# Monty Hall and Bayes

The Monty Hall problem has tripped people up for decades (see my previous post on the problem). Using Bayes' theorem, we can rigorously solve it.

Here's the problem. You're on a game show hosted by Monty Hall and your goal is to win the car. He shows you three doors and asks you to choose one. Behind two of the doors are goats and behind one of the doors is a car. Once you've chosen your door, Monty opens one of the other doors to show you what's behind it. He always chooses a door with a goat behind it. Next, he asks you the key question: "do you want to change doors?". Should you change doors and why?

I'm going to use the diachronic interpretation of Bayes' theorem to figure out what happens if we don't change:

$P(H | D) = \frac{P(D | H) P(H)}{P(D)} = \frac{P(D | H) P(H)}{P(D | H)P(H) + P(D | \bar H) P( \bar H)}$

• $$P(H)$$ is the probability our initial choice of door has a car behind it, which is $$\frac{1}{3}$$.
• $$P( \bar H) = 1- P(H) = \frac{2}{3}$$
• $$P(D | H) = 1$$ this is the probability Monty will show me a door with a goat given that I have chosen the door with a car - it's always 1 because Monty always shows me a door with a goat.
• $$P(D | \bar H) = 1$$ this is the probability Monty will show me a door with a goat given that I have chosen a door with a goat - it's always 1 because Monty always shows me a door with a goat.

Plugging these numbers in:

$P(H | D) = \frac{1 \times \frac{1}{3}}{1 \times \frac{1}{3} + 1 \times \frac{2}{3}} = \frac{1}{3}$

If we don't change, then the probability of winning is the same as if Monty hadn't opened the other door. But only two unopened doors remain, and $$P(\bar H) + P(H) = 1$$. In turn, this means our winning probability if we switch is $$\frac{2}{3}$$, so our best strategy is switching.
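The same answer falls out of a door-by-door version of the calculation. In this sketch, I assume you picked door 1 and Monty opened door 3, and I add the standard assumption (not stated above) that Monty chooses at random when both of the other doors hide goats. The door numbering and variable names are illustrative.

```python
from fractions import Fraction

# Hypotheses: the car is behind door 1, 2, or 3. You picked door 1.
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Likelihood of the data "Monty opens door 3" under each hypothesis.
likelihood = {
    1: Fraction(1, 2),  # car behind your door: Monty picks door 2 or 3 at random
    2: Fraction(1),     # car behind door 2: Monty must open door 3
    3: Fraction(0),     # car behind door 3: Monty never reveals the car
}

evidence = sum(prior[door] * likelihood[door] for door in prior)  # P(D)
posterior = {door: prior[door] * likelihood[door] / evidence for door in prior}

print(posterior)
```

Sticking with door 1 keeps its original probability, while all the remaining probability moves to door 2, the door you would switch to.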

# Searching for crashed planes and shipwrecks

On 1st June 2009, Air France Flight AF 447 crashed into the Atlantic. Although the flight had been tracked, the underwater search for the plane was complex. The initial search used Bayesian inference to try and locate where on the ocean floor the plane might be. It used data from previous crashes that assumed the underwater locator beacon was working. Sadly, the initial search didn't find the plane.

In 2011, a new team re-examined the data, with two crucial differences. Firstly, they had data from the first search, and secondly, they assumed the underwater locator beacon had failed. Again using Bayesian inference, they pointed to an area of ocean that had already been searched. The ocean was searched again (with the assumption the underwater beacon had failed), and this time the plane was found.
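To see why a previously searched area can still be the best place to look, here's a toy sketch of the Bayes update after an unsuccessful search. It is not the model the search teams used: the four-cell map, the prior probabilities, and the detection probabilities are all invented for illustration.

```python
def update_after_failed_search(prior, searched_cell, p_detect):
    """Posterior over cells after searching one cell and finding nothing.

    prior         -- list of P(wreck is in cell i), summing to 1
    searched_cell -- index of the cell that was searched
    p_detect      -- P(find the wreck | wreck is in the searched cell)
    """
    # P(search fails) = 1 - P(wreck is in that cell and we detect it)
    p_fail = 1 - prior[searched_cell] * p_detect
    posterior = []
    for cell, p in enumerate(prior):
        # A failed search is certain if the wreck is elsewhere.
        likelihood = (1 - p_detect) if cell == searched_cell else 1.0
        posterior.append(p * likelihood / p_fail)
    return posterior


prior = [0.1, 0.5, 0.3, 0.1]  # hypothetical map; cell 1 is the favorite

# If the search was very likely to succeed, a failure moves belief elsewhere.
good_sensor = update_after_failed_search(prior, searched_cell=1, p_detect=0.8)

# If detection was unlikely (say the beacon had failed), cell 1 stays on top.
poor_sensor = update_after_failed_search(prior, searched_cell=1, p_detect=0.1)

print([round(p, 3) for p in good_sensor])
print([round(p, 3) for p in poor_sensor])
```

The second case mirrors the story above: once you assume the beacon had failed, an empty-handed first search says much less about where the plane isn't.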

You can read more about this story in the MIT Technology Review and for more in-depth details, you can read the paper by the team that did the analysis.

It turns out, there's quite a long history of analysts using Bayes theorem to locate missing ships. In this 1971 paper, Richardson and Stone show how it was used to locate the wreckage of the USS Scorpion. Since then, a number of high-profile wrecks have been located using similar methods.

Sadly, even Bayes' theorem hasn't led to anyone finding flight MH370.

# Other examples of Bayes' theorem

Bayes has been applied in many, many disciplines. I'm not going to give you an exhaustive list, but I will give you some of the more 'fun' ones.

# Why now?

Using Bayes' theorem can involve a lot of fairly tedious arithmetic, and if the problem requires many iterations, the calculations multiply. This held up the adoption of Bayesian methods until three things happened:

• Cheap computing.
• The free and easy availability of mathematical computing languages.
• Widespread skill to program in these languages.

By the late 1980s, computing power was sufficiently cheap to make Bayesian methods viable, and of course, computing has only gotten cheaper since then. Good quality mathematical languages were available by the late 1980s too (e.g. Fortran, MATLAB), but by the 2010s, Python and R had all the necessary functionality and were freely and easily available. Both Python and R usage had been growing for a while, but by the 2010s, there was a very large pool of people who were fluent in them.

As they say in murder mysteries, by the 2010s, Bayesian methods had the means, the motive, and the opportunity.

# Bayes and the remaking of statistics

Traditional (non-Bayesian) statistics is usually called frequentist statistics. It has a long history and has been very successful, but it has problems. In the last 50 years, Bayesian analysis has become more successful and is now challenging frequentist statistics.

I'm not going to provide an in-depth critique of frequentist statistics here, but I will give you a high-level summary of some of the problems.

• p-values and significance levels are prone to misunderstandings - and the choice of significance levels is arbitrary
• Much of the language surrounding statistical tests is complex and rests on convention rather than underlying theory
• The null hypothesis test is frequently misunderstood and misinterpreted
• Prior information is mostly ignored.

Bayesian methods help put statistics on a firmer intellectual foundation, but the price is changing well-understood and working frequentist statistics. In my opinion, over the next twenty years, we'll see Bayesian methods filter down to undergraduate level and gradually replace the frequentist approach. But for right now, the frequentists rule.

# Conclusion

At its heart, Bayes' theorem is almost trivial, but it's come to represent a philosophy and approach to statistical analysis that modern computing has enabled; it's about updating your beliefs with new information. A welcome side-effect is that it's changing statistical practice and putting it on a firmer theoretical foundation. Widespread change to Bayesian methods will take time, however, especially because frequentist statistics are so successful.

# Why look back at basic probability?

Bayes' theorem lies at the heart of much of modern machine learning. Although it's relatively simple to understand, you do need some grounding in probability theory. This blog post is all about getting you up close and personal with probability theory so I can tell you all about Bayes in a later post.

(You can work out the probability aliens are on earth given that Elvis lives. Image source: Pixabay Author: Pete Linforth License: Pixabay.)

# The very basics

Think of some event that might occur in the future, say winning the lottery, buying a new car, or England winning the World Cup. We can estimate the probability of these events happening; we can call the event A and the probability of the event occurring P(A). If the event is certain to occur, then P(A) = 1; if it's certain not to occur, then P(A) = 0; and in all cases: 0 $$\leq$$ P(A) $$\leq$$ 1.

We'll consider the probability of several events I'm going to call A, B, C, etc. These can be any events at all, including aliens landing, Elvis making a comeback, or getting a pay raise at the end of the year.

# The complementary rule

If the probability of an event A occurring is P(A), the probability of it not occurring is $$1 - P(A)$$. This is called the complement and different authors use different notation for it:

$1 - P(A) = P(A^c) = P(A-) = P( \bar A) = P( \sim A)$

Let me give you an example using one notation. Imagine 1% of the population has a disease and 99% don't, then:

$1 = P(D+) + P(D-) = 0.01 + 0.99$

# Independence

Independence is a huge issue in probability modeling and it can lead to big errors if not handled correctly. On the face of it, it's a simple idea, but there are subtleties.

Two events are independent if one does not affect or influence the other in any way (alternatively, one event does not give any information about the other). For example, the odds of Joe Biden winning the 2020 Presidential election do not depend on the odds of New Zealand opening its borders to international travelers. Looking at things the other way, the odds of me winning the lottery are dependent on my purchasing a ticket (I have to buy a ticket to stand any chance of winning) - these are dependent events. I'm sure you can think of many other examples.

Independent and dependent events are treated very differently mathematically; the big mistake comes when events that are not independent are considered to be independent. For example, an organization might run many opinion polls in an election. The errors in the polls will not be independent of one another because the organization may well have a systemic bias that affects all their polls. There are similar problems in epidemiology; if you and I live together, my probability of catching an infectious disease is not independent of your probability of catching an infectious disease. The most famous example of confusing independent and dependent events was the subprime mortgage scandals of 2008 onwards. The analysts who developed the subprime mortgage default models assumed that mortgage defaults were independent of one another. Unfortunately for all of us, that wasn't the case in 2008. Economic conditions led to many defaults, which in turn led to broader financial problems, which in turn led to more defaults. In 2008 and onwards, subprime mortgage defaults were dependent on one another.

# Disjoint (mutually exclusive) events

Two events are disjoint if they're mutually exclusive, in other words, if they can't both happen. For example, only one of Joe Biden or Donald Trump can win the election - they can't both be President. In notation I'll explain later: $$P(A \ and \ B) = P(A \cap B) = 0$$.

# Probability A and B occurring (intersection) - the multiplication rule

What's the probability of A and B occurring (also known as their joint or conjoint probability)? Here's where we run into some notation issues. Some sources write 'and' and some use the symbol '$$\cap$$' - both mean the same thing.

Here's the general rule, which is the one to use for dependent events:

$P(A \ and \ B) = P(A \cap B) = P(A) P(B | A)$

Here's the rule for independent events:

$P(A \ and \ B) = P(A \cap B) = P(A) P(B)$

Here's the rule for disjoint events:

$P(A \ and \ B) = P(A \cap B) = 0$

The and relationship is commutative:

$P(A \cap B) = P(B \cap A)$

# Probability of A or B occurring (union) - the addition rule

What's the probability of A or B occurring? Some sources write 'or' and some write '$$\cup$$'. Here's the rule:
$P(A \ or \ B) = P(A \cup B) = P(A) + P(B) - P(A \cap B)$

$= P(A) + P(B) - P(A)P(B | A)$

The or relationship is commutative:

$P(A \cup B) = P(B \cup A)$

For disjoint events, the addition rule simplifies to:

$P(A \ or \ B) = P(A \cup B) = P(A) + P(B)$

because from before we have:

$P(A \cap B) = 0$
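A quick way to check the addition rule is to count outcomes directly. This sketch uses one roll of a fair six-sided die; the die and the particular events are my own example, not taken from the text above.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one roll of a fair six-sided die


def prob(event):
    """Probability of an event (a set of outcomes) by counting."""
    return Fraction(len(event & outcomes), len(outcomes))


even = {2, 4, 6}
over_three = {4, 5, 6}

# General addition rule: subtract the overlap so it isn't counted twice.
p_union_counted = prob(even | over_three)
p_union_by_rule = prob(even) + prob(over_three) - prob(even & over_three)

# Disjoint events: the overlap is empty, so the probabilities just add.
low = {1, 2}
high = {5, 6}
p_disjoint_counted = prob(low | high)
p_disjoint_by_rule = prob(low) + prob(high)

print(p_union_counted, p_union_by_rule)
print(p_disjoint_counted, p_disjoint_by_rule)
```

Sets map neatly onto the notation: `|` is union and `&` is intersection.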

# Conditional probability - the conditional rule

What's the probability I have a disease given I've tested positive for the disease? We use the | symbol to mean "given that", so P(A | B) means the probability of A happening given that B has occurred.  Here are some examples from everyday life:

• What's the probability I win the lottery given that I've bought a ticket?
• What's the probability I will get a degree if I go to college?
• What's the probability I will have an accident if I'm driving and if it's snowing and if it's dark?

The interesting thing about conditional probability is that it can be quite different from the 'raw' probability. For example, let's say you're from a poor family, you might only have a 10% chance of getting a degree, but if you get accepted to a college, the probability might shoot up to 50%, and if you actually go to college, the probability may get to 95%. The probability can change quite substantially depending on new information (as we'll see with Bayes' theorem).

The general rule is:

$P(A | B) = \frac{P(B \cap A)}{P(B)}$

If A and B are independent (A does not depend on B), then P(A | B) = P(A).
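Here's a small sketch of the conditional rule, again using one roll of a fair six-sided die as an illustrative example of my own. It shows one pair of dependent events and one pair of independent events.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one roll of a fair six-sided die


def prob(event):
    """Probability of an event (a set of outcomes) by counting."""
    return Fraction(len(event & outcomes), len(outcomes))


def conditional(a, b):
    """P(a | b) = P(a and b) / P(b)."""
    return prob(a & b) / prob(b)


even = {2, 4, 6}
over_three = {4, 5, 6}
at_most_four = {1, 2, 3, 4}

# Dependent: knowing the roll is over three changes the chance it's even.
p_even = prob(even)
p_even_given_over_three = conditional(even, over_three)

# Independent: knowing the roll is at most four tells us nothing about evenness.
p_even_given_at_most_four = conditional(even, at_most_four)

print(p_even, p_even_given_over_three, p_even_given_at_most_four)
```

Rearranging `conditional` gives the multiplication rule: `prob(a & b)` equals `prob(b) * conditional(a, b)`.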

# The law of total probability

There's a general form of this law and a more specific form. Because the specific form will be useful for Bayesian work later, we'll start with that.

$P(A+) = P(A+ \cap \ B+) + P(A+ \cap \ B-)$

In words, the probability of an event A+ occurring is the probability of A+ and B+ both occurring, plus the probability of A+ occurring and B+ not occurring (B-). This might be clearer if we remember $$1 = P(B+) + P(B-)$$ and we think of probabilities using a Venn diagram.

The more general form of this law, for a set of mutually exclusive events $$B_i$$ that together cover every possibility, is:

$P(A) = \sum_i{P(A \cap B_i)} = \sum_i{P(A | B_i)P(B_i)}$
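As a small worked sketch, here's the general form applied to a disease test. The numbers assume 1% of the population has the disease and the test is right 99% of the time in both groups; the dictionary layout is just one convenient way to write the sum.

```python
# P(B_i): the population split into mutually exclusive groups.
p_group = {"D+": 0.01, "D-": 0.99}

# P(A | B_i): probability of a positive test within each group.
p_positive_given_group = {"D+": 0.99, "D-": 0.01}

# Law of total probability: P(A) = sum over i of P(A | B_i) P(B_i)
p_positive = sum(p_positive_given_group[g] * p_group[g] for g in p_group)

print(round(p_positive, 4))
```

This sum is exactly the denominator that appears when Bayes' theorem is written out in full.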

# The law of total probability and conditional probabilities

One of the most useful forms of Bayes' theorem relies on the combination of the law of total probability and conditional probability. Here's the key relationship:

$1 = P(A | B) + P(\bar A | B)$

Let me put this into words. If event B happens, then either A or not A happens, there are no other options, so the two probabilities must sum to 1.

# What use is probability theory?

I grew up hearing about the value of 'common sense', but probability theory often gives results that seem very counterintuitive and 'common sense' can lead you wildly astray. A fun example is the Monty Hall problem, but there are lots of other examples in the real world where the probability of something happening is not what it appears to be at first - and they're not so fun. The counter-intuitive example you find most often on the internet is the probability that you have a disease given a positive test result; it's mostly not what you think.

Bayes' theorem takes us into the world of the counter-intuitive and I'll talk about Bayes in a future blog post.

## Saturday, February 22, 2020

### The Monty Hall Problem

Everyone thinks they understand probability, but every so often, something comes along that shows that maybe you don’t actually understand it at all. The Monty Hall problem is a great example of something that seems very counterintuitive and teaches us to be very wary of "common sense".

The problem got its fame from a 1990 column written by Marilyn vos Savant in Parade magazine. She posed the problem and provided the solution, but the solution seemed so counterintuitive that several math professors and many PhDs wrote to her saying she was incorrect. The discussion was so intense, it even reached the pages of the New York Times. But vos Savant was indeed correct.

(Monty Hall left (1976) - image credit: ABC Television - source Wikimedia Commons, no known copyright, Marilyn vos Savant right (2017) - image credit: Nathan Hill via Wikimedia Commons - Creative Commons License.  Note: the reason why the photos are from different years/ages is the availability of open-source images.)

The problem is loosely based on a real person and a real quiz show. In the US, there’s a long-running quiz show called ‘Let’s make a deal’, and its host for many years was Monty Hall, in whose honor the problem is named. Monty Hall was aware of the fame of the problem and had some interesting things to say about it.

Vos Savant posed the Monty Hall problem in this form:

• A quiz show host shows a contestant three doors. Behind two of them is a goat and behind one of them is a car. The goal is to win the car.
• The host asks the contestant to choose a door, but not open it.
• Once the contestant has chosen a door, the host opens one of the other doors and shows the contestant a goat. The contestant now knows that there’s a goat behind that door, but he or she doesn’t know which of the other two doors the car’s behind.
• Here’s the key question: the host asks the contestant "do you want to change doors?".
• Once the contestant has decided whether to switch or not, the host opens the contestant's chosen door and the contestant wins the car or a goat.
• Should the contestant change doors when asked by the host? Why?

What do you think the probability of winning is if the contestant does not change doors? What do you think the probability of winning is if they do?

Here are the results.

• If the contestant sticks with their choice, they have a ⅓ chance of winning.
• If the contestant changes doors, they have a ⅔ chance of winning.

What?

This is probably not what you expected, so let’s investigate what’s going on.

I’m going to start with a really simple version of the game. The host shows me three doors and asks me to choose one. There’s a ⅓ probability of the car being behind my door and ⅔ probability of the car being behind the other two doors.

Now, let’s add in the host opening one of the other doors I haven’t chosen, showing me a goat, and asking me if I want to change doors. If I don’t change doors, the probability of me winning is ⅓ because I haven’t taken into account the extra information the host has given me.

What happens if I change my strategy? When I made my initial choice of doors, there was a ⅔ probability the car was behind one of the other two doors. That can't change. Whatever happens, there are still three doors and the car must be behind one of them. There’s a ⅔ probability that the car is behind one of the two doors.

Here’s where the magic happens. When the host opens a door and shows me a goat, there’s now a 0 probability that the car’s behind that door. But there was a ⅔ probability the car was behind one of the two doors before, so this must mean there’s a ⅔ probability the car is behind the remaining door!

There are more formal proofs of the correctness of this solution, but I won’t go into them here. For those of you into Bayes theorem, there’s a really nice formal proof.

I know some of you are probably completely unconvinced. I was at first too. Years ago, I wrote a simulator and did 1,000,000 simulations of the game. Guess what? Sticking gave a ⅓ probability and changing gave a ⅔ probability. You don’t even have to write a simulator anymore, there are many websites offering simulations of the game so you can try different strategies.
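If you'd like to try this yourself, a simulator only takes a few lines. This is a minimal sketch rather than the original simulator mentioned above; the function names, the fixed random seed, and the choice of 100,000 games are my own.

```python
import random


def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, games=100_000, seed=0):
    """Fraction of games won with a fixed strategy (stick or switch)."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(games)) / games


stick = win_rate(switch=False)
swap = win_rate(switch=True)
print(f"stick: {stick:.3f}, switch: {swap:.3f}")
```

The two win rates should land close to ⅓ and ⅔, with small differences from run to run if you change the seed.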

If you want to investigate the problem in-depth, read Rosenhouse's book. It's 174 pages on this problem alone, covering the media furor, basic probability theory, Bayes theory, and various variations of the game. It pretty much beats the problem to death.

The Monty Hall problem is a fun problem, but it does serve to illustrate a more serious point. Probability theory is often much more complex than it first appears and the truth can be counter-intuitive. The problem teaches us humility. If you’re making business decisions on multiple probabilities, are you sure you’ve correctly worked out the odds?

# References

• The Wikipedia article on the Monty Hall problem is a great place to start.
• New York Times article about the 1990 furor with some background on the problem.
• Washington Post article on the problem.
• 'The Monty Hall Problem', Jason Rosenhouse - is an entire book on various aspects of the problem. It's 174 pages long but still doesn't go into some aspects of it (e.g. the quantum variation).