
Thursday, August 28, 2025

The sisters "paradox" - counter-intuitive probability

It seems simple, but it isn't

There are a couple of famous counter-intuitive problems in probability theory and the sisters "paradox" is one of them. I'll tell you the problem, let you guess the solution, and then give you some of the background.

Here's the problem: a family has two children. You're told that at least one of them is a girl. What's the probability both are girls?

(International Film Service / American Releasing Co., Public domain, via Wikimedia Commons)

Assume that the probability of having a girl or a boy is 50% and that birth order has no effect on the probabilities. Assume the family was selected at random from all two-child families that have at least one girl.

What do you think the probability is that both children are girls?

A simpler question

Let's imagine you're asked a simpler question.

A family has two children. What's the probability both are girls?

We can work this out using a simple probability tree:

          Boy (0.5)                          Girl (0.5)
         /         \                        /          \
  Boy-Boy (0.25)   Boy-Girl (0.25)   Girl-Boy (0.25)   Girl-Girl (0.25)

So the probability of two girls is 0.25.

Note there are two ways of having a boy and a girl, so the total probability of having a boy and a girl (in any order) is 0.5.

The wrong answer

Let's go back to the original problem and see the logic behind the most-often given wrong answer.

The birth chance is 0.5 for a boy and 0.5 for a girl. We don't know the gender of one of the children, but there must be a 0.5 probability it's a girl. Given that we already know one of the children is a girl, the probability of there being two girls must therefore be 0.5.

It sounds right because it sounds logical, but it isn't, for reasons I'll explain next.

The correct answer

The correct answer is 1/3. Let's see why.

In the probability tree above, we can see four equally likely combinations: {Boy-Boy} (0.25), {Boy-Girl} (0.25), {Girl-Boy} (0.25), and {Girl-Girl} (0.25). We're told in the problem that the {Boy-Boy} combination is ruled out, which leaves us with three remaining combinations. Each of these three remaining combinations is equally likely and it has to be one of them, which means the probability of two girls is 1/3.

There are two ways of having a boy and a girl, {Boy-Girl} and {Girl-Boy}, which means there's a 2/3 probability of having a boy and a girl (in any order). The mistake is to consider that a 0.5 probability.

Sample space

The underlying method to solve this problem is to use something called the 'sample space' which is the set of all possible outcomes of a trial. In our case, the set of all outcomes is {{Boy-Girl}, {Girl-Boy}, {Girl-Girl}}. We can associate probabilities with each of the elements of our sample space. In our case, they're all 1/3.

The sample space idea helps us solve various versions of the problem; here's an example. If we're told the eldest child is a girl, does this change anything? Actually, it does. Writing the eldest child first, the sample space becomes {{Girl-Boy}, {Girl-Girl}}, so the probability is now 1/2. Why? Because the combinations where the eldest child is a boy aren't possible.

How might you test this?

With problems like this that seem counter-intuitive, a good way forward is to actually test the theory. Plainly, it would be expensive to survey real families, but we can run a computer simulation. Here are the steps.

  1. Randomly create a large number of two-children families with the sample space {Boy-Boy}, {Boy-Girl}, {Girl-Boy}, {Girl-Girl} and probabilities 1/4, 1/4, 1/4, and 1/4.
  2. Select only the families that have at least one girl.
  3. Now figure out the fraction of all the selected families that are {Girl-Girl}.  
Interestingly, if you think about ways of testing a solution, it often helps you define the problem a bit better. I found just writing the test process down helped me confirm the correct answer.
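Here's a minimal Python sketch of those three steps (my code; the names are arbitrary):

import random

# Step 1: create a large number of two-children families at random.
families = [(random.choice("BG"), random.choice("BG")) for _ in range(1_000_000)]

# Step 2: select only the families that have at least one girl.
selected = [family for family in families if "G" in family]

# Step 3: the fraction of selected families that are {Girl-Girl}.
girl_girl = sum(1 for family in selected if family == ("G", "G"))
print(girl_girl / len(selected))  # close to 1/3, not 1/2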

Controversy, complexity, and meaning

I've presented a simple analysis here, but you should be aware that things can get a lot, lot more complex. The Wikipedia article on the Boy or girl paradox goes into some painful detail about the problem and the controversy around it. The short version: the precise wording of the problem matters a great deal.

This might seem abstract, but I've seen variations of this problem pop up in business, and I've had difficult conversations with non-technical people as a result. It's especially hard when the "common sense" error gives a more optimistic answer than the correct one. Realistically, the only way forward is prior education and the use of sample space arguments.

Probability theory, and conditional probability in particular, can give some very counter-intuitive results. Here's my advice if you're working with probabilities:

  • Be as precise as you can be and list all your assumptions.
  • Figure out how you might run a computer simulation to test your theory. Go back and look at the problem definition once you've defined your simulation.
  • Don't rely on "common sense".

Friday, August 15, 2025

Probability playground: a great app/website

What a great website!

I've played around with probability distributions for more than two decades and I'm still finding out new things. Wikipedia helps of course, but it's sometimes obscure and misses things out.

Recently, I came across some wonderful webpages at the University of Buffalo. The pages were created by Adam Cunn (https://www.acsu.buffalo.edu/~adamcunn/) and are all about probability distributions. I've learned some interesting things from his pages and I thought I would share on my blog.

Probability Playground

The Probability Playground pages are really an app. The home page (https://www.acsu.buffalo.edu/~adamcunn/probability/probability.html) shows you the "core" probability distributions and how they're related. Clicking on any distribution takes you into the app proper.

Let's click on the beta distribution and I'll show you some interesting stuff: https://www.acsu.buffalo.edu/~adamcunn/probability/beta.html

  • Click on the top right box labeled "Transformation". The dropdown box lists the beta distribution's limiting and special cases; for example, the normal distribution is a limiting case (click it and see). 
  • Look on the bottom left of the page and you'll see several examples of the beta distribution in the real world. These are all great examples.
  • You can explore the effect of changing the distribution parameters on the PDF and CDF. It's a vivid demonstration that the beta distribution is a family of distributions.
As you can see, there's more on the page to explore and you can really get a sense of how distributions differ from one another. 

Of course, there are hundreds of different distributions and this app only has the "greatest hits", but that's fine. It was created as a teaching tool and it serves its purpose very well.

Stop reading this and go try it yourself

I like this app a lot, so I'm going to stop writing and tell you to go off and click on the site for yourself. Here's the link again: https://www.acsu.buffalo.edu/~adamcunn/probability/probability.html

Thursday, February 13, 2025

Why assuming independence can be very, very bad for business

Independence in probability

Why should I care about independence?

Many models in the finance industry and elsewhere assume events are independent. When this assumption fails, catastrophic losses can occur, as we saw in 2008 and 1992. The problem is, developers and data scientists assume independence because it greatly simplifies problems, but the executive team often don't know this has happened, or even worse, don't understand what it means. As a result, the company ends up being badly caught out when circumstances change and independence no longer applies.

(Sergio Boscaino from Busseto, Italy, CC BY 2.0 , via Wikimedia Commons)

In this post, I'm going to explain what independence is, why people assume it, and how it can go spectacularly wrong. I'll provide some guidance for managers so they know the right questions to ask to avoid disaster. I've pushed the math to the end, so if math isn't your thing, you can leave early and still get the benefit.

What is independence?

Two events are independent if the outcome of one doesn't affect the other in any way. For example, if I throw two dice, the probability of me throwing a six on the second die isn't affected in any way by what I throw on the first die. 

Here are some examples of independent events:

  • Tossing a coin and getting a head, throwing a die and getting a two.
  • Drawing a king from a deck of cards, winning the lottery having bought a ticket.
  • Stopping at at least one red light on my way to the store, rain falling two months from now.
By contrast, some events are not independent (they're dependent):
  • Raining today and raining tomorrow. Rain today increases the chances of rain tomorrow.
  • Heavy snow today and a football match being played. Heavy snow will cause the match to be postponed.
  • Drawing a king from a deck of cards, then without replacing the card, drawing a king on the second draw.

Why people assume independence

People assume independence because the math is a lot, lot simpler. If two events are dependent, the analyst has to figure out the relationship between them, something that can be very challenging and time consuming to do. Other than knowing there's a relationship, the analyst might have no idea what it is and there may be no literature to guide them. If you have no idea what the relationship is, it's easier to assume there's none.

Sometimes, analysts assume independence because they don't know any better. If they're not savvy about probability theory, they may do a simple internet search on combining probabilities that suggests all they have to do is multiply probabilities, and they assume they can use this process all the time. This is assuming independence through ignorance. I believe people make this mistake in practice because I've interviewed candidates with MS degrees in statistics who've made this kind of blunder.

Money and fear can also drive the choice to assume independence. Imagine you're an analyst. Your manager is on your back to deliver a model as soon as possible. If you assume independence, your model will be available on time and you'll get your bonus; if you don't, you won't hit your deadline and you won't get your bonus. Now imagine the bad consequences of assuming independence won't be visible for a while. What would you do?

Harder examples

Do you think the following are independent?

  • Two unrelated people in different towns defaulting on their mortgage at the same time
  • Houses in different towns suffering catastrophic damage (e.g. fire, flood, etc.)

Most of the time, these events will be independent. For example, a house burning down because of poor wiring doesn't tell you anything about the risk of a house burning down in a different town (assuming a different electrician!). But there are circumstances when the independence assumption fails:

  • A hurricane hits multiple towns at once causing widespread catastrophic damage in different insurance categories (e.g. hurricane Andrew in 1992).
  • A recession hits, causing widespread lay-offs and mortgage defaults, especially for sub-prime mortgages (2008).

Why independence fails

Prior to 1992, the insurance industry had relatively simple risk models. They assumed independence; an assumption that worked well for some time. In an average year, they knew roughly how many claims there would be for houses, cars etc. Car insurance claims were independent of house insurance claims that in turn were independent of municipal and corporate insurance claims and so on. They were independent, at least in part, because there was no common causal mechanism.

When hurricane Andrew hit Florida in 1992, it destroyed houses, cars, schools, hospitals, etc. across multiple towns. The assumption of independence just wasn't true in this case. The insurance claims were sky-high and bankrupted several companies. 

(Hurricane Andrew, houses destroyed in Dade County, Miami. Image from FEMA. Source: https://commons.wikimedia.org/wiki/File:Hurricane_andrew_fema_2563.jpg)

To put it simply, the insurance computer models didn't adequately model the risk because they had independence baked in.  

Roll forward 15 years and something similar happened in the financial markets. Sub-prime mortgage lending was built on a set of assumptions, including default rates. The assumption was that mortgage defaults were independent of one another. Unfortunately, as the 2008 financial crisis hit, this was no longer valid. As more people were laid off, the economy went down, so more people were laid off. This was often called contagion, but perhaps a better analogy is the reverse of a well-known saying: "a rising tide floats all boats".


Financial Crisis Newspaper
(Image credit: Secret London 123, CC BY-SA 2.0, via Wikimedia Commons)

The assumption of independence simplified the analysis of sub-prime mortgages and gave the results that people wanted. The incentives weren't there to price in risk. Imagine your company was making money hand over fist and you stood up and warned people of the risks of assuming independence. Would you put your bonus and your future on the line to do so?

What to do - recommendations

Let's live in the real world and accept that assuming independence gets us to results that are usable by others quickly.

If you're a developer or a data scientist, you must understand the consequences of assuming independence and you must recognize that you're making that assumption.  You must also make it clear what you've done to your management.

If you're a manager, you must be aware that assuming independence can be dangerous but that it gets results quickly. You need to ask your development team about the assumptions they're making and when those assumptions fail. It also means accepting your role as a risk manager; that means investing in development to remove independence.

To get results quickly, it may well be necessary for an analyst to assume independence.  Once they've built the initial model (a proof of concept) and the money is coming in, then the task is to remove the independence assumption piece-by-piece. The mistake is to stop development.

The math

Let's say we have two events, A and B, with probabilities of occurring P(A) and P(B). 

If the events are independent, then the probability of them both occurring is:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]

This equation serves as both a definition of independence and test of independence as we'll see next.

Let's take two cases and see if they're independent:

  1. Rolling a die and getting a 1 and a 2
  2. Rolling a die and getting a (1 or 2) and a (2, 4, or 6)

For case 1, here are the probabilities:
  • \(P(A) = 1/6\)
  • \(P(B) = 1/6\)
  • \(P(A  \cap B) = 0\), it's not possible to get 1 and 2 at the same time
  • \(P(A)P(B) = (1/6) \times (1/6) = 1/36\)
So the equation \(P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\) isn't true, therefore the events are not independent.

For case 2, here are the probabilities:
  • \(P(A) = 1/3\)
  • \(P(B) = 1/2\)
  • \(P(A  \cap B) = 1/6\)
  • \(P(A)P(B) = (1/2) \times (1/3) = 1/6\)
So the equation is true, therefore the events are independent.
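As a quick check, here's a Python sketch (my code) that tests both cases by enumerating the sample space with exact fractions:

from fractions import Fraction

DIE = range(1, 7)

def prob(event):
    # Probability that a single roll of a fair die lands in `event`.
    return Fraction(sum(1 for roll in DIE if roll in event), 6)

def independent(a, b):
    # The test above: is P(A and B) equal to P(A)P(B)?
    return prob(a & b) == prob(a) * prob(b)

print(independent({1}, {2}))           # False: case 1 is not independent
print(independent({1, 2}, {2, 4, 6}))  # True: case 2 is independent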

Dependence uses conditional probability, so we have this kind of relationship:
\[P(A \ and \ B) = P(A  \cap B) = P(A | B) P(B)\]
The expression \(P(A | B)\) means the probability of A given that B has occurred (e.g. the probability the game is canceled given that it's snowed). There are a number of ways to approach finding \(P(A | B)\); the most popular over the last few years has been Bayes' Theorem, which states:
\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]
There's a whole methodology that goes with the Bayesian approach and I'm not going to go into it here, except to say that it's often iterative; we make an initial guess and progressively refine it in the light of new evidence. The bottom line is, this process is much, much harder and much more expensive than assuming independence. 

Monday, November 28, 2022

Is this coin biased?

Tossing and turning

A few months ago, someone commented on one of my blog posts and asked how you work out if a coin is biased or not. I've been thinking about the problem since then. It's not a difficult one, but it does bring up some core notions in probability theory and statistics which are very relevant to understanding how A/B testing works, or indeed any kind of statistical test. I'm going to talk you through how you figure out if a coin is biased, including an explanation of some of the basic ideas of statistical tests.

The trial

A single coin toss is an example of something called a Bernoulli trial, which is any kind of binary decision you can express as a success or failure (e.g. heads or tails). For some reason, most probability texts refer to heads as a success. 

We can work out what the probability is of getting different numbers of heads from a number of tosses, or more formally, what's the probability \(P(k)\) of getting \(k\) heads from \(n\) tosses, where \(0 \leq k \leq n\)? By hand, we can do it for a few tosses (three in this case):

 Number of heads (k)    Combinations       Count    Probability
 0                      TTT                1        1/8
 1                      HTT, THT, TTH      3        3/8
 2                      THH, HTH, HHT      3        3/8
 3                      HHH                1        1/8

But what about 1,000 or 1,000,000 tosses - we can't do this many by hand, so what can we do? As you might expect, there's a formula you can use: 
\[P(k) = \frac{n!} {k!(n-k)!} p^k (1-p)^{n-k}\]
\(p\) is the probability of success in any trial, for example, getting a head. For an unbiased coin \(p=0.5\); for a coin that's biased 70% heads \(p=0.7\). 
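You don't have to evaluate the formula by hand. Here's a sketch using scipy (the same library used later in this post); it reproduces the three-toss table above and then handles a case we couldn't do manually:

from scipy import stats

# The three-toss table: P(k) for k = 0, 1, 2, 3 heads with p = 0.5.
for k in range(4):
    print(k, stats.binom.pmf(k, 3, 0.5))  # 0.125, 0.375, 0.375, 0.125

# The same formula for 1,000 tosses: the probability of exactly 700 heads.
print(stats.binom.pmf(700, 1000, 0.5))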

If we plot this function for an unbiased coin (\(p=0.5\)), where \(n=100\) and \(0 \leq k \leq n\), we see this probability distribution:

This is called a binomial distribution and it looks a lot like the normal distribution for large (\(> 30\)) values of \(n\). 

I'm going to re-label the x-axis as a score equal to the fraction of heads: 0 means all tails, 0.5 means \(\frac{1}{2}\) heads, and 1 means all heads. With this slight change, we can more easily compare the shape of the distribution for different values of \(n\). 

I've created two charts below for an unbiased coin (\(p=0.5\)), one with \(n=20\) and one with \(n=40\). Obviously, the \(n=40\) chart is narrower, which is easier to see using the score as the x-axis. 

As an illustration of what these charts mean, I've colored all scores 0.7 and higher as red. You can see the red area is bigger for \(n=20\) than \(n=40\). Bear in mind, the red area represents the probability of a score of 0.7 or higher. In other words, if you toss a fair coin 20 times, you have a 0.058 chance of seeing a score of 0.7 or more, but if you toss a fair coin 40 times, the probability of seeing a 0.7 score drops to 0.008.


These charts tell us something useful: as we increase the number of tosses, the curve gets narrower, meaning the probability of getting results further away from \(0.5\) gets smaller. If we saw a score of 0.7 for 20 tosses, we might not be able to say the coin was biased, but if we got a score of 0.7 after 40 tosses, we know this score is very unlikely so the coin is more likely to be biased.

Thresholds

Let me re-state some facts:

  • For any coin (biased or unbiased) any score from 0 to 1 is possible for any number of tosses.
  • Some results are less likely than others; e.g. for an unbiased coin and 40 tosses, there's only a 0.008 chance of seeing a score of 0.7.

We can use probability thresholds to decide between biased and non-biased coins. We're going to use a 95% threshold (usually called the confidence level) to decide if the coin is biased or not. In the chart below, the red areas represent 5% probability, and the blue areas 95% probability.

Here's the idea for working out if the coin is biased. Set a significance threshold, usually 0.05. Toss the coin \(n\) times, record the number of heads, and work out a score. Draw the theoretical probability chart for the number of throws (like the one I've drawn above) and color in 95% of the probabilities blue and 5% red. If the experimental score lands in the red zones, we'll consider the coin to be biased; if it lands in the blue zone, we'll consider it unbiased.

This is probabilistic decision-making. Using a threshold of 0.05 means we'll wrongly say a fair coin is biased 5% of the time. Can we make the threshold stricter, could we use 0.01 for instance? Yes, we could, but the cost is increasing the number of trials.

As you might expect, there are shortcuts and we don't actually have to draw out the chart. In Python, you can use the binom_test function in scipy.stats. 

To simplify, binom_test has three arguments:

  • x - the number of successes
  • n - the number of samples
  • p - the hypothesized probability of success
It returns a p-value which we can use to make a decision.

Let's see how this works with a confidence of 0.05. Let's take the case where we have 200 coin tosses and 140 (70%) of them come up heads. We're hypothesizing that the coin is fair, so \(p=0.5\).

from scipy import stats
# 140 heads from 200 tosses, hypothesizing a fair coin (p=0.5).
# (Newer versions of SciPy replace binom_test with stats.binomtest.)
print(stats.binom_test(x=140, n=200, p=0.5))

The p-value we get is 1.5070615573524992e-08, which is way less than our significance threshold of 0.05 (we're in the red area of the chart above). We would then reject the idea the coin is fair.

What if we got 115 heads instead?

from scipy import stats
# 115 heads from 200 tosses, same fair-coin hypothesis.
print(stats.binom_test(x=115, n=200, p=0.5))

This time, the p-value is 0.10363903843786755, which is greater than our significance threshold of 0.05 (we're in the blue area of the chart), so the result is consistent with a fair coin (we fail to reject the null hypothesis).

What if my results are not significant? How many tosses?

Let's imagine you have reason to believe the coin is biased. You throw it 200 times and you see 115 heads. binom_test tells you you can't conclude the coin is biased. So what do you do next?

The answer is simple: toss the coin more times.

The formula for the sample size, \(n\), is:

\[n = \frac{p(1-p)} {\sigma^2}\]

where \(\sigma\) is the standard error. 

Here's how this works in practice. Let's assume we think our coin is just a little biased, to 0.55, and we want the standard error to be \(\pm 0.04\). The formula gives 154.7 tosses, so we round up to 155. What if we want more certainty, say \(\pm 0.005\)? Then the number of tosses goes up to 9,900. In general, the bigger the bias, the fewer tosses we need; the more certainty we want, the more tosses we need. 
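Here's the formula as a short Python sketch (the function name is mine, not from any library):

def tosses_needed(p, standard_error):
    # n = p(1 - p) / sigma^2
    return p * (1 - p) / standard_error**2

print(tosses_needed(0.55, 0.04))   # 154.6..., round up to 155
print(tosses_needed(0.55, 0.005))  # 9900.0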

If I think my coin is biased, what's my best estimate of the bias?

Let's imagine I toss the coin 1,000 times and see 550 heads. binom_test tells me the result is significant and it's likely my coin is biased, but what's my estimate of the bias? This is simple, it's actually just the mean, so 0.55. Using the statistics of proportions, I can actually put a 95% confidence interval around my estimate of the bias of the coin. Through math I won't show here, using the data we have, I can estimate the coin is biased 0.55 ± 0.03.
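For reference, here's that interval as a sketch; I'm assuming the standard normal approximation for a proportion, which reproduces the ±0.03 above:

from math import sqrt

heads, n = 550, 1000
p_hat = heads / n                              # best estimate of the bias
margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)  # 95% confidence interval
print(f"{p_hat} +/- {margin:.3f}")             # 0.55 +/- 0.031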

Is my coin biased?

This is a nice theoretical discussion, but how might you go about deciding if a coin is biased? Here's a step-by-step process.

  1. Decide on the level of certainty you want in your results. 95% is a good measure.
  2. Decide the minimum level of bias you want to detect. If the coin should return heads 50% of the time, what level of bias can you live with? If it's biased to 60%, is this OK? What about biased to 55% or 50.5%?
  3. Calculate the number of tosses you need.
  4. Toss your coin.
  5. Use binom_test to figure out if the coin deviates significantly from 0.5.

Friday, December 31, 2021

COVID and the base rate fallacy

Should we be concerned that vaccinated people are getting COVID?

I’ve spoken to people who’re worried that the COVID vaccines aren’t effective because some vaccinated people catch COVID and are hospitalized. Let’s look at the claim and see if it stands up to analysis.

Let's start with some facts:

Marc Rummy’s diagram

Marc Rummy created this diagram to explain what’s going on with COVID hospitalizations. He’s made it free to share, which is fantastic.

In this diagram, the majority of the population is vaccinated (91%). The hospitalization rate for the unvaccinated is 50% but for the vaccinated, it’s 10%. If the total population is 110, this leads to 5 unvaccinated people hospitalized and 10 vaccinated people hospitalized - in other words, 2/3 of those in hospital with COVID have been vaccinated. 

Explaining the result

Let’s imagine we just looked at hospitalizations: 5 unvaccinated and 10 vaccinated. This makes it look like vaccinations aren’t working – after all, the majority of people in hospital are vaccinated. You can almost hear ignorant journalists writing their headlines now (“Questions were raised about vaccine effectiveness when the health minister revealed the majority of patients hospitalized had been vaccinated.”). But you can also see anti-vaxxers seizing on these numbers to try and make a point about not getting vaccinated.

The reason the numbers are the way they are is that the great majority of people are vaccinated.

Let’s look at three different scenarios with the same population of 110 people and the same hospitalization rates for vaccinated and unvaccinated:

  • 0% vaccinated – 55 people hospitalized
  • 91% vaccinated – 15 people hospitalized
  • 100% vaccinated – 11 people hospitalized

Clearly, vaccinations reduce the number of hospitalizations. The anti-vaccine argument seems to be, if it doesn't reduce the risk to zero, it doesn't work - which is a strikingly weak and ignorant argument.

In this example, vaccination doesn’t reduce the risk of infection to zero, it reduces it by a factor of 5. In the real world, vaccination reduces the risk of infection by 5x and the risk of death due to COVID by 13x (https://www.nytimes.com/interactive/2021/us/covid-cases.html). The majority of people hospitalized now appear to be unvaccinated even though vaccination rates are only just above 60% in most countries (https://www.nytimes.com/interactive/2021/world/covid-cases.html, https://www.masslive.com/coronavirus/2021/09/breakthrough-covid-cases-in-massachusetts-up-to-about-40-while-unvaccinated-people-dominate-hospitalizations.html).

The bottom line is very simple: if you want to reduce your risk of hospitalization and protect your family and community, get vaccinated.

The base rate fallacy

The mistake the anti-vaxxers and some journalists are making is a very common one, it’s called the base rate fallacy (https://thedecisionlab.com/biases/base-rate-fallacy/). There are lots of definitions online, so I’ll just attempt a summary here: “the base rate fallacy is where someone draws an incorrect conclusion because they didn’t take into account the base rate in the general population. It’s especially a problem for conditional probability problems.”

Let’s use another example from a previous blog post:

“Imagine there's a town of 10,000 people. 1% of the town's population has a disease. Fortunately, there's a very good test for the disease:

  • If you have the disease, the test will give a positive result 99% of the time (sensitivity).
  • If you don't have the disease, the test will give a negative result 99% of the time (specificity).

You go into the clinic one day and take the test. You get a positive result. What's the probability you have the disease?” 

The answer is 50%.

The reason why the answer is 50% and not 99% is because 99% of the town’s population does not have the disease (the base rate), which means half of the positives will be false positives.
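Here's the example as straight counting in Python (a sketch of the quoted numbers):

population = 10_000
diseased = population * 0.01             # 100 people have the disease
healthy = population - diseased          # 9,900 people do not

true_positives = diseased * 0.99         # 99 correct positive results
false_positives = healthy * (1 - 0.99)   # 99 false positive results

print(true_positives / (true_positives + false_positives))  # 0.5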

What’s to be done?

Conditional probability (for example, the COVID hospitalization data) is screwy and can sometimes seem counter to common sense. The general level of statistical (and probability) knowledge in the population is poor. This leaves people trying to make sense of the data around them but without the tools to do it, so no wonder they’re confused.

It’s probably time that all schoolchildren are taught some basic statistics. This should include some counter-intuitive results (for example, the disease example above). Even if very few schoolchildren grow up to analyze data, it would be beneficial for society if more people understood that interpreting data can be hard and that sometimes surprising results occur – but that doesn’t make them suspicious or wrong.

More importantly, journalists need to do a much better job of telling the truth and explaining the data instead of chasing cheap clicks.

Monday, November 1, 2021

Why conditional probability screwiness matters for business

Things are not what they seem

Many business decisions come down to common sense or relatively simple math. But applying common sense to conditional probability problems can lead to very wrong results as we'll see. As data science becomes more and more important for business, decisions involving conditional probability will arise more often. In this blog post, I'm going to talk through some counter-intuitive conditional probability examples and where I can, I'll tell you how they arise in a business context.

(These two pieces of track are the same size. Ag2gaeh, CC BY-SA 4.0, via Wikimedia Commons.)

Testing for diseases

This is the problem with the clearest links to business. I'll explain the classical form of the problem and show you how it can come up in a business context.

Imagine there's some disease affecting a small fraction of the population, say 1%. A university develops a great test for the disease:

  • If you have the disease, the test will give you a positive result 99% of the time. 
  • If you don't have the disease, the test will give you a negative result 99% of the time.

You take the test and it comes back positive. What's the probability you have the disease? 

(COVID test kit. Centers for Disease Control and Prevention, Public domain, via Wikimedia Commons)

The answer is 50%.

If you want an explanation of the 50% number, read the section "The math", if you want to know how it comes up in business, skip to the section "How it comes up in business".

The math

What's driving the result is the low prevalence of the disease (1%). 99% of the people who take the test will be uninfected and it's this that pushes down the probability of having the disease if you test positive. 

There are at least two ways of analyzing this problem, one is using a tree diagram and one is using Bayes' Theorem. In a previous blog post, I went through the math in detail, so I'll just summarize the simpler explanation using a tree diagram. To make it easier to understand, I'll assume a population of 10,000.

Of the 10,000 people, 100 have the disease, and 9,900 do not. Of the 100, 99 will test positive for the disease. Of the 9,900, 99 will test positive for the disease. In total 99 + 99 will test positive, of which only 99 will have the disease. So 50% of those who test positive will have the disease.

How it comes up in business

Instead of disease tests, let's think of websites and algorithms. Imagine you're the CEO of a web-based business. 1% of the visitors to your website become customers. You want to identify who'll become a customer, so you task your data science team with developing an algorithm based on users' web behavior. You tell them the test is to distinguish customers from non-customers.

They come back with a complex test for customers that's 99% true for existing customers and 99% false for non-customers. Do you have a test that can predict who will become a customer and who won't?

This is the same problem as before, if the test is positive for a user, there's only a 50% chance they'll become a customer.

How many daughters?

This is a classic problem that shows the importance of describing a problem exactly. Exactly, in this case, means using very, very precise English.

Here's the problem in its original form from Martin Gardner:

  1. Mr. Jones has two children. The older child is a girl. What is the probability that both children are girls?
  2. Mr. Smith has two children. At least one of them is a boy. What is the probability that both children are boys?
(What's the probability of two girls? Circle of Robert Peake the elder, Public domain, via Wikimedia Commons)

The solution to the first problem is simple. Assuming boys or girls are equally likely, then it's 50%.

The second problem isn't simple and has generated a great deal of debate, even 60 years after Martin Gardner published the puzzle. Depending on how you read the question, the answer is either 50% or 33%. Here's the mathematician Tanya Khovanova's explanation:

"(i) Pick all the families with two children, one of which is a boy. If Mr. Smith is chosen randomly from this list, then the answer is 1/3.

(ii) Pick a random family with two children; suppose the father is Mr. Smith. Then if the family has two boys, Mr. Smith says, “At least one of them is a boy.” If he has two girls, he says, “At least one of them is a girl.” If he has a boy and a girl he flips a coin to say one or another of those two sentences. In this case, the probability that both children are the same sex is 1/2."

In fact, there are several other possible interpretations.

What does this mean for business? Some things that sound simple aren't and differences in the precise way a problem is formulated can give wildly different answers.

Airline seating

Here's the problem stated from an MIT handout:

"There are 100 passengers about to board a plane with 100 seats. Each passenger is assigned a distinct seat on the plane. The first passenger who boards has forgotten his seat number and sits in a randomly selected seat on the plane. Each passenger who boards after him either sits in his or her assigned seat if it is empty or sits in a randomly selected seat from the unoccupied seats. What is the probability that the last passenger to board the plane sits in her assigned seat?"

You can imagine a lot of seat confusion, so it seems natural to assume that the probability of the final passenger sitting in her assigned seat is tiny. 

(Ken Iwelumo, GFDL 1.2, via Wikimedia Commons)

Actually, the probability of her sitting in her assigned seat is 50%.

StackOverflow has a long discussion on the solution to the problem that I won't repeat here.

What does this mean for business? It's yet another example of our intuition letting us down.

The Monty Hall problem 

This is the most famous of all conditional probability problems and I've written about it before. Here's the problem as posed by Vos Savant:

"A quiz show host shows a contestant three doors. Behind two of them is a goat and behind one of them is a car. The goal is to win the car.

The host asked the contestant to choose a door, but not open it.

Once the contestant has chosen a door, the host opens one of the other doors and shows the contestant a goat. The contestant now knows that there’s a goat behind that door, but he or she doesn’t know which of the other two doors the car’s behind.

Here’s the key question: the host asks the contestant "do you want to change doors?".

Once the contestant decided whether to switch or not, the host opens the contestant's chosen door and the contestant wins the car or a goat.

Should the contestant change doors when asked by the host? Why?"

Here are the results.

  • If the contestant sticks with their initial choice, they have a ⅓ chance of winning.
  • If the contestant changes doors, they have a ⅔ chance of winning.
I go through the math in these two previous blog posts "The Monty Hall Problem" and "Am I diseased? An introduction to Bayes theorem".
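If the result still feels wrong, a simulation is persuasive. Here's a minimal sketch (my code, not from the posts above):

import random

def play(switch):
    doors = [0, 1, 2]
    car = random.choice(doors)
    choice = random.choice(doors)
    # The host opens a door that hides a goat and isn't the contestant's.
    opened = random.choice([d for d in doors if d != choice and d != car])
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == car

trials = 100_000
print(sum(play(False) for _ in range(trials)) / trials)  # ~1/3
print(sum(play(True) for _ in range(trials)) / trials)   # ~2/3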

Once again, this shows how counter-intuitive probability questions can be.

What should your takeaway be, what can you do?

Probability is a complex area and common sense can lead you wildly astray. Even problems that sound simple can be very hard. Things are made worse by ambiguity; what seems a reasonable problem description in English might actually be open to several possible interpretations which give very different answers.

(Sound judgment is needed when dealing with probability. You need to think like a judge, but you don't have to dress like one. InfoGibraltar, CC BY 2.0, via Wikimedia Commons)

If you do have a background in probability theory, it doesn't hurt to remind yourself occasionally of its weirder aspects. Recreational puzzles like the daughters' problem are a good refresher.

If you don't have a background in probability theory, you need to realize you're liable to make errors of judgment with potentially serious business consequences. It's important to listen to technical advice. If you don't understand the advice, you have three choices: get other advisors, get someone who can translate, or hand the decision to someone who does understand. 

Monday, July 12, 2021

What is beta in statistical testing?

\(\beta\) is \(\alpha\) if there's an effect

In hypothesis testing, there are two kinds of errors:

  • Type I - we say there's an effect when there isn't. The threshold here is \(\alpha\).
  • Type II - we say there's no effect when there really is an effect. The threshold here is \(\beta\).

This blog post is all about explaining and calculating \(\beta\).


The null hypothesis

Let's say we do an A/B test to measure the effect of a change to a website. Our control branch is the A branch and the treatment branch is the B branch. We're going to measure the conversion rate \(C\) on both branches. Here are our null and alternative hypotheses:

  • \(H_0: C_B - C_A = 0\) there is no difference between the branches
  • \(H_1: C_B - C_A \neq 0\) there is a difference between the branches

Remember, we don't know if there really is an effect, we're using procedures to make our best guess about whether there is an effect or not, but we could be wrong. We can say there is an effect when there isn't (Type I error) or we can say there is no effect when there is (Type II error).

Mathematically, we're taking the mean of thousands of samples so the central limit theorem (CLT) applies and we expect the quantity \(C_B - C_A\) to be normally distributed. If there is no effect, then \(C_B - C_A = 0\), if there is an effect \(C_B - C_A \neq 0\).

\(\alpha\) in a picture

Let's assume there is no effect. We can plot out our expected probability distribution and define an acceptance region (blue, 95% of the distribution) and two rejection regions (red, 5% of the distribution). If our measured \(C_B - C_A\) result lands in the blue region, we accept the null hypothesis and say there is no effect. If our result lands in the red region, we reject the null hypothesis and say there is an effect. The red region is defined by \(\alpha\).

One way of looking at the blue area is to think of it as a confidence interval around the mean \(\bar x_0\):

\[\bar x_0 + z_\frac{\alpha}{2} s \; and \; \bar x_0 + z_{1-\frac{\alpha}{2}} s \]

In this equation, \(s\) is the standard error in our measurement. The probability of a measurement \(x\) lying in this range is:

\[0.95 = P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \]

If we transform our measurement \(x\) to the standard normal \(z\), and we're using a 95% acceptance region (boundaries given by \(z\) values of 1.96 and -1.96), then we have for the null hypothesis:

\[0.95 = P[-1.96 < z < 1.96]\]

\(\beta\) in a picture

Now let's assume there is an effect. How likely is it that we'll say there's no effect when there really is an effect? This is the threshold \(\beta\).

To draw this in pictures, I want to take a step back. We have two hypotheses:

  • \(H_0: C_B - C_A = 0\) there is no difference between the branches
  • \(H_1: C_B - C_A \neq 0\) there is a difference between the branches

We can draw a distribution for each of these hypotheses. Only one distribution will apply, but we don't know which one.



If the null hypothesis is true, the blue region is where our true negatives lie and the red region is where the false positives lie. The boundaries of the red/blue regions are set by \(\alpha\). The value of \(\alpha\) gives us the probability of a false positive.

If the alternate hypothesis is true, the true positives will be in the green region and the false negatives will be in the orange region. The boundary of the green/orange regions is set by \(\beta\). The value of \(\beta\) gives us the probability of a false negative.

Calculating \(\beta\)

Calculating \(\beta\) is calculating the orange area of the alternative hypothesis chart. The boundaries are set by \(\alpha\) from the null hypothesis. This is a bit twisty, so I'm going to say it again with more words to make it easier to understand.

\(\beta\) is about false negatives. A false negative occurs when there is an effect, but we say there isn't. When we say there isn't an effect, we're saying the null hypothesis is true. For us to say there isn't an effect, the measured result must lie in the blue region of the null hypothesis distribution.

To calculate \(\beta\), we need to know what fraction of the alternate hypothesis lies in the acceptance region of the null hypothesis distribution.

Let's take an example so I can show you the process step by step.

  1. Assuming the null hypothesis, set up the boundaries of the acceptance and rejection region. Assuming a 95% acceptance region and an estimated mean of x, this gives the acceptance region as:
    \[P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \] which is the mean and 95% confidence interval for the null hypothesis. Our measurement \(x\) must lie between these bounds.
  2. Now assume the alternate hypothesis is true. If the alternate hypothesis is true, then our mean is \(\bar x_1\).
  3. We're still using this equation from before, but this time, our distribution is the alternate hypothesis.
    \[P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \]
  4. Transforming to the standard normal distribution using the formula \(z = \frac{x - \bar x_1}{\sigma}\), we can write the probability \(\beta\) as:
    \[\beta = P \left [ \frac{\bar x_0 + z_\frac{\alpha}{2} s - \bar x_1}{s} < z < \frac{ \bar x_0 + z_{1-\frac{\alpha}{2}} s - \bar x_1}{s} \right ] \]

This time, let's put some numbers in. 

  • \(n = 200,000\) (100,000 per branch)
  • \(C_B = 0.062\)
  • \(C_A =  0.06\)
  • \(\bar x_0= 0\) - the null hypothesis
  • \(\bar x_1 = 0.002\) - the alternate hypothesis
  • \(s = 0.00107\)  - this comes from combining the standard errors of both branches, so \(s^2 = s_A^2 + s_B^2\), and I'm using the usual formula for the standard error of a proportion, for example, \(s_A = \sqrt{\frac{C_A(1-C_A)}{n} }\)

Plugging them all in, this gives:
\[\beta = P[ -3.829 < z < 0.090]\]
which gives \(\beta = 0.536\).
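Here's the calculation as a Python sketch; the variable names are mine, chosen to mirror the symbols above:

from math import sqrt
from scipy import stats

n = 100_000                  # samples per branch
C_A, C_B = 0.06, 0.062
x0, x1 = 0.0, 0.002          # null and alternate hypothesis means

s = sqrt(C_A * (1 - C_A) / n + C_B * (1 - C_B) / n)  # combined standard error
z = 1.96                                             # alpha = 0.05, two-sided

lower = (x0 - z * s - x1) / s   # -3.82...
upper = (x0 + z * s - x1) / s   #  0.09...
print(stats.norm.cdf(upper) - stats.norm.cdf(lower))  # ~0.536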

This is too hard

This process is complex and involves lots of steps. In my view, it's too complex. It feels to me that there must be an easier way of constructing tests. Bayesian statistics holds out the hope for a simpler approach, but widespread adoption of Bayesian statistics is probably a generation or two away. We're stuck with an overly complex process using very difficult language.

Monday, May 17, 2021

Counting on Poisson

Why use the Poisson distribution?

Because it has properties that make it great to work with, data scientists use the Poisson distribution to model different kinds of counting data. But these properties can be seductive, and sometimes people model data using the Poisson distribution when they shouldn't. In this blog post, I'll explain why the Poisson distribution is so popular and why you should think twice before using it.

(Siméon-Denis Poisson by E. Marcellot, Public domain, via Wikimedia Commons)

Poisson processes

The Poisson distribution is a discrete event probability distribution used to model events created using a Poisson process. Drilling down a level, a Poisson process is a series of events that have these properties:

  • They occur at random but at a constant mean rate,
  • They are independent of one another, 
  • Two (or more) events can't occur at the same time

Good examples of Poisson processes are website visits, radioactive decay, and calls to a help center. 

The properties of a Poisson distribution

Mathematically, the Poisson probability mass function looks like this:

\[ P_r (X=k) = \frac{\lambda^k e^{- \lambda}}{k!} \]

where 

  • \(k\) is the number of events (always an integer)
  • \(\lambda\) is the mean value (or expected rate)

It's a discrete distribution, so it's only defined for integer values of \(k\).

Graphically, it looks like this for \(\lambda=6\). Note that it isn't symmetrical and it stops at 0; you can't have -1 events.

(Let's imagine we were modeling calls per hour in a call center. In this case, \(k\) is the measured calls per hour, \(P\) is their frequency of occurrence, and \(\lambda\) is the mean number of calls per hour).

Here are some of the Poisson distribution's properties:

  • Mean: \(\lambda\)
  • Variance: \(\lambda\)
  • Mode: floor(\(\lambda\))

The fact that some of the key properties are given by \(\lambda\) alone makes using it easy. If your data follows a Poisson distribution, once you know the mean value, you've got the variance (and standard deviation), and the mode too. In fact, you've pretty much got a full description of your data's distribution with just a single number.
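Here's a short sketch using the call-center example (the numbers are illustrative):

import math
from scipy import stats

lam = 6  # mean number of calls per hour

# P(X = 4) from the probability mass function above, two ways.
k = 4
print(lam**k * math.exp(-lam) / math.factorial(k))  # 0.1338...
print(stats.poisson.pmf(k, lam))                    # same value via scipy

# Mean and variance are both lambda.
print(stats.poisson.mean(lam), stats.poisson.var(lam))  # 6.0 6.0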

When to use it and when not to use it

Because you can describe the entire distribution with just a single number, it's very tempting to assume that any data that involves counting follows a Poisson distribution because it makes analysis easier.  Sadly, not all counts follow a Poisson distribution. In the list below, which counts do you think might follow a Poisson distribution and which might not?

  • The number of goals in English Premier League soccer matches.
  • The number of earthquakes of at least a given size per year around the world.
  • Bus arrivals.
  • The number of web pages a person visits before they make a purchase.

Bus arrivals are not well modeled by a Poisson distribution because in practice they're not independent of one another and don't occur at a constant rate. Bus operators change bus frequencies throughout the day, with more buses scheduled at busy times; they may also hold buses at stops to even out arrival times. Interestingly, bus arrivals are one of the textbook examples of a Poisson process, which shows that you need to think before applying a model.

The number of web pages a person visits before they make a purchase is better modeled using a negative binomial distribution.

Earthquakes are well-modeled by a Poisson distribution. Earthquakes in different parts of the world are independent of one another and geological forces are relatively constant, giving a constant mean rate for quakes. It's possible that two earthquakes could happen simultaneously in different parts of the world, which shows that even if one of the criteria might not apply, data can still be well-modeled by Poisson.

What about soccer matches? We know two goals can't happen at the same time. The length of matches is fixed and soccer is a low-scoring game, so the assumption of a constant rate for goals is probably OK. But what about independence? If you've watched enough soccer, you know that the energy level in a game steps up as soon as a goal is scored. Is this enough to violate the independence requirement? Apparently not, scores in soccer matches are well-modeled by a Poisson distribution.

What should a data scientist do?

Just because the data you're modeling is a count doesn't mean it follows a Poisson distribution. More generally, you should be wary of making choices motivated by convenience. If you have count data, look at the properties of your data before deciding on a distribution to model it with. 

Monday, January 4, 2021

COVID and soccer home team advantage - winning less often

Home advantage

Is it easier for a sports team to win at home? The evidence from sports as diverse as soccer [Pollard], American football [Vergina], rugby [Thomas], and ice hockey [Leard] strongly suggests there is a home advantage and it might be quite large. But what causes it? Is it the crowd cheering the home team, or closeness to home, or playing on familiar turf? One of the weirder side-effects of COVID is the insight it's providing into the origins of home advantage, as we'll see.

(Premier League teams playing in happier times. Image source: Wikimedia Commons, License: Creative Commons, Author: Brian Minkoff)

The EPL - lots of data makes analysis easier

The English Premier League is the world's wealthiest sports league [Robinson]. There's worldwide interest in the league, and there has been for a long time, so there's a lot of data available, which makes it ideal for investigating home advantage. One of the nice features of the league is that each team plays every other team twice, once at home and once away.

Expectation and metric

If there were no home team advantage, we would expect the number of home wins and away wins to be roughly equal for the whole league in a season. To investigate home advantage, the metric I'll use is:
\[home \ win \ proportion = \frac{number\ of\ home\ wins}{total\ number\ of\ wins}\]
If there were no home team advantage, we would expect this number to be close to 0.5.
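Here's the metric and its 95% confidence interval as a sketch, with made-up numbers standing in for a season's results (a real analysis would load actual EPL data):

from math import sqrt

home_wins, away_wins = 220, 140   # illustrative, not real EPL counts
total_wins = home_wins + away_wins

proportion = home_wins / total_wins
margin = 1.96 * sqrt(proportion * (1 - proportion) / total_wins)
print(f"{proportion:.3f} +/- {margin:.3f}")  # 0.611 +/- 0.050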

EPL home team advantage

Let's look at the mean home-win proportion per season for the EPL. In the chart, the error bars are the 95% confidence interval.

For most seasons, the home win proportion is about 0.6 and it's significantly above 0.5 (in the statistical sense). In other words, there's a strong home-field advantage in the EPL.

But look at the point on the right. What's going on in 2020-2021?

COVID and home wins

Like everything else in the world, the EPL has been affected by COVID. Teams are playing behind closed doors for the 2020-2021 season. There are no fans singing and chanting in the terraces, there are no fans 'oohing' over near misses, and there are no fans cheering goals. Teams are still playing matches home and away but in empty and silent stadiums.

So how has this affected home team advantage?

Take a look at the chart above. The 2020-2021 season is the season on the right. Obviously, we're still partway through the season, which is why the error bars are so big, but look at the mean value. If there were no home team advantage, we would expect a mean of 0.5. For 2020-2021, the mean is currently 0.491. 

Let me put this simply. When there are fans in the stadiums, there's a home team advantage. When there are no fans in the stadiums, the home team advantage disappears.

COVID and goals

What about goals? It's possible that a team that might have lost is so encouraged by their fans that they reach a draw instead. Do teams playing at home score more goals?

I worked out the mean goal difference between the home team and the away team and I've plotted it for every season from 2000-2001 onwards.

If there were no home team advantage, you would expect the goal difference to be 0. But it isn't. It mostly hovers around 0.35. Except of course for 2020-2021. For 2020-2021, the goal difference is about zero. The home-field advantage has gone.

What this means

Despite the roll-out of the vaccine, it's almost certain the rest of the 2020-2021 season will be played behind closed doors (assuming the season isn't abandoned). My results are for a partial season, but it's a good bet the final results will be similar. If this is the case, then it will be very strong evidence that fans cheering their team really do make a difference.

If you want your team to win, you need to go to their games and cheer them on. 

References

[Leard] Leard B, Doyle JM. The Effect of Home Advantage, Momentum, and Fighting on Winning in the National Hockey League. Journal of Sports Economics. 2011;12(5):538-560.

[Pollard] Richard Pollard and Gregory Pollard, Home advantage in soccer: a review of its existence and causes, International Journal of Soccer and Science, Vol. 3, No. 1, 2005, pp. 28-44.

[Robinson] Joshua Robinson, Jonathan Clegg, The Club: How the English Premier League Became the Wildest, Richest, Most Disruptive Force in Sports, Mariner Books, 2019.

[Thomas] Thomas S, Reeves C, Bell A. Home Advantage in the Six Nations Rugby Union Tournament. Perceptual and Motor Skills. 2008;106(1):113-116.

[Vergina] Roger C. Vergina, John J. Sosika, No place like home: an examination of the home field advantage in gambling strategies in NFL football, Journal of Economics and Business, Volume 51, Issue 1, January-February 1999, pp. 21-31.

Monday, December 7, 2020

How to (maybe) win at dice

Does God play dice with the universe?

Imagine I gave you an ordinary die, not special in any way, and I asked you to throw the die and record your results (how many 1s, how many 2s, etc.). What would you expect the results to be? Do you think you could win by choosing some numbers rather than others? Are you sure?

(Image source: Wikimedia Commons. Author: Diacritica. License: Creative Commons.)

What you might expect

Let's say you threw the die 12,000 times; you might expect a probability distribution something like this. This is a uniform distribution where all results are equally likely.

You know you'll never get an absolutely perfect distribution, so in reality, your results might look something like this for 12,000 throws. 

The deviations from the expected values are random noise that we can quantify. Further, we know that by adding more dice throws, random noise gets less and less and we approach the ideal uniform distribution more closely. 

I've simulated dice throws in the plots below; the top chart is 12,000 throws and the bottom chart is 120,000 throws. The blue bars represent the actual results, the black circle represents the expected value, and the black line is the 95% confidence interval. Note how the results for 120,000 throws are closer to the ideal than the results from 12,000 throws.
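Here's a sketch of the kind of simulation behind charts like these (not the exact plotting code):

import numpy as np

rng = np.random.default_rng()
n = 120_000
throws = rng.integers(1, 7, size=n)   # fair die: 1 to 6 inclusive
counts = np.bincount(throws)[1:]      # counts for faces 1..6

expected = n / 6
sd = np.sqrt(n * (1 / 6) * (5 / 6))   # binomial standard deviation per face
print(counts)
print(f"expected {expected:.0f} +/- {1.96 * sd:.0f}")  # 95% interval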


What happened in reality - not what you expect

My results are simulations, but what happens when you throw dice thousands of times in the real world?

There's a short history of probability theorists and statisticians throwing dice and recording the results.

  • Weldon threw 12 dice 26,306 times by hand and sent the results to his friend Francis Galton.
  • Iversen ran an experiment where 219 dice were rolled 20,000 times.

Weldon's data set is widely used to illustrate statistical concepts, especially after Pearson used it to explain his \(\chi^2\) technique in 1900. 

Despite the excitement you see at the craps tables in Las Vegas, throwing dice thousands of times is dull and is, therefore, an ideal job for a computer. In 2009, Zachariah Labby created apparatus for throwing dice and recording the scores using a camera and image processing. You can read more about his apparatus and experimental setup here. He 'threw' 12 dice 26,306 times and his machine recorded the results. 

In the chart below, the blue bars are his results, the black circle is the expected result, and the black line is the 95% confidence interval. I've taken the results from all 12 dice, so my throw count is \(12 \times 26,306\).

This doesn't look like a uniform distribution. To state the obvious, 1 and 6 occurred more frequently than theory would suggest - the deviation from the uniform distribution is statistically significant. The dice he used were not special dice, they were off-the-shelf standard unbiased dice. What's going on?

Unbiased dice are biased

Take a very close look at a normal die, the type pictured at the start of this post: the kind of die you buy in shops.

By convention, opposite faces on dice sum to 7, so 1 is opposite 6, 3 is opposite 4, and so on. Now look very closely again at the picture at the start of the post. Look at the dots on the faces of the die. Notice how they're indented. Each hole is the same size, but obviously, the number of holes on each face is different. Let's think of this in terms of weight. Imagine we could weigh each face of the die. Let's pair up the faces, each side with the face opposite it. Now let's weigh the faces and compare them.


The greatest imbalance in weights is the 1-6 combination. This imbalance is what's causing the bias.

Obviously, the bias is small, but if you roll the die enough times, even a small bias becomes obvious. 

Vegas here I come - or not...

So we know for dice bought in shops that 1 and 6 are ever so slightly more likely to occur than theory suggests. Now you know this, why aren't you booking your flight to Las Vegas? You could spend a week at the craps tables and make a little money.

Not so fast.

Let's look at the dice they use in Vegas.

(Image source: Wikimedia Commons. Author: Alper Atmaca License: Creative Commons.)

Notice that the dots are not indented. They're filled with colored material that's the same density as the rest of the die. In other words, there's no imbalance: Vegas dice will give a uniform distribution, and 1 and 6 will occur as often as 2, 3, 4, or 5. You're going to have to keep punching the clock.

Some theory

Things are going to get mathematical from here on in. There won't be any new stories about dice or Vegas.

How did I get the expected count and error bars for each die score? Let's say I threw the die \(n\) times; it seems obvious we would get an expected count of \(\frac{n}{6}\) for each score, but why? What about the standard error?

Let's re-think the dice as a Bernoulli trial. Let's choose a score, say 1. If we throw the dice and it shows a 1, we consider that a success. If it shows anything else, we consider it a failure. Because we have a Bernoulli trial, we can use the binomial distribution to model the results.

Using Wikipedia's notation:

  • \(n\) is the number of throws
  • \(p\) is the probability of getting a 1, which is \(\frac{1}{6}\)
  • \(q = 1- p\) is the probability of getting 2-6, which is \(\frac{5}{6}\)

So, again using Wikipedia's handy summary, for \(n\) throws:

  • The mean is \(np = 12 \times 26,306 \times \frac{1}{6} = 52,612\)
  • The standard deviation is \(\sqrt{npq} = \sqrt{12 \times 26,306 \times \frac{1}{6} \times \frac{5}{6}} = 209.388\) 
  • The 95% confidence interval is \(52,202\) to \(53,022\) (the mean plus or minus 1.96 standard deviations).
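Here's that arithmetic in Python (a quick sketch):

from math import sqrt

n = 12 * 26_306   # total die-throws in Labby's data
p = 1 / 6
q = 1 - p

mean = n * p
sd = sqrt(n * p * q)
print(mean)                                # 52612.0
print(round(sd, 3))                        # 209.388
print(mean - 1.96 * sd, mean + 1.96 * sd)  # ~52,202 to ~53,022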

Publications

Academics live or die by publications and by citations of their publications. Labby's work has rightly been widely cited on the internet. I keep hoping that some academic will be inspired by Labby and use modern robotic technology and image recognition to do huge (million-plus) classical experiments, like tossing coins or selecting balls from an urn. It seems like an easy win to be widely cited!