Showing posts with label probability. Show all posts
Showing posts with label probability. Show all posts

Monday, July 12, 2021

What is beta in statistical testing?

\(\beta\) is \(\alpha\) if there's an effect

In hypothesis testing, there are two kinds of errors:

  • Type I - we say there's an effect when there isn't. The threshold here is \(\alpha\).
  • Type II - we say there's no effect when there really is an effect. The threshold here is \(\beta\).
This blog post is all about explaining and calculating \(\beta\).

The null hypothesis

Let's say we do an A/B test to measure the effect of a change to a website. Our control branch is the A branch and the treatment branch is the B branch. We're going to measure the conversion rate \(C\) on both branches. Here are our null and alternative hypotheses:

  • \(H_0: C_B - C_A = 0\) there is no difference between the branches
  • \(H_1: C_B - C_A \neq 0\) there is a difference between the branches

Remember, we don't know if there really is an effect, we're using procedures to make our best guess about whether there is an effect or not, but we could be wrong. We can say there is an effect when there isn't (Type I error) or we can say there is no effect when there is (Type II error).

Mathematically, we're taking the mean of thousands of samples so the central limit theorem (CLT) applies and we expect the quantity \(C_B - C_A\) to be normally distributed. If there is no effect, then \(C_B - C_A = 0\), if there is an effect \(C_B - C_A \neq 0\).

\(\alpha\) in a picture

Let's assume there is no effect. We can plot out our expected probability distribution and define an acceptance region (blue, 95% of the distribution) and two rejection regions (red, 5% of the distribution). If our measured \(C_B - C_A\) result lands in the blue region, we will accept the null hypothesis and say there is no effect, If our result lands in the red region, we'll reject the null hypothesis and say there is an effect. The red region is defined by \(\alpha\).

One way of looking at the blue area is to think of it as a confidence interval around the mean \(x_0\):

\[\bar x_0 + z_\frac{\alpha}{2} s \; and \; \bar x_0 + z_{1-\frac{\alpha}{2}} s \]

In this equation, s is the standard error in our measurement. The probability of a measurement \(x\) lying in this range is:

\[0.95 = P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \]

If we transform our measurement \(x\) to the standard normal \(z\), and we're using a 95% acceptance region (boundaries given by \(z\) values of 1.96 and -1.96), then we have for the null hypothesis:

\[0.95 = P[-1.96 < z < 1.96]\]

\(\beta\) in a picture

Now let's assume there is an effect. How likely is it that we'll say there's no effect when there really is an effect? This is the threshold \(\beta\).

To draw this in pictures, I want to take a step back. We have two hypotheses:

  • \(H_0: C_B - C_A = 0\) there is no difference between the branches
  • \(H_1: C_B - C_A \neq 0\) there is a difference between the branches

We can draw a distribution for each of these hypotheses. Only one distribution will apply, but we don't know which one.



If the null hypothesis is true, the blue region is where our true negatives lie and the red region is where the false positives lie. The boundaries of the red/blue regions are set by \(\alpha\). The value of \(\alpha\) gives us the probability of a false positive.

If the alternate hypothesis is true, the true positives will be in the green region and the false negatives will be in the orange region. The boundary of the green/orange regions is set by \(\beta\). The value of \(\beta\) gives us the probability of a false negative.

Calculating \(\beta\)

Calculating \(\beta\) is calculating the orange area of the alternative hypothesis chart. The boundaries are set by \(\alpha\) from the null hypothesis. This is a bit twisty, so I'm going to say it again with more words to make it easier to understand.

\(\beta\) is about false negatives. A false negative occurs when there is an effect, but we say there isn't. When we say there isn't an effect, we're saying the null hypothesis is true. For us to say there isn't an effect, the measured result must lie in the blue region of the null hypothesis distribution.

To calculate \(\beta\), we need to know what fraction of the alternate hypothesis lies in the acceptance region of the null hypothesis distribution.

Let's take an example so I can show you the process step by step.

  1. Assuming the null hypothesis, set up the boundaries of the acceptance and rejection region. Assuming a 95% acceptance region and an estimated mean of x, this gives the acceptance region as:
    \[P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \] which is the mean and 95% confidence interval for the null hypothesis. Our measurement \(x\) must lie between these bounds.
  2. Now assume the alternate hypothesis is true. If the alternate hypothesis is true, then our mean is \(\bar x_1\).
  3. We're still using this equation from before, but this time, our distribution is the alternate hypothesis.
    \[P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] ] \]
  4. Transforming to the standard normal distribution using the formula \(z = \frac{x - \bar x_1}{\sigma}\), we can write the probability \(\beta\) as:
    \[\beta = P \left [ \frac{\bar x_0 + z_\frac{\alpha}{2} s - \bar x_1}{s} < z < \frac{ \bar x_0 + z_{1-\frac{\alpha}{2}} s - \bar x_1}{s} \right ] \]

This time, let's put some numbers in. 

  • \(n = 200,000\) (100,000 per branch)
  • \(C_B = 0.062\)
  • \(C_A =  0.06\)
  • \(\bar x_0= 0\) - the null hypothesis
  • \(\bar x_1 = 0.002\) - the alternate hypothesis
  • \(s = 0.00107\)  - this comes from combining the standard errors of both branches, so \(s^2 = s_A^2 + s_B^2\), and I'm using the usual formula for the standard error of a proportion, for example \(s_A = \sqrt{\frac{C_A(1-C_A)}{n} }\)
Plugging them all in, this gives:
\[\beta = P[ -3.829 < z < 0.090]\]
which gives \(\beta = 0.536\)

This is too hard

This process is complex and involves lots of steps. In my view, it's too complex. It feels to me that there must be an easier way of constructing tests. Bayesian statistics holds out the hope for a simpler approach, but widespread adoption of Bayesian statistics is probably a generation or two away. We're stuck with an overly complex process using very difficult language.

Reading more

Monday, May 17, 2021

Counting on Poisson

Why use the Poisson distribution?

Because it has properties that make it great to work with, data scientists use the Poisson distribution to model different kinds of counting data. But these properties can be seductive, and sometimes people model data using the Poisson distribution when they shouldn't. In this blog post, I'll explain why the Poisson distribution is so popular and why you should think twice before using it.

(Siméon-Denis Poisson by E. Marcellot, Public domain, via Wikimedia Commons)

Poisson processes

The Poisson distribution is a discrete event probability distribution used to model events created using a Poisson process. Drilling down a level, a Poisson process is a series of events that have these properties:

  • They occur at random but at a constant mean rate,
  • They are independent of one another, 
  • Two (or more) events can't occur at the same time

Good examples of Poisson processes are website visits, radioactive decay, and calls to a help center. 

The properties of a Poisson distribution

Mathematically, the Poisson probability mass function looks like this:

\[ P_r (X=k) = \frac{\lambda^k e^{- \lambda}}{k!} \]

where 

  • k is the number of events (always an integer)
  • \(\lambda\) is the mean value (or expected rate)

It's a discrete distribution, so it's only defined for integer values of \(k\).

Graphically, it looks like this for \(\lambda=6\). Note that it isn't symmetrical and it stops at 0, you can't have -1 events.

(Let's imagine we were modeling calls per hour in a call center. In this case, \(k\) is the measured calls per hour, \(P\) is their frequency of occurrence, and \(\lambda\) is the mean number of calls per hour).

Here are some of the Poisson distribution's properties:

  • Mean: \(\lambda\)
  • Variance: \(\lambda\)
  • Mode: floor(\(\lambda\))

The fact that some of the key properties are given by \(\lambda\) alone makes using it easy. If your data follows a Poisson distribution, once you know the mean value, you've got the variance (and standard deviation), and the mode too. In fact, you've pretty much got a full description of your data's distribution with just a single number.

When to use it and when not to use it

Because you can describe the entire distribution with just a single number, it's very tempting to assume that any data that involves counting follows a Poisson distribution because it makes analysis easier.  Sadly, not all counts follow a Poisson distribution. In the list below, which counts do you think might follow a Poisson distribution and which might not?

  • The number of goals in English Premier League soccer matches.
  • The number of earthquakes of at least a given size per year around the world.
  • Bus arrivals.
  • The number of web pages a person visits before they make a purchase.

Bus arrivals are not well modeled by a Poisson distribution because in practice they're not independent of one another and don't occur at a constant rate. Bus operators change bus frequencies throughout the day, with more buses scheduled at busy times; they may also hold buses at stops to even out arrival times. Interestingly, bus arrivals are one of the textbook examples of a Poisson process, which shows that you need to think before applying a model.

The number of web pages a person visits before they make a purchase is better modeled using a negative binomial distribution

Earthquakes are well-modeled by a Poisson distribution. Earthquakes in different parts of the world are independent of one another and geological forces are relatively constant, giving a constant mean rate for quakes. It's possible that two earthquakes could happen simultaneously in different parts of the world, which shows that even if one of the criteria might not apply, data can still be well-modeled by Poisson.

What about soccer matches? We know two goals can't happen at the same time. The length of matches is fixed and soccer is a low-scoring game, so the assumption of a constant rate for goals is probably OK. But what about independence? If you've watched enough soccer, you know that the energy level in a game steps up as soon as a goal is scored. Is this enough to violate the independence requirement? Apparently not, scores in soccer matches are well-modeled by a Poisson distribution.

What should a data scientist do?

Just because the data you're modeling is a count doesn't mean it follows a Poisson distribution. More generally, you should be wary of making choices motivated by convenience. If you have count data, look at the properties of your data before deciding on a distribution to model it with. 

If you liked this blog post you might like

Monday, January 4, 2021

COVID and soccer home team advantage - winning less often

Home advantage

Is it easier for a sports team to win at home? The evidence from sports as diverse as soccer [Pollard], American football [Vergina], rugby [Thomas], and ice hockey [Leard] strongly suggest there is a home advantage and it might be quite large. But what causes it? Is it the crowd cheering the home team, or closeness to home, or playing on familiar turf? One of the weirder side-effects of COVID is the insight it's proving into the origins of home advantage, as we'll see.

(Premier League teams playing in happier times. Image source: Wikimedia Commons, License: Creative Commons, Author: Brian Minkoff)

The EPL - lots of data makes analysis easier

The English Premier League is the world's wealthiest sports' league [Robinson].  There's worldwide interest in the league and there has been for a long time, so there's a lot of data available, which makes it ideal for investigating home advantage. One of the nice features of the league is that each team plays every other team twice, once at home and once away.

Expectation and metric

If there were no home team advantage, we would expect the number of home wins and away wins to be roughly equal for the whole league in a season. To investigate home advantage, the metric I'll use is:
\[home \ win \ proportion = \frac{number\ of\ home\ wins}{total\ number\ of\ wins}\]
If there were no home team advantage, we would expect this number to be close to 0.5.

EPL home team advantage

Let's look at the mean home win proportion per season for the EPL. In the chart, the error bars are the 95% confidence interval.
For most seasons, the home win proportion is about 0.6 and it's significantly above 0.5 (in the statistical sense). In other words, there's a strong home-field advantage in the EPL.

But look at the point on the right. What's going on in 2020-2021?

COVID and home wins

Like everything else in the world, the EPL has been affected by COVID. Teams are playing behind closed doors for the 2020-2021 season. There are no fans singing and chanting in the terraces, there are no fans 'ohhing' over near misses, and there are no fans cheering goals. Teams are still playing matches home and away but in empty and silent stadiums.

So how has this affected home team advantage?

Take a look at the chart above. The 2020-2021 season is the season on the right. Obviously, we're still partway through the season, which is why the error bars are so big, but look at the mean value. If there were no home team advantage, we would expect a mean of 0.5. For 2020-2021, the mean is currently 0.491. 

Let me put this simply. When there are fans in the stadiums, there's a home team advantage. When there are no fans in the stadiums, the home team advantage disappears.

COVID and goals

What about goals? It's possible that a team that might have lost is so encouraged by their fans that they reach a draw instead. Do teams playing at home score more goals?

I worked out the mean goal difference between the home team and the away team and I've plotted it for every season from 2000-2001 onwards.
If there were no home team advantage, you would expect the goal difference to be 0. But it isn't. It mostly hovers around 0.35. Except of course for 2020-2021. For 2020-2021, the goal difference is about zero. The home-field advantage has gone.

What this means

Despite the roll-out of the vaccine, it's almost certain the rest of the 2020-2021 season will be played behind closed doors (assuming the season isn't abandoned). My results are for a partial season, but it's a good bet the final results will be similar. If this is the case, then it will be very strong evidence that fans cheering their team really do make a difference.

If you want your team to win, you need to go to their games and cheer them on. 

References

[Leard] Leard B, Doyle JM. The Effect of Home Advantage, Momentum, and Fighting on Winning in the National Hockey League. Journal of Sports Economics. 2011;12(5):538-560.

[Pollard] Richard Pollard and Gregory Pollard, Home advantage in soccer: a review of its existence and causes, International Journal of Soccer and Science Journal Vol. 3 No 1 2005, pp28-44

[Robinson] Joshua Robinson, Jonathan Clegg, The Club: How the English Premier League Became the Wildest, Richest, Most Disruptive Force in Sports, Mariner Books, 2019

[Thomas] Thomas S, Reeves C, Bell A. Home Advantage in the Six Nations Rugby Union Tournament. Perceptual and Motor Skills. 2008;106(1):113-116

[Vergina] Roger C.Vergina, John J.Sosika, No place like home: an examination of the home field advantage in gambling strategies in NFL football, Journal of Economics and Business Volume 51, Issue 1, January–February 1999, Pages 21-31

Monday, December 7, 2020

How to (maybe) win at dice

Does God play dice with the universe?

Imagine I gave you an ordinary die, not special in any way, and I asked you to throw the die and record your results (how many 1s, how many 2s, etc.). What would you expect the results to be? Do you think you could win by choosing some numbers rather than others? Are you sure?

(Image source: Wikimedia Commons. Author: Diacritica. License: Creative Commons.)

What you might expect

Let's say you thew the die 12,000 times, you might expect a probability distribution something like this. This is a uniform distribution where all results are equally likely.

You know you'll never get an absolutely perfect distribution, so in reality, your results might look something like this for 12,000 throws. 

The deviations from the expected values are random noise which we can quantify. Further, we know that by adding more dice throws, random noise gets less and less and we approach the ideal uniform distribution more closely. 

I've simulated dice throws in the plots below, the top chart is 12,000 throws and the chart on the bottom is 120,000 throws. The blue bars represent the actual results, the black circle represents the expected value, and the black line is the 95% confidence interval. Note how the results for 120,000 throws are closer to the ideal than the results from 12,000 throws.


What happened in reality - not what you expect

My results are simulations, but what happens when you throw dice thousands of times in the real world?

There's a short history of probability theorists and statisticians throwing dice and recording the results.
  • Weldon threw 12 dice 26,306 times by hand and sent the results to his friend Francis Galton.
  • Iversen ran an experiment where 219 dice were rolled 20,000 times.
Weldon's data set is widely used to illustrate statistical concepts, especially after Pearson used it to explain his \(\chi^2\) technique in 1900. 

Despite the excitement you see at the craps tables in Las Vegas, throwing dice thousands of times is dull and is, therefore, an ideal job for a computer. In 2009, Zachariah Labby created apparatus for throwing dice and recording the scores using a camera and image processing. You can read more about his apparatus and experimental setup here. He 'threw' 12 dice 26,306 times and his machine recorded the results. 

In the chart below, the blue bars are his results, the black circle is the expected result, and the black line is the 95% confidence interval. I've taken the results from all 12 dice, so my throw count is \(12 \times 26,306\).
This doesn't look like a uniform distribution. To state the obvious, 1 and 6 occurred more frequently than theory would suggest - the deviation from the uniform distribution is statistically significant. The dice he used were not special dice, they were off-the-shelf standard unbiased dice. What's going on?

Unbiased dice are biased

Take a very close look at a normal die, the type pictured at the start of this post which are the kind of die you buy in shops.

By convention, opposite faces on dice sum to 7, so 1 is opposite 6, 3 is opposite to 4, and so on. Now look very closely again at the picture at the start of the post. Look at the dots on the face of the dice. Notice how they're indented. Each hole is the same size, but obviously, the number of holes on each face is different. Let's think of this in terms of weight. Imagine we could weigh each face of the dice. Let's pair up the faces, each side is paired with the face opposite it. Now let's weigh the faces and compare them.

The greatest imbalance in weights is the 1-6 combination. This imbalance is what's causing the bias.

Obviously, the bias is small, but if you roll the die enough times, even a small bias becomes obvious. 

Vegas here I come - or not...

So we know for dice bought in shops that 1 and 6 are ever so slightly more likely to occur than theory suggests. Now you know this, why aren't you booking your flight to Las Vegas? You could spend a week at the craps tables and make a little money.

Not so fast.

Let's look at the dice they use in Vegas.

(Image source: Wikimedia Commons. Author: Alper Atmaca License: Creative Commons.)

Notice that the dots are not indented. They're filled with colored material that's the same density as the rest of the dice. In other words, there's no imbalance, Vegas dice will give a uniform distribution, and 1 and 6 will occur as often as 2, 3, 4, or 5. You're going to have to keep punching the clock.

Some theory

Things are going to get mathematical from here on in. There won't be any new stories about dice or Vegas.

How did I get the expected count and error bars for each dice score? Let's say I threw the dice \(x\) times, it seems obvious we would get an expected count of \(\frac{x}{6}\) for each score, but why? What about the standard error?

Let's re-think the dice as a Bernoulli trial. Let's choose a score, say 1. If we throw the dice and it shows a 1, we consider that a success. If it shows anything else, we consider it a failure. Because we have a Bernoulli trial, we can use the binomial distribution to model the results.

  • \(n\) is the number of throws
  • \(p\) is the probability of getting a 1, which is \(\frac{1}{6}\)
  • \(q = 1- p\) is the probability of getting 2-6, which is \(\frac{5}{6}\)
So, again using Wikipedia's handy summary, for \(n\) throws:
  • The mean is \(np = 12 \times 26,306 \times \frac{1}{6} = 52,612\)
  • The standard deviation is \(\sqrt{npq} = \sqrt{12 \times 26,306 \times \frac{1}{6} \times \frac{5}{6}} = 209.388\) 
  • The 95% confidence interval is \(52,202 \) to \(53,022\) (standard deviation by 1.96).

Publications

Academics live or die by publications and by citations of their publications. Labby's work has rightly been widely cited on the internet. I keep hoping that some academic will be inspired by Labby and use modern robotic technology and image recognition to do huge (million-plus) classical experiments, like tossing coins or selecting balls from an urn. It seems like an easy win to be widely cited!

Sunday, November 29, 2020

Am I diseased? An introduction to Bayes theorem

What is Bayes' theorem and why is it so important?

Bayes' theorem is one of the key ideas of modern data science; it's enabling more accurate forecasting, it's leading to shorter A/B tests, and it's fundamentally changing statistical practices. In the last twenty years, Bayes' theorem has gone from being a cute probability idea to becoming central to many disciplines. Despite its huge impact, it's a simple statement of probabilities: what is the probability of an event occurring given some other event has occurred? How can something almost trivial be so revolutionary? Why all this change now? In this blog post, I'm going to give you a brief introduction to Bayes' theorem and show you why it's so powerful. 

(Bayes theorem. Source: Wikimedia Commons. Author: Matt Buck. License: Creative Commons.)

A disease example without explicitly using Bayes' theorem

To get going, I want to give you a motivating example that shows you the need for Bayes' theorem. I'm using this problem to introduce the language we'll need. I'll be using basic probability theory to solve this problem and you can find all the theory you need in my previous blog post on probability. This example is adapted from Wayne W. LaMorte's page at BU; he has some great material on probability and it's well worth your time browsing his pages. 

Imagine there's a town of 10,000 people. 1% of the town's population has a disease. Fortunately, there's a very good test for the disease:

  • If you have the disease, the test will give a positive result 99% of the time (sensitivity).
  • If you don't have the disease, the test will give a negative result 99% of the time (specificity).

You go into the clinic one day and take the test. You get a positive result. What's the probability you have the disease? Before you go on, think about your answer and the why behind it.

Let's start with some notation.

  • D+ and D- represent having the disease and not having the disease
  • T+ and T- represent testing positive and testing negative
  • P(D+) represents the probability of having the disease (with similar meanings for P(D-), P(T+), P(T-))
  • P(T+ | D+) is the probability of testing positive given that you have the disease.

We can write out what we know so far:

  • P(D+) = 0.01
  • P(T+ | D+) = 0.99
  • P(T- | D-) = 0.99

We want to know P(D+ | T+). I'm going to build a decision tree to calculate what I need.

There are 10,000 people in the town, and 1% of them have the disease. We can draw this in a tree diagram like so.

For each of the branches, D+ and D-, we can draw branches that show the test results T+ and T-:



For example, we know 100 people have the disease, of whom 99% will test positive, which means 1% will test negative. Similarly, for those who do not have the disease, (9,900), 99% will test negative (9,801), and 1% will test positive (99).

Out of 198 people who tested positive for the disease (P(T+) = P(T+ | D+) + P(T+ | D-)), 99 people have it, so P(D+ | T+) = 99/198. In other words, if I test positive for the disease, I have a 50% chance of actually having it.

There are two takeaways from all of this:
  • Wow! Really, only a 50% probability! I thought it would be much higher! (This is called the base rate fallacy).
  • This is a really tedious process and probably doesn't scale. Can we do better? (Yes: Bayes' theorem.)

Who was Bayes?

Thomas Bayes (1702-1761), was an English non-conformist minister (meaning a protestant minister not part of the established Church of England). His religious duties left him time for mathematical exploration, which he did for his own pleasure and amusement; he never published in his lifetime in his own name. After his death, his friend and executor, Richard Price, went through his papers and found an interesting result, which we now call Bayes theorem.  Price presented it at the Royal Society and the result was shared with the mathematical community. 


(Plaque commemorating Thomas Bayes. Source: Wikimedia Commons Author:Simon Harriyott License: Creative Commons.)

For those of you who live in London, or visit London, you can visit the Thomas Bayes memorial in the historic Bunhill Cemetery where Bayes is buried. For the true probability pilgrim, it might also be worth visiting Richard Price's grave which is only a short distance away.

Bayes' theorem

The derivation of Bayes' theorem is almost trivial. From basic probability theory:

\[P(A  \cap B) = P(A) P(B | A)\]
\[P(A \cap B) =  P(B \cap A)\]

With some re-arranging we get the infamous theorem:

\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]

Although this is the most compact version of the theorem, it's more usefully written as:

\[P(A | B) = \frac{P(B | A) P(A)}{P(B \cap A) + P(B \cap \bar A)} = \frac{P(B | A)P(A)}{P(B | A)P(A) + P(B | \bar A) P( \bar A)}\]

where \(\bar A\) means not A (remember \(1 = P(A) + P(\bar A)\)). You can get this second form of Bayes using the law of total probability and the multiplication rule (see my previous blog post).

So what does it all mean and why is there so much excitement over something so trivial?

What does Bayes' theorem mean?

The core idea of Bayesian statistics is that we update our prior beliefs as new data becomes available - we go from the prior to the posterior. This process is often iterative and is called the diachronic interpretation of Bayes theorem. It usually requires some computation; something that's reasonable to do given today's computing power and the free availability of numeric computing languages. This form of Bayes is often written:

\[P(H | D) = \frac{P(D | H) P(H)}{P(D)}\]

with these definitions:

  • P(H) - the probability of the hypothesis before the new data - often called the prior
  • P(H | D) - the probability of the hypothesis after the data - the posterior
  • P(D | H) - the probability of the data under the hypothesis, the likelihood
  • P(D) - the probability of the data, it's called the normalizing constant

A good example of the use of Bayes' theorem is its use to better quantify the health risk an individual faces from a disease. Let's say the risk of suffering a heart attack in any year is P(HA), however, this is for the population as a whole (the prior). If someone smokes, the probability becomes P(HA | S), which is the posterior, which may be considerably different from P(HA). 

Let's use some examples to figure out how Bayes works in practice.

The disease example using Bayes

Let's start from this version of Bayes:
\[P(A | B) = \frac{P(B | A)P(A)}{P(B | A)P(A) + P(B | \bar A) P( \bar A)}\]
and use the notation from our disease example:
\[P(D+ | T+) = \frac{P(T+ | D+)P(D+)}{P(T+ | D+)P(D+) + P(T+ | D-) P( D-)}\]
Here's what we know from our previous disease example:
  • P(D+) = 0.01 and by implication P(D-) = 0.99
  • P(T+ | D+) = 0.99 
  • P(T- | D-) = 0.99 and by implication P(T+ | D-) = 0.01

Plugging in the numbers:

\[P(D+ | T+) = \frac{0.99\times0.01}{0.99\times0.01 + 0.01\times0.99} = 0.5\]

The decision tree is easier for a human to understand, but if there are a large number of conditions, it becomes much harder to use. For a computer on the other hand, the Bayes solution is straightforward to code and it's expandable for a large number of conditions.

Predicting US presidential election results

I've blogged a lot about this, but not about using Bayesian methods. The basic concepts are fairly simple.

  • To predict a winner, you need to model the electoral college, which implies a state-by-state forecast.
  • For each state, you know who won last time, so you have a prior in the Bayesian sense.
  • In competitive states, there are a number of opinion polls that provide evidence of voter intention, this is the data or normalizing constant in Bayes-speak.

In practice, you start with a state-by-state prior based on previous elections or fundamentals, or something else. As opinion polls are published, you calculate a posterior probability for each of the parties to win the state election. Of course, you do this with Bayes theorem. As more polls come in, you update your model and the influence of your prior becomes less and less. In some versions of this type of modeling work, models take into account national polling trends too.

The landmark paper describing this type of modeling is by Linzer.

Using Bayes' theorem to prove the existence of God

Over history, there have been many attempts to prove the existence of God using scientific or mathematical methods. All of them have floundered for one reason or another. Interestingly, one of the first uses of Bayes' theorem was to try and prove the existence of God by proving miracles can happen. The argument was put forward by Richard Price himself. I'm going to repeat his analysis using modern notation, based on an explanation from Cornell University.

Price's argument is based on tides. We expect tides to happen every day, but if a tide doesn't happen, that would be a miracle. If T is the consistency of tides, and M is a miracle (no tide), then we can use Bayes theorem as:

\[P(M | T) = \frac{P(T | M) P(M)}{P(T | M) P(M) + P(T | \bar M) P(\bar M)}\]

Price assumed the probability of miracles existing was the same as the probability of miracles not existing (!), so \(P(M) = P(\bar M)\). If we plug this into the equation above and simplify, we get:

\[P(M | T) = \frac{P(T | M)}{P(T | M) + P(T | \bar M)}\]

He further assumed that if miracles exist, they would be very rare (or we would see them all the time), so:

\[P(T | \bar M) >> P(T | M)\]

he further assumed that \(P(T | M) = 1e^{-6}\) - in other words, if a miracle exists, it would happen 1 time in 1 million. He also assumed that if there were no miracles, tides would always happen, so \(P(T | \bar M) = 1\). The upshot of all this is that:

\[P(M | T) = 0.000001\]

or, there's a 1 in a million chance of a miracle happening.

There are more holes in this argument than in a teabag, but it is an interesting use of Bayes' theorem and does give you some indication of how it might be used to solve other problems.

Monty Hall and Bayes

The Monty Hall problem has tripped people up for decades (see my previous post on the problem). Using Bayes' theorem, we can rigorously solve it.

Here's the problem. You're on a game show hosted by Monty Hall and your goal is to win the car. He shows you three doors and asks you to choose one. Behind two of the doors are goats and behind one of the doors is a car. Once you've chosen your door, Monty opens one of the other doors to show you what's behind it. He always chooses a door with a goat behind it. Next, he asks you the key question: "do you want to change doors?". Should you change doors and why?

I'm going to use the diachronic interpretation of Bayes theorem to figure out what happens if we don't change:

\[P(H | D) = \frac{P(D | H) P(H)}{P(D)} = \frac{P(D | H) P(H)}{P(D | H)P(H) + P(D | \bar H) P( \bar D)}\]
  • \(P(H)\) is the probability our initial choice of door has a car behind it, which is \(\frac{1}{3}\).
  • \( P( \bar H) = 1- P(H) = \frac{2}{3} \)
  • \(P(D | H) = 1\) this is the probability Monty will show me a door with a goat given that I have chosen the door with a car - it's always 1 because Monty always shows me the door with a goat
  • \(P(D | \bar H) = 1\) this is the probability Monty will show me a door with a goat given that I have chosen the door with a goat - it's always 1 because Monty always shows me the door with a goat,

Plugging these numbers in:

\[P(H | D) = \frac{1 \times \frac{1}{3}}{1 \times \frac{1}{3} + 1 \times \frac{2}{3}} = \frac{1}{3}\]

If we don't change, then the probability of winning is the same as if Monty hadn't opened the other door. But there are only two doors, and \(P(\bar H) + P(H) = 1\). In turn, this means our winning probability if we switch is \(\frac{2}{3}\), so our best strategy is switching.

Searching for crashed planes and shipwrecks

On 1st June 2009, Air France Flight AF 447 crashed into the Atlantic. Although the flight had been tracked, the underwater search for the plane was complex. The initial search used Bayesian inference to try and locate where on the ocean floor the plane might be. It used data from previous crashes that assumed the underwater locator beacon was working. Sadly, the initial search didn't find the plane.

In 2011, a new team re-examined the data, with two crucial differences. Firstly, they had data from the first search, and secondly, they assumed the underwater locator beacon had failed. Again using Bayesian inference, they pointed to an area of ocean that had already been searched. The ocean was searched again (with the assumption the underwater beacon had failed), and this time the plane was found.

You can read more about this story in the MIT Technology Review and for more in-depth details, you can read the paper by the team that did the analysis.

It turns out, there's quite a long history of analysts using Bayes theorem to locate missing ships. In this 1971 paper, Richardson and Stone show how it was used to locate the wreckage of the USS Scorpion. Since then, a number of high-profile wrecks have been located using similar methods.

Sadly, even Bayes theorem hasn't led to anyone finding flight MH370. 

Other examples of Bayes theorem

Bayes has been applied in many, many disciplines. I'm not going to give you an exhaustive list, but I will give you some of the more 'fun' ones.

Why now?

Using Bayes theorem can involve a lot of fairly tedious arithmetic. If the problem requires many iterations, there are lots of tedious calculations. This held up the adoption of Bayesian methods until three things happened:

  • Cheap computing.
  • The free and easy availability of mathematical computing languages.
  • Widespread skill to program in these languages.

By the late 1980's, computing power was sufficiently cheap to make Bayesian methods viable, and of course, computing has only gotten cheaper since then. Good quality mathematical languages were available by the late 1980's too (e.g. Fortran, MATLAB), but by the 2010s, Python and R had all the necessary functionality and were freely and easily available. Both Python and R usage had been growing for a while, but by the 2010s, there was a very large pool of people who were fluent in them.

As they say in murder mysteries, by the 2010s, Bayesian methods had the means, the motive, and the opportunity.

Bayes and the remaking of statistics

Traditional (non-Bayesian) statistics are usually called frequentist statistics. It has a long history and has been very successful, but it has problems. In the last 50 years, Bayesian analysis has become more successful and is now challenging frequentist statistics. 

I'm not going to provide an in-depth critique of frequentist statistics here, but I will give you a high-level summary of some of the problems.

  • p-values and significance levels are prone to misunderstandings - and the choice of significance levels is arbitrary
  • Much of the language surrounding statistical tests is complex and rests on convention rather than underlying theory
  • The null hypothesis test is frequently misunderstood and misinterpreted
  • Prior information is mostly ignored.

Bayesian methods help put statistics on a firmer intellectual foundation, but the price is changing well understood and working frequentist statistics. In my opinion, over the next twenty years, we'll see Bayesian methods filter down to undergraduate level and gradually replace the frequentist approach. But for right now, the frequentists rule.

Conclusion

At its heart, Bayes' theorem is almost trivial, but it's come to represent a philosophy and approach to statistical analysis that modern computing has enabled; it's about updating your beliefs with new information. A welcome side-effect is that it's changing statistical practice and putting it on a firmer theoretical foundation. Widespread change to Bayesian methods will take time, however, especially because frequentist statistics are so successful.

Reading more

Monday, November 23, 2020

Chance encounters of the third kind: understanding probability

Why look back at basic probability?

Bayes' theorem lies at the heart of much of modern machine learning. Although it's relatively simple to understand, you do need some grounding in probability theory. This blog post is all about getting you up close and personal with probability theory so I can tell you all about Bayes in a later post.

(You can work out the probability aliens are on earth given that Elvis lives. Image source: Pixabay Author: Pete Linforth License: Pixabay.)

The very basics

Think of some event that might occur in the future, say winning the lottery, buying a new car, or England winning the World Cup. We can estimate the probability of these events happening; we can call the event A and the probability of the event occurring P(A). If the event is certain to occur, then P(A) =1, if it's certain not to occur, then P(A) = 0, and in all cases: 0 \(\leq \) P(A) \(\leq \) 1.

We'll consider the probability of several events I'm going to call A, B, C, etc. These can be any events at all, including aliens landing, Elvis making a comeback, or getting a pay raise at the end of the year.

The complementary rule

If the probability of an event A occurring is P(A), the probability of it not occurring is \(1 - P(A)\). This is called the complement and different authors use different notation for it:
\[1 - P(A) = P(A^c) = P(A-) = P( \bar A) = P( \raise.25ex\hbox{$\scriptstyle\sim$} A) \]
Let me give you an example using one notation. Imagine 1% of the population have a disease and 99% don't, then:
\[1 = P(D+) + P(D-) = 0.01 + 0.99\]

Independence

Independence is a huge issue in probability modeling and it can lead to big errors if not handled correctly. On the face of it, it's a simple idea, but there are subtleties.

Two events are independent if one does not affect or influence the other in any way (alternatively, one event does not give any information about the other). For example, the odds of Joe Biden winning the 2020 Presidential election do not depend on the odds of New Zealand opening its borders to international travelers. Looking at things the other way, the odds of me winning the lottery are dependent on my purchasing a ticket (I have to buy a ticket to stand any chance of winning) - these are dependent events. I'm sure you can think of many other examples.

Independent and dependent events are treated very differently mathematically, the big mistake comes when events that are not independent are considered to be independent. For example, an organization might run many opinion polls in an election. The errors in the polls will not be independent of one another because the organization may well have a systemic bias that affects all their polls. There are similar problems in epidemiology; if you and I live together, my probability of catching an infectious disease is not independent of your probability of catching an infectious disease. The most famous example of confusing independent and dependent events was the sub-prime mortgage scandals of 2008 onwards. The analysts who developed the sub-prime mortgage default models assumed that mortgage defaults were independent of one another. Unfortunately for all of us, that wasn't the case in 2008. Economic conditions led to many defaults, which in turn led to broader financial problems, which in turn led to more defaults. In 2008 and onwards, sub-prime mortgage defaults were dependent on one another.

Disjoint (mutually exclusive) events

Two events are disjoint if they're mutually exclusive, in other words, if both can't happen. For example, only one of Joe Biden or Donald Trump can win the election - they both can't be President. In notation I'll explain later: \( P(A  \ and \  B) = P(A \cap B) = 0\).

Probability A and B occurring (intersection) - the multiplication rule

What's the probability of A and B occurring (also known as their joint or conjoint probability)? Here's where we run into some notation issues. Some sources write 'and' and some use the symbol '\(\cap\)' - both mean the same thing.

Here's the rule for dependent events:
\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B | A)\]
Here's the rule for independent events:
\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]
Here's the rule for disjoint events:
\[ P(A \ and \ B) = P(A \cap B) = 0\]
The and relationship is commutative:
\[P(A \cap B) = P(B \cap A)\]

Probability of A or B occurring (union) - the addition rule

What's the probability of A or B occurring? Some sources write 'or' and some write '\(\cup\)'. Here's the rule:
\[P(A \ or \ B) = P(A  \cup B) = P(A) +  P(B) - P(A \cap B) \]

\[= P(A) + P(B) - P(A)P(B | A)\]

The or relationship is commutative:
\[P(A \cup B) = P(B \cup A)\]

For disjoint events, the addition rule simplifies to:

\[P(A \ and \ B) = P(A  \cup B) = P(A) +  P(B)  \]

because from before we have:

\[P(A \cap B) =  0\]

Conditional probability - the conditional rule

What's the probability I have a disease given I've tested positive for the disease? We use the | symbol to mean "given that", so P(A | B) means the probability of A happening given that B has occurred.  Here are some examples from everyday life:
  • What's the probability I win the lottery given that I've bought a ticket?
  • What's the probability I will get a degree if I go to college?
  • What's the probability I will have an accident if I'm driving and if it's snowing and if it's dark?
The interesting thing about conditional probability is that it can be quite different from the 'raw' probability. For example, let's say you're from a poor family, you might only have a 10% chance of getting a degree, but if you get accepted to a college, the probability might shoot up to 50%, and if you actually go to college, the probability may get to 95%. The probability can change quite substantially depending on new information (as we'll see with Bayes' theorem).

The general rule is:
\[P(A | B) = \frac{P(B \cap A)}{P(B)}\]
If A and B are independent (A does not depend on B), then P(A | B) = P(A).

The law of total probability

There's a general form of this law and a more specific form. Because the specific form will be useful for Bayesian work later, we'll start with that.
\[P(A+) = P(A+ \cap \ B+) + P(A+ \cap \ B-)\]
In words, the probability of an event A+ occurring is the probability of the event A+ occurring and the event B+ occurring plus the probability of event A+ occurring and the probability of event B+ not occurring (B-). This might be clearer if we remember \(1 = P(B+) + P(B-)\) and we think of probabilities using a Venn diagram.

The more general form of this law is:
\[P(A) = \sum_i{P(A \cap B_i)} = \sum_i{P(A | B_i)P(B_i)}\]

The law of total probability and conditional probabilities

One of the most useful forms of Bayes' theorem relies on the combination of the law of total probability and conditional probability. Here's the key relationship:
\[1 = P(A | B) + P(\bar A | B)\]
Let me put this into words. If event B happens, then either A or not A happens, there are no other options, so the two probabilities must sum to 1.

What use is probability theory?

I grew up hearing about the value of 'common sense', but probability theory often gives results that seem very counterintuitive and 'common sense' can lead you wildly astray. A fun example is the Monty Hall problem, but there are lots of other examples in the real world where the probability of something happening is not what it appears to be at first - and they're not so fun. The counter-intuitive example you find most often on the internet is the probability that you have a disease given a positive test result; it's mostly not what you think.

Bayes' theorem takes us into the world of the counter-intuitive and I'll talk about Bayes in a future blog post.

Monday, September 21, 2020

The gamblers' fallacy

An offer you can't refuse?

Imagine you're in a casino playing craps, a game where you bet on the outcome of two dice thrown at the same time. The probability of a double six coming up is 1/36, but no-one has thrown a double six for over 110 throws. The table is starting to get crowded and noisy with people betting on a double six. It's due to come up, and it must come up soon. 

(Still no double six. Source: Wikimedia Commons. License: Creative Commons. Author: Gaz.)

A new player rolls the dice; snake-eyes (double ones) - still no double six.

You feel a tap on your elbow. A lady in a cocktail dress whispers to you that she'll give you odds of 20 to 1 for a double six.

Another player rolls the dice; easy-four (one and three) - the expectation for a double six mounts.

Your new friend whispers that she'll reduce the odds soon; she asks if you want to take the bet.

It's now 130 throws since a double six has occurred and it should have occurred 3 or 4 times by now.

Do you take the bet?

The gambler's fallacy 

The gambler's fallacy is the belief that the outcome of a random event is somehow influenced by the previous random events. In our craps case, some examples might be:

  • double six hasn't come up in 130 throws, so it's much more likely to come up now (the probability is higher than 1/36)
  • double one has just come up, therefore it's not likely to come up again soon (the probability is less than 1/36).

It's a fallacy because each roll of the dice is completely independent; it doesn't matter what the previous throws were. There could have been 1,000 throws without a double six, but the probability of a double six will always be 1/36. The same thing for the snake-eyes example, if a snake-eyes has been thrown, the probability of throwing another snake-eyes immediately after is still 1/36.

Let me lay this out even more starkly, in craps:

  • At the very first roll of the dice, the probability of a double six is 1/36.
  • After ten rolls of the dice, the probability of the next roll being a double six are 1/36.
  • After 100 rolls without a double six, the probability of the next roll being a double six is 1/36.
  • After 200 rolls without a double six, the probability of the next roll being a double six is 1/36.
  • After 1,000 rolls without a double six, the probability of the next roll being a double six is 1/36.

Otherwise rational people are fooled by the gambler's fallacy all the time. As the money increases and the emotion heightens, the gambler's fallacy becomes easier and easier to fall for, as we'll see.

The Italian lottery

The story starts in Venice, Italy in May 2003. The Venice lottery was a game where 6 numbered balls (plus a bonus ball) were selected from a set of 90 numbered balls. The lottery was run twice a week. Each number should come out on average once every 7-8 weeks. As with all government-sponsored lotteries, the results were well-publicized.

In May 2003, the number 53 came up. Then it didn't come up again.

By October, people realized the number 53 was overdue. They started to gamble on 53 occurring - it was overdue, so it must come up. But 53 just didn't come up.

News of the 53 drought started to spread, and more and more Italians started to bet that 53 would occur, but it didn't. It didn't come up in November or December either. 

In January of 2004, a woman from Carrara committed suicide because she'd spent her family's life-saving gambling that 53 would come up. It didn't.

Still, 53 didn't come up.

People went crazy betting money that 53 would come up, they became known as '53 addicts'. They were sure it must come up. Sadly, it didn't. A man from Signa shot his wife, his son, and himself after losing money gambling on 53.

Still, 53 didn't come up.

Italians gambled and lost a huge amount of money on 53, an estimated 4 billion Euros. They had fallen for the gambler's fallacy and believed that 53 must come up soon.

Eventually, 53 did come up - in February 2005, after 182 draws (remember, each draw was seven balls).

The Venice lottery made a lot of money, but the Italian gamblers did not.

How the cocktail dress lady (and casinos) makes money

To understand if the cocktail dress lady was offering a good deal, we need to relate probability to odds.

The probability of a double six is 1/36.

The odds are the ratio of the probability the event will occur divided by the probability the event will not occur:

\[odds = \frac{P}{1-P}\]

The odds of a double six are:

\[ odds_{66} = \frac{\frac{1}{36}}{\frac{35}{36}} = \frac {1}{35}\]

which a bookie might quote as 35 to 1.

Generally speaking, casinos and bookies make money in one of two ways:

  • The probabilities don't add up to 1.
  • They rely on the gamblers' fallacy and offer worse odds than a fair analysis would suggest.

Let's imaging there are ten horses in a race. Each horse has a 10% chance of winning, which are odds of 9 to 1. If you win, you get your stake money back, so a winning bet of $1 gives you $10. If ten punters bet $1 on each horse, the bookie takes $10, but one of the horses must win, so the bookie pays out $10. 

(Bookmakers make money. You don't. Image source: Wikimedia Commons. License: Creative Commons. Author: Grand Island Tourism )

To make money, the bookie reduces the odds. Instead of offering 9 to 1 on each of the horses, the bookie offers 8 to 1. The bookie still takes in $10, but this time only pays out $9. In the real world, it's more complicated, but you get the idea.

The other way to make money is to underprice probabilities. A double six should be offered at 35 to 1, but you could offer it at 20 to 1. This is a horrible deal, but if gamblers have a bad case of the gambler's fallacy, they may be convinced the probability is much higher than 1/36 and they may even view a horrible deal as the deal of a lifetime. The casino, or the lady in the cocktail dress, make money by knowing the odds and knowing when to offer a deal that seems attractive, but isn't.

Not only should you not accept the 20 to 1 offer, you should offer it to other players.

Gambler's fallacy in Reno, Nevada and Monte Carlo

Obviously, there are naive gamblers in Las Vegas, but do people really fall for the gamblers' fallacy at the roulette table? After all, you have to have some level of sophistication to understand and play the game, so surely gamblers are savvy and know how to price bets appropriately? It seems that they don't always.

Using videotape data supplied by a casino in Reno, Nevada, two researchers tracked the pattern of gambling on roulette. If gamblers have fallen for the gambler's fallacy, you might expect to see certain patterns of betting, for example, if red hasn't come up as often as expected, they might bet more on red. The researchers found small, but significant examples of the gambler's fallacy The reality is then, there are people who fall for the fallacy, even those playing a sophisticated game like roulette.

(Image source: Wikimedia Commons. License: Creative Commons. Author: Ken Lund.)

Another object lesson in the gambler's fallacy occurred at a roulette table in a casino, this time at a casino in Monte Carlo. In 1943, the ball landed on red 32 times in a row. The people who thought black must come up were cleaned out. 

The gamblers' fallacy elsewhere

The gambler's fallacy has been an active area of research for some time. Variations of it have been found in different places:

Let's imagine you're an asylum judge. You're aware of the average 'success' rate for applicants and you don't want to be too far from the average. Let's assume the cases are randomly assigned (deserving and undeserving). By random chance, you might get a long string of deserving or undeserving cases, maybe as many as twenty in a row. The gambler's fallacy may kick in after a series of similar cases, for example, the first ten cases were deserving, so the eleventh 'must' be undeserving, as a result, you judge more harshly based on expectation.

The gambler's fallacy in business

If you listen closely enough, you hear business people make the gambler's fallacy all the time. How often have you heard these kinds of phrases:

  • We've won the last 8 contracts, so we must win the next one.
  • We just failed to land the last 6 deals, so the odds of us landing the next deal are high.

Despite what people say, business can be strongly driven by belief and not rationality. If everyone needs a deal to be landed, then the collective view might become that a deal will be landed, regardless of what a realistic measure of the probabilities are.

How to guard against the gamblers' fallacy

There's something about humanity and our (mis)understanding of statistics that makes us vulnerable to the gambler's fallacy. The best teacher might be experience. How many Italians who bet on 53 would do so again? There's some evidence that the gambler's fallacy is particularly strong when the data evolves over time, which ties in with the Italian lottery and casino examples. Perhaps the best defense is to take a step back and view the data as a whole, then make a decision away from the influence of others.

The existence of opulent casinos should be a lesson that those who understand probability can make money from those who do not.

Wednesday, August 12, 2020

Who will win the election? Election victory probabilities from opinion polls

Polls to probabilities

How likely is it that your favorite candidate will win the election? If your candidate is ahead of their opponent by 5%, are they certain to win? What about 10%? Or if they're down by 2%, are they out of the race? Victory probabilities are related to how far ahead or behind a candidate is in the polls, but the relationship isn't a simple one and has some surprising consequences as we'll see.

Opinion poll example

Let's imagine there's a hard-fought election between candidates A and B. A newspaper publishes an opinion poll a few days before the election:

  • Candidate A: 52%
  • Candidate B: 48%
  • Sample size: 1,000

Should candidate A's supporters pop the champagne and candidate B's supporters start crying?

The spread and standard error

Let's use some standard notation. From the theory of proportions, the mean and standard error for the proportion of respondents who chose A is:

\[ p_a = {n_a \over n} \] \[ \sigma_a = { \sqrt {{p_a(1-p_a)} \over n}} \]

where \( n_a \) is the number of respondents who chose A and \( n \) is the total number of respondents. If the proportion of people who answered candidate B is \(p_b\), then obviously, \( p_a + p_b = 1\).

Election probability theory usually uses the spread, \(d\), which is the difference between the candidates: \[d = p_a - p_b = 2p_a - 1 \] From statistics theory, the standard error of \( d \)  is: \[\sigma_d = 2\sigma_a\] (these relationships are easy to prove, but a bit tedious, if anyone asks, I'll show the proof.)

Obviously, for a candidate to win, their spread, \(d\), must be > 0.

Everything is normal

From the central limit theorem (CLT), we know \(p_a\) and \(p_b\) are normally distributed, and also from the CLT, we know \(d\) is normally distributed. The next step to probability is viewing the normal distribution for candidate A's spread. The chart below shows the normal distribution with mean \(d\) and standard error \(\sigma_d\).

As with most things with the normal distribution, it's easier if we transform everything to the standard normal using the transformation: \[z = {(x - d) \over \sigma_d}\] The chart below is the standard normal representation of the same data.

The standard normal form of this distribution is a probability density function. We want the probability that \(d>0\) which is the light green shaded area, so it's time to turn to the cumulative distribution function (CDF), and its complement, the complementary cumulative distribution function (CCDF).

CDF and CCDF

The CDF gives us the probability that we will get a result less than or equal to some value I'll label \(z_c\). We can write this as: \[P(z \leq z_c) = CDF(z_c) = \phi(z_c) \] The CCDF is defined so that: \[1 = P(z \leq z_c) + P(z > z_c)= CDF(z_c) + CCDF(z_c) = \phi(z_c) + \phi_c(z_c)\] Which is a long-winded way of saying the CCDF is defined as:  \[CCDF(z_c) = P(z_c \gt 0) = \phi_c(z_c)\]

The CDF is the integral of the PDF, and from standard textbooks: \[ \phi(z_c) = {1 \over 2} \left( 1 + erf\left( {z_c \over \sqrt2} \right) \right) \] We want the CCDF,  \(P(z > z_c)\), which is simply 1 - CDF.

Our critical value occurs when the spread is zero. The transformation to the standard normal in this case is: \[z_c = {(x - d) \over \sigma_d} = {-d \over \sigma_d}\] We can write the CCDF as: \[\phi_c(z_c) = 1 - \phi(z_c) = 1- {1 \over 2} \left( 1 + erf\left( {z_c \over \sqrt2} \right) \right)\ \] \[= 1 - {1 \over 2} \left( 1 + erf\left( {-d \over {\sigma_d\sqrt2}} \right) \right)\] We can easily show that: \[erf(x) = -erf(-x)\] Using this relationship, we can rewrite the above equation as: \[ P(d > 0) = {1 \over 2} \left( 1 + erf\left( {d \over {\sigma_d\sqrt2}} \right) \right)\]

What we have is an equation that takes data we've derived from an opinion poll and gives us a probability of a candidate winning.

Probabilities for our example

For candidate A:

  • \(n=1000\)
  • \( p_a = {520 \over 1000} = 0.52 \)
  • \(\alpha_a = 0.016 \)
  • \(d = {{520 - 480} \over 1000} = 0.04\)
  • \(\alpha_d = 0.032\)
  • \(P(d > 0) = 90\%\)

For candidate B:

  • \(n=1000\)
  • \( p_b = {480 \over 1000} = 0.48 \)
  • \(\alpha_b = 0.016 \)
  • \(d = {{480 - 520} \over 1000} = -0.04\)
  • \(\alpha_d = 0.032\)
  • \(P(d > 0) = 10\%\)

Obviously, the two probabilities add up to 1. But note the probability for candidate A. Did you expect a number like this? A 4% point lead in the polls giving a 90% chance of victory?

Some consequences

Because the probability is based on \( erf \), you can quite quickly get to highly probable events as I'm going to show in an example. I've plotted the probability for candidate A for various leads (spreads) in the polls. Most polls nowadays tend to have about 800 or so respondents (some are more and some are a lot less), so I've taken 800 as my poll size. Obviously, if the spread is zero, the election is 50%:50%. Note how quickly the probability of victory increases as the spread increases.

What about the size of the poll, how does that change things? Let's fix the spread to 2% and vary the size of the poll from 200 to 2,000 (the usual upper and lower bounds on poll sizes). Here's how the probability varies with poll size for a spread of 2%.

Now imagine you're a cynical and seasoned poll analyst working on candidate A's campaign. The young and excitable intern comes rushing in, shouting to everyone that A is ahead in the polls! You ask the intern two questions, and then, like the Oracle at Delphi, you predict happiness or not. What two questions do you ask?

  • What's the spread?
  • What's the size of the poll?

What's missing

There are two elephants in the room, and I've been avoiding talking about them. Can you guess what they are?

All of this analysis assumes the only source of error is random noise. In other words, that there's no systemic bias. In the real world, that's not true. Polls aren't wholly based on random sampling, and the sampling method can introduce bias. I haven't modeled it at all in this analysis. There are at least two systemic biases:

  • Pollster house effects arising from house sampling methods
  • Election effects arising from different population groups voting in different ways compared to previous elections.

Understanding and allowing for bias is key to making a successful election forecast. This is an advanced topic for another blog post.

The other missing item is more subtle. It's undecided voters. Imagine there are two elections and two opinion polls. Both polls have 1,000 respondents.

Election 1:

  • Candidate A chosen by 20%
  • Candidate B chosen by 10%
  • Undecided voters are 70%
  • Spread is 10%
Election 2:

  • Candidate A chosen by 55%
  • Candidate B chosen by 45%
  • Undecided voters are 0%
  • Spread is 10%
In both elections, the spread from the polls is 10%, so candidate A has the same higher chance of winning in both elections, but this doesn't seem right. Intuitively, we should be less certain about an election with a high number of undecided voters. Modeling undecided voters is a topic for another blog post!

Reading more

The best source of election analysis I've read is in the book "Introduction to data science" and the associated edX course "Inference and modeling", both by Rafael Irizarry. The analysis in this blog post was culled from multiple books and websites, each of which only gave part of the story.

If you liked this post, you might like these ones