Sunday, December 13, 2020

What's a probability distribution?

Why should you care about probability distributions?

Using the wrong probability distribution can be extremely expensive for businesses:

  • for businesses using machinery (factories, vehicles, aircraft, etc.), it can lead to parts being changed too frequently or too infrequently
  • for businesses relying on returning customers, it can lead to substantial under or over-estimates of revenue and/or targeting the wrong customers with promotions
  • for businesses forecasting future sales by territory and/or product, it can lead to poor territory allocation or poor product resource allocation.

Given that it's so important, what is a probability distribution, and what are some examples?

What's a probability distribution?

At its simplest, a probability distribution tells you how likely an outcome is given some input. For example, how is sales probability distributed by price, or how likely is a component to fail in the next month? 

If something is certain to occur, the probability is 1; if it's certain not to occur, the probability is 0. Let's imagine a component lasts a maximum of 6 months (180 days) before failure. Our probability distribution might show the probability of failure on each of days 1 to 180, and those daily probabilities must sum to 1.
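To make that concrete, here's a minimal sketch in Python (NumPy assumed; the failure weights are invented purely for illustration) of a discrete failure distribution over 180 days that sums to 1:

    import numpy as np

    days = np.arange(1, 181)
    weights = days.astype(float)          # invented weights: failure gets more likely as the part ages
    p_failure = weights / weights.sum()   # normalize so the probabilities sum to 1

    print(p_failure.sum())                # 1.0
    print(p_failure[:3])                  # probability of failing on days 1, 2, and 3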

In the real world, data is noisy and we don't expect real data to exactly follow theoretical distributions, but given enough data, the match should be close enough for us to reason about what's going on.

Uniform distribution - gambling and manufacturing

If the probability is the same for all input values, the distribution is uniform.

Let's imagine we're manufacturing candy, and we want to have equal numbers of red, blue, green, black, and white sweets in a packet. In theory, here's what we should observe.

But in reality, there's random noise so we might see something like this below. We can quantify the difference between the expected distribution and the actual distribution, which tells us something about the variability in the manufacturing process. 
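Here's a minimal sketch of that idea in Python (NumPy and SciPy assumed; the packet size and seed are arbitrary). It simulates one packet of sweets, compares the observed color counts with the uniform expectation, and uses a chi-square statistic to quantify the difference:

    import numpy as np
    from scipy.stats import chisquare

    rng = np.random.default_rng(42)
    colors = ["red", "blue", "green", "black", "white"]

    # Simulate a packet of 100 sweets where every color is equally likely.
    packet = rng.choice(colors, size=100)
    observed = np.array([(packet == c).sum() for c in colors])
    expected = 100 / len(colors)          # 20 of each color under a perfect uniform distribution

    print("expected per color:", expected)
    print("observed counts:   ", dict(zip(colors, observed)))

    # Quantify the gap between the observed and expected counts.
    stat, p_value = chisquare(observed, f_exp=[expected] * len(colors))
    print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")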

The uniform distribution also occurs in gambling, for example, lotteries or dice games.

Reading more

Uniform distribution description by NIST

Binomial distribution - pass/fail and conversion

Each customer who comes into a store or who visits a website will either buy or not buy, which we can turn into a conversion rate. We can model these kinds of pass/fail processes using the binomial distribution. Here's the probability distribution.

The binomial distribution shows us the probability of measuring different results given an underlying 'truth'. Let's imagine the 'true' conversion rate is 0.04. We might not measure 0.04 due to sampling error; instead, we might measure 0.045 or 0.055, depending on how many samples we take (the short simulation after this list makes this concrete). It's important to understand what this means: 

  • There is uncertainty in our measurement. 
  • The smaller the sample, the bigger the uncertainty.
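Here's a minimal sketch of that sampling error in Python (NumPy assumed; the 'true' rate of 0.04 comes from the example above, and the sample sizes and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    true_rate = 0.04                      # the underlying 'true' conversion rate

    # Repeatedly measure the conversion rate with different numbers of visitors.
    for n_visitors in (100, 1_000, 100_000):
        rates = rng.binomial(n_visitors, true_rate, size=1_000) / n_visitors
        print(f"{n_visitors:>7} visitors: measured rates span "
              f"{rates.min():.4f} to {rates.max():.4f} (sd {rates.std():.5f})")

    # The smaller the sample, the wider the spread of measured conversion rates.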

Although many technical people understand this, most non-technical people do not, which can lead to tension.

Reading more

Yale stats

Poisson distribution - waiting in line

Imagine you're a bank serving customers with ATMs at a location. ATMs are expensive, but you don't want to keep people waiting in long lines to do their transactions - that's bad for business. So how do you balance the cost of an ATM against its use? By modeling how many people use the ATM over a given time period.

It turns out, the number of people who visit an ATM over a time period can be modeled using the Poisson distribution, which I've shown below. This gives us a way of assessing how much variation there might be in usage and therefore how many machines we might want to install. 

The Poisson distribution is often used to model counting processes. It's very attractive because it has an unusual feature: its standard deviation is \(\sqrt{\lambda}\), where \(\lambda\) is the mean. Unfortunately, this property makes it a little too attractive; it's sometimes used when it shouldn't be.
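Here's a minimal sketch of how this might look for the ATM example in Python (NumPy and SciPy assumed; the average of 12 customers per hour is invented for illustration):

    import numpy as np
    from scipy.stats import poisson

    rng = np.random.default_rng(1)
    mean_arrivals = 12                                  # assumed average customers per hour

    arrivals = rng.poisson(mean_arrivals, size=10_000)  # simulated hourly visit counts
    print("sample mean:", arrivals.mean())              # close to 12
    print("sample std: ", arrivals.std())               # close to sqrt(12), about 3.46

    # How many customers should we plan for in 95% of hours?
    print("95th percentile of hourly demand:", poisson.ppf(0.95, mean_arrivals))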

Reading more

The Poisson Distribution and Poisson Process Explained

Exponential distribution

How long does a car battery last? How long do phone calls last? When will the next earthquake occur? These durations typically follow the exponential distribution (which is strongly related to the Poisson distribution). I've shown this distribution below.
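Here's a minimal sketch of the Poisson connection in Python (NumPy assumed; the call rate of 12 per hour is invented): if events arrive as a Poisson process, the gaps between them follow an exponential distribution, and that distribution is 'memoryless'.

    import numpy as np

    rng = np.random.default_rng(2)
    rate = 12                                              # assumed 12 calls per hour

    gaps = rng.exponential(scale=1 / rate, size=100_000)   # hours between consecutive calls
    print("mean gap (hours):", gaps.mean())                # close to 1/12

    # Memoryless property: P(gap > 20 min | already waited 10 min) is roughly P(gap > 10 min)
    ten_min = 10 / 60
    p_gt_10 = (gaps > ten_min).mean()
    p_gt_20_given_10 = (gaps > 2 * ten_min).sum() / (gaps > ten_min).sum()
    print(p_gt_10, p_gt_20_given_10)                       # approximately equal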

Reading more

The exponential distribution

Power law distribution - finding fraud

How are incomes distributed in a population? How might you find fraud in the pattern of digits in expenses? It turns out, the distribution of the first digits in invoices follows a power-law distribution. The chart below shows a generic power-law distribution - for fraud detection, it's 'flipped'.
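The first-digit pattern usually used for this kind of fraud screening is Benford's law, where the expected frequency of leading digit \(d\) is \(\log_{10}(1 + 1/d)\). Here's a minimal sketch in Python (NumPy assumed) of the expected frequencies you'd compare invoice amounts against:

    import numpy as np

    # Benford's law: expected frequency of each leading digit in many real-world datasets.
    digits = np.arange(1, 10)
    benford = np.log10(1 + 1 / digits)
    for d, p in zip(digits, benford):
        print(f"leading digit {d}: expected frequency {p:.3f}")

    # To screen invoices, compare the observed leading-digit frequencies of the amounts
    # against these expected values; large deviations are worth investigating.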

Reading more

Power law distribution

Normal distribution - almost everywhere, but not quite

What's the probability distribution for male soldiers' chest measurements? How are the results of A/B tests distributed? What about the distribution of measurement errors? All these, and many, many more follow the normal distribution, which is also called the Gaussian distribution or the bell curve. If you only learn one distribution, this is the one to learn. 

The properties of this distribution are extremely well-known, and every student of statistics and probability theory will know them. It's ubiquitous because of something called the Central Limit Theorem, which, simplifying a great deal, says that the sum (or mean) of many independent samples from almost any distribution tends towards a normal distribution.
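Here's a minimal sketch of the Central Limit Theorem in action in Python (NumPy assumed; the choice of a uniform source distribution and 50 draws per sum is arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)

    # Each row is the sum of 50 draws from a decidedly non-normal (uniform) distribution.
    sums = rng.uniform(0, 1, size=(100_000, 50)).sum(axis=1)

    mean, sd = sums.mean(), sums.std()
    within_one_sd = ((sums > mean - sd) & (sums < mean + sd)).mean()
    print("mean of sums:", mean)                   # close to 25
    print("fraction within 1 sd:", within_one_sd)  # close to 0.68, as for a normal distribution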

Because it's everywhere, for some people, it's the only distribution they know. As the old saying goes, if all you have is a hammer, everything looks like a nail. This distribution can be over-used, with bad consequences.

Here's the distribution. It ought to look familiar.

Reading more

The normal distribution

Lognormal distribution

How long do visitors spend on web pages? What about the distribution of internet traffic? Or the distribution of city sizes? These all follow a log-normal distribution that looks like the example below. The lognormal distribution is quite common in business.


Note the 'fat tail' or 'long tail' on the right-hand side. Many businesses have been caught out because they assumed sales or market risk followed a normal distribution when in fact they followed a lognormal distribution.

There's a variation of the Central Limit Theorem that yields log-normal distributions instead of normal distributions.
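Here's a minimal sketch of that multiplicative version in Python (NumPy assumed; the growth factors are invented): multiplying many small positive random factors together produces a right-skewed, roughly lognormal result.

    import numpy as np

    rng = np.random.default_rng(4)

    # Each row multiplies 50 random growth factors together (e.g. daily growth rates).
    factors = rng.uniform(0.9, 1.2, size=(100_000, 50))
    outcomes = factors.prod(axis=1)

    print("mean:", outcomes.mean(), "median:", np.median(outcomes))  # mean > median: fat right tail
    print("mean of log(outcomes):", np.log(outcomes).mean())         # the logs look normal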

Reading more

Other distributions

There are lots and lots of different distributions. I saw a list of 90 the other day. Almost all of them are esoteric and apply in a very limited set of cases. You don't have to know all of them, but you should be aware that choosing the right distribution matters if you want to make correct estimates. The distributions I've listed in this blog post are probably the most important, and you should know them and their properties.

As you asked nicely, here is a list of some distributions.

Alpha Distribution
Anglit Distribution
Arcsine Distribution
Beta Distribution
Beta Prime Distribution
Bradford Distribution
Burr Distribution
Burr12 Distribution
Cauchy Distribution
Chi Distribution
Chi-squared Distribution
Cosine Distribution
Double Gamma Distribution
Double Weibull Distribution
Erlang Distribution
Exponential Distribution
Exponentiated Weibull Distribution
Exponential Power Distribution
Fatigue Life (Birnbaum-Saunders) Distribution
Fisk (Log Logistic) Distribution
Folded Cauchy Distribution
Folded Normal Distribution
Fratio (or F) Distribution
Gamma Distribution
Generalized Logistic Distribution
Generalized Pareto Distribution
Generalized Exponential Distribution
Generalized Extreme Value Distribution
Generalized Gamma Distribution
Generalized Half-Logistic Distribution
Generalized Inverse Gaussian Distribution
Generalized Normal Distribution
Gilbrat Distribution
Gompertz (Truncated Gumbel) Distribution
Gumbel (LogWeibull, Fisher-Tippetts, Type I Extreme Value) Distribution
Gumbel Left-skewed (for minimum order statistic) Distribution
HalfCauchy Distribution
HalfNormal Distribution
Half-Logistic Distribution
Hyperbolic Secant Distribution
Gauss Hypergeometric Distribution
Inverted Gamma Distribution
Inverse Normal (Inverse Gaussian) Distribution
Inverted Weibull Distribution
Johnson SB Distribution
Johnson SU Distribution
KSone Distribution
KStwo Distribution
KStwobign Distribution
Laplace (Double Exponential, Bilateral Exponential) Distribution
Left-skewed Lévy Distribution
Lévy Distribution
Logistic (Sech-squared) Distribution
Log Double Exponential (Log-Laplace) Distribution
Log Gamma Distribution
Log Normal (Cobb-Douglass) Distribution
Log-Uniform Distribution
Maxwell Distribution
Mielke’s Beta-Kappa Distribution
Nakagami Distribution
Noncentral chi-squared Distribution
Noncentral F Distribution
Noncentral t Distribution
Normal Distribution
Normal Inverse Gaussian Distribution
Pareto Distribution
Pareto Second Kind (Lomax) Distribution
Power Log Normal Distribution
Power Normal Distribution
Power-function Distribution
R-distribution Distribution
Rayleigh Distribution
Rice Distribution
Reciprocal Inverse Gaussian Distribution
Semicircular Distribution
Student t Distribution
Trapezoidal Distribution
Triangular Distribution
Truncated Exponential Distribution
Truncated Normal Distribution
Tukey-Lambda Distribution
Uniform Distribution
Von Mises Distribution
Wald Distribution
Weibull Maximum Extreme Value Distribution
Weibull Minimum Extreme Value Distribution
Wrapped Cauchy Distribution

Continuous or discrete - shaken or stirred?

Some quantities are discrete and some are continuous. A discrete quantity is something like a sales territory (e.g. Germany, Ireland, Spain) or a customer count (you can't have 0.5 of a customer). A continuous quantity can take any value; for example, speed can be 45.2 kph, 120.01 kph, and so on. Some distributions apply to discrete quantities, some to continuous quantities, and some to both. To muddy the waters, sometimes continuous distributions are used to approximately model discrete quantities.

Business examples

Vehicles

Imagine you're running a delivery vehicle fleet. You need to keep your vehicles on the road, but you also need to keep an eye on maintenance costs. You decide to use math to guide your decisions, so you work out the average lifetime for different components. You have two components, A and B, with the same average lifetimes in miles. If either component fails, you have to tow the vehicle, which is very expensive.

  • Component A. Lifetime is 150,000 miles.
  • Component B. Lifetime is 150,000 miles.

A vehicle comes in for maintenance with 149,000 miles on the odometer. Should you replace components A and B?

As you might expect, there's a gotcha. Without knowing the probability distribution for failures, we can't make these decisions. For example, a windshield might have a uniform failure rate distribution, with the probability of failure for miles 1-100 the same as the probability of failure for miles 100,000-100,100. A clutch may have a failure rate that increases with mileage, the probability of failure at miles 100,000-100,100 being much higher than the probability of failure at miles 0-100. Because we know what a clutch and a windshield are, we might decide to replace the clutch and leave the windshield. But what if A and B were a serpentine belt and a heat shield?

The only way to make rational decisions is to understand what distribution the probability of failure follows, which may well be very different for different components (e.g. car seats vs. tires).
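Here's a minimal sketch of the difference in Python (SciPy assumed; the distributions and their parameters are invented for illustration). It compares a constant-hazard component (exponential lifetime) with a wear-out component (Weibull lifetime, shape > 1), both with an average lifetime of roughly 150,000 miles, and asks how likely each is to fail in the next 5,000 miles given it has already survived 149,000.

    from scipy.stats import expon, weibull_min

    mean_life = 150_000
    current = 149_000
    horizon = 5_000

    # Component A: failures equally likely at any mileage (constant hazard).
    comp_a = expon(scale=mean_life)
    # Component B: wear-out failures (hazard rises with mileage); scale set so the mean is ~150k.
    comp_b = weibull_min(c=4, scale=mean_life / 0.9064)

    for name, dist in (("A (constant hazard)", comp_a), ("B (wear-out)", comp_b)):
        p_next = (dist.cdf(current + horizon) - dist.cdf(current)) / dist.sf(current)
        print(f"{name}: P(fails in next {horizon:,} miles | survived {current:,}) = {p_next:.3f}")

    # The wear-out component is noticeably more likely to fail soon, even though
    # both components have the same average lifetime.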

Marketing

A new analyst is studying the market for luxury goods in Germany. They have partial data for the fraction of the population that have a certain income. Using what they have, they assume their data is normally distributed and they make a forecast for the fraction of the population that will have an income high enough to afford luxury items. Do you think their forecast will be too low, just right, or too high?

Incomes are usually log-normally distributed, so the analyst, in this case, has chosen the wrong distribution. Because the lognormal has a very long right tail, the analyst's estimate is likely to be an underestimate and may be substantially out. A competitor might not make the same mistake.
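Here's a minimal sketch of the size of the mistake in Python (NumPy and SciPy assumed; the income figures and threshold are invented). It compares the fraction of the population above a high income threshold under a normal assumption and under a lognormal with the same mean and standard deviation:

    import numpy as np
    from scipy.stats import norm, lognorm

    mean_income, sd_income = 40_000, 30_000   # hypothetical population statistics
    threshold = 150_000                       # income needed to afford the luxury goods

    # Fraction above the threshold if we (wrongly) assume a normal distribution.
    p_normal = norm.sf(threshold, loc=mean_income, scale=sd_income)

    # Fraction above the threshold for a lognormal with the same mean and sd.
    sigma2 = np.log(1 + (sd_income / mean_income) ** 2)
    mu = np.log(mean_income) - sigma2 / 2
    p_lognormal = lognorm.sf(threshold, s=np.sqrt(sigma2), scale=np.exp(mu))

    print(f"normal assumption:    {p_normal:.5%} of the population")
    print(f"lognormal assumption: {p_lognormal:.5%} of the population")
    # The lognormal's fat right tail puts far more people above the threshold.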

Takeaways

I've interviewed people who claim data science skills on their resumes but only know the normal distribution. If you assume your data is normal when in reality it's log-normal or Poisson, things are going to go badly wrong for you. Any analyst in business needs to be very comfortable with different distributions and needs to know which may be applicable and when.

Monday, December 7, 2020

How to (maybe) win at dice

Does God play dice with the universe?

Imagine I gave you an ordinary die, not special in any way, and I asked you to throw the die and record your results (how many 1s, how many 2s, etc.). What would you expect the results to be? Do you think you could win by choosing some numbers rather than others? Are you sure?

(Image source: Wikimedia Commons. Author: Diacritica. License: Creative Commons.)

What you might expect

Let's say you threw the die 12,000 times; you might expect a probability distribution something like this. This is a uniform distribution where all results are equally likely.

You know you'll never get an absolutely perfect distribution, so in reality, your results might look something like this for 12,000 throws. 

The deviations from the expected values are random noise that we can quantify. Further, we know that by adding more dice throws, random noise gets less and less and we approach the ideal uniform distribution more closely. 

I've simulated dice throws in the plots below, the top chart is 12,000 throws and the chart on the bottom is 120,000 throws. The blue bars represent the actual results, the black circle represents the expected value, and the black line is the 95% confidence interval. Note how the results for 120,000 throws are closer to the ideal than the results from 12,000 throws.
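For anyone who wants to reproduce this kind of result, here's a minimal sketch of the simulation in Python (NumPy assumed; the seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)

    for n_throws in (12_000, 120_000):
        throws = rng.integers(1, 7, size=n_throws)        # fair die: faces 1 to 6
        counts = np.bincount(throws)[1:]                   # observed count for each face
        expected = n_throws / 6
        sd = np.sqrt(n_throws * (1 / 6) * (5 / 6))         # binomial standard deviation
        print(f"{n_throws:,} throws: expected {expected:,.0f} ± {1.96 * sd:,.0f} (95% CI)")
        print("observed counts for faces 1-6:", counts)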


What happened in reality - not what you expect

My results are simulations, but what happens when you throw dice thousands of times in the real world?

There's a short history of probability theorists and statisticians throwing dice and recording the results.

  • Weldon threw 12 dice 26,306 times by hand and sent the results to his friend Francis Galton.
  • Iversen ran an experiment where 219 dice were rolled 20,000 times.

Weldon's data set is widely used to illustrate statistical concepts, especially after Pearson used it to explain his \(\chi^2\) technique in 1900. 

Despite the excitement you see at the craps tables in Las Vegas, throwing dice thousands of times is dull and is, therefore, an ideal job for a computer. In 2009, Zachariah Labby created apparatus for throwing dice and recording the scores using a camera and image processing. You can read more about his apparatus and experimental setup here. He 'threw' 12 dice 26,306 times and his machine recorded the results. 

In the chart below, the blue bars are his results, the black circle is the expected result, and the black line is the 95% confidence interval. I've taken the results from all 12 dice, so my throw count is \(12 \times 26,306\).

This doesn't look like a uniform distribution. To state the obvious, 1 and 6 occurred more frequently than theory would suggest - the deviation from the uniform distribution is statistically significant. The dice he used were not special in any way; they were off-the-shelf, standard unbiased dice. What's going on?

Unbiased dice are biased

Take a very close look at a normal die, the kind pictured at the start of this post and the kind you can buy in shops.

By convention, opposite faces on a die sum to 7, so 1 is opposite 6, 2 is opposite 5, and 3 is opposite 4. Now look very closely at the picture at the start of the post, and in particular at the dots on the faces. Notice how they're indented. Each hole is the same size, but the number of holes on each face is different, which means different amounts of material have been removed from different faces. Let's think of this in terms of weight: pair up each face with the face opposite it, then weigh the two faces of each pair and compare them.


The greatest imbalance in weights is the 1-6 combination. This imbalance is what's causing the bias.

Obviously, the bias is small, but if you roll the die enough times, even a small bias becomes obvious. 

Vegas here I come - or not...

So we know for dice bought in shops that 1 and 6 are ever so slightly more likely to occur than theory suggests. Now you know this, why aren't you booking your flight to Las Vegas? You could spend a week at the craps tables and make a little money.

Not so fast.

Let's look at the dice they use in Vegas.

(Image source: Wikimedia Commons. Author: Alper Atmaca License: Creative Commons.)

Notice that the dots are not indented. They're filled with colored material that has the same density as the rest of the die. In other words, there's no imbalance: Vegas dice give a uniform distribution, and 1 and 6 will occur as often as 2, 3, 4, or 5. You're going to have to keep punching the clock.

Some theory

Things are going to get mathematical from here on in. There won't be any new stories about dice or Vegas.

How did I get the expected count and error bars for each score? Let's say I threw the die \(n\) times; it seems obvious we would get an expected count of \(\frac{n}{6}\) for each score, but why? And what about the standard error?

Let's re-think the throws as Bernoulli trials. Let's choose a score, say 1. If we throw the die and it shows a 1, we consider that a success. If it shows anything else, we consider it a failure. Because we have Bernoulli trials, we can use the binomial distribution to model the results.

Using Wikipedia's notation:

  • \(n\) is the number of throws
  • \(p\) is the probability of getting a 1, which is \(\frac{1}{6}\)
  • \(q = 1- p\) is the probability of getting 2-6, which is \(\frac{5}{6}\)

So, again using Wikipedia's handy summary, for \(n\) throws:

  • The mean is \(np = 12 \times 26,306 \times \frac{1}{6} = 52,612\)
  • The standard deviation is \(\sqrt{npq} = \sqrt{12 \times 26,306 \times \frac{1}{6} \times \frac{5}{6}} = 209.388\) 
  • The 95% confidence interval is \(52,202\) to \(53,022\) (the mean plus or minus 1.96 standard deviations); the short check below reproduces these numbers.
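Here's a minimal check of those numbers in Python (only the standard library is needed):

    import math

    n = 12 * 26_306            # total number of throws in Labby's experiment
    p, q = 1 / 6, 5 / 6

    mean = n * p
    sd = math.sqrt(n * p * q)
    print(mean, sd)                                   # 52612.0 and about 209.4
    print(mean - 1.96 * sd, mean + 1.96 * sd)         # roughly 52,202 to 53,022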

Publications

Academics live or die by publications and by citations of their publications. Labby's work has rightly been widely cited on the internet. I keep hoping that some academic will be inspired by Labby and use modern robotic technology and image recognition to do huge (million-plus) classical experiments, like tossing coins or selecting balls from an urn. It seems like an easy win to be widely cited!

Sunday, November 29, 2020

Am I diseased? An introduction to Bayes theorem

What is Bayes' theorem and why is it so important?

Bayes' theorem is one of the key ideas of modern data science; it's enabling more accurate forecasting, it's leading to shorter A/B tests, and it's fundamentally changing statistical practices. In the last twenty years, Bayes' theorem has gone from being a cute probability idea to becoming central to many disciplines. Despite its huge impact, it's a simple statement of probabilities: what is the probability of an event occurring given some other event has occurred? How can something almost trivial be so revolutionary? Why all this change now? In this blog post, I'm going to give you a brief introduction to Bayes' theorem and show you why it's so powerful. 

(Bayes theorem. Source: Wikimedia Commons. Author: Matt Buck. License: Creative Commons.)

A disease example without explicitly using Bayes' theorem

To get going, I want to give you a motivating example that shows you the need for Bayes' theorem. I'm using this problem to introduce the language we'll need. I'll be using basic probability theory to solve this problem and you can find all the theory you need in my previous blog post on probability. This example is adapted from Wayne W. LaMorte's page at BU; he has some great material on probability and it's well worth your time browsing his pages. 

Imagine there's a town of 10,000 people. 1% of the town's population has a disease. Fortunately, there's a very good test for the disease:

  • If you have the disease, the test will give a positive result 99% of the time (sensitivity).
  • If you don't have the disease, the test will give a negative result 99% of the time (specificity).

You go into the clinic one day and take the test. You get a positive result. What's the probability you have the disease? Before you go on, think about your answer and the why behind it.

Let's start with some notation.

  • D+ and D- represent having the disease and not having the disease
  • T+ and T- represent testing positive and testing negative
  • P(D+) represents the probability of having the disease (with similar meanings for P(D-), P(T+), P(T-))
  • P(T+ | D+) is the probability of testing positive given that you have the disease.

We can write out what we know so far:

  • P(D+) = 0.01
  • P(T+ | D+) = 0.99
  • P(T- | D-) = 0.99

We want to know P(D+ | T+). I'm going to build a decision tree to calculate what I need.

There are 10,000 people in the town, and 1% of them have the disease. We can draw this in a tree diagram like so.

For each of the branches, D+ and D-, we can draw branches that show the test results T+ and T-:



For example, 100 people have the disease: 99% of them (99 people) will test positive and 1% (1 person) will test negative. Similarly, of the 9,900 people who do not have the disease, 99% (9,801) will test negative and 1% (99) will test positive.

Out of the 198 people who tested positive (99 with the disease plus 99 without), 99 actually have the disease, so P(D+ | T+) = 99/198. In other words, if I test positive for the disease, I have only a 50% chance of actually having it.

There are two takeaways from all of this:

  • Wow! Really, only a 50% probability! I thought it would be much higher! (This is called the base rate fallacy).
  • This is a really tedious process and probably doesn't scale. Can we do better? (Yes: Bayes' theorem.)

Who was Bayes?

Thomas Bayes (1702-1761), was an English non-conformist minister (meaning a protestant minister not part of the established Church of England). His religious duties left him time for mathematical exploration, which he did for his own pleasure and amusement; he never published in his lifetime in his own name. After his death, his friend and executor, Richard Price, went through his papers and found an interesting result, which we now call Bayes' theorem.  Price presented it at the Royal Society and the result was shared with the mathematical community. 


(Plaque commemorating Thomas Bayes. Source: Wikimedia Commons Author:Simon Harriyott License: Creative Commons.)

For those of you who live in London, or visit London, you can visit the Thomas Bayes memorial in the historic Bunhill Cemetery where Bayes is buried. For the true probability pilgrim, it might also be worth visiting Richard Price's grave which is only a short distance away.

Bayes' theorem

The derivation of Bayes' theorem is almost trivial. From basic probability theory:

\[P(A  \cap B) = P(A) P(B | A)\]
\[P(A \cap B) =  P(B \cap A)\]

With some re-arranging we get the infamous theorem:

\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]

Although this is the most compact version of the theorem, it's more usefully written as:

\[P(A | B) = \frac{P(B | A) P(A)}{P(B \cap A) + P(B \cap \bar A)} = \frac{P(B | A)P(A)}{P(B | A)P(A) + P(B | \bar A) P( \bar A)}\]

where \(\bar A\) means not A (remember \(1 = P(A) + P(\bar A)\)). You can get this second form of Bayes using the law of total probability and the multiplication rule (see my previous blog post).

So what does it all mean and why is there so much excitement over something so trivial?

What does Bayes' theorem mean?

The core idea of Bayesian statistics is that we update our prior beliefs as new data becomes available - we go from the prior to the posterior. This process is often iterative and is called the diachronic interpretation of Bayes theorem. It usually requires some computation; something that's reasonable to do given today's computing power and the free availability of numeric computing languages. This form of Bayes is often written:

\[P(H | D) = \frac{P(D | H) P(H)}{P(D)}\]

with these definitions:

  • P(H) - the probability of the hypothesis before the new data - often called the prior
  • P(H | D) - the probability of the hypothesis after the data - the posterior
  • P(D | H) - the probability of the data under the hypothesis, the likelihood
  • P(D) - the probability of the data, it's called the normalizing constant

A good example of the use of Bayes' theorem is using it to better quantify the health risk an individual faces from a disease. Let's say the risk of suffering a heart attack in any year is P(HA); this is the risk for the population as a whole (the prior). If someone smokes, the probability becomes P(HA | S), the posterior, which may be considerably different from P(HA). 

Let's use some examples to figure out how Bayes works in practice.

The disease example using Bayes

Let's start from this version of Bayes:

\[P(A | B) = \frac{P(B | A)P(A)}{P(B | A)P(A) + P(B | \bar A) P( \bar A)}\]

and use the notation from our disease example:

\[P(D+ | T+) = \frac{P(T+ | D+)P(D+)}{P(T+ | D+)P(D+) + P(T+ | D-) P( D-)}\]

Here's what we know from our previous disease example:

  • P(D+) = 0.01 and by implication P(D-) = 0.99
  • P(T+ | D+) = 0.99 
  • P(T- | D-) = 0.99 and by implication P(T+ | D-) = 0.01

Plugging in the numbers:

\[P(D+ | T+) = \frac{0.99\times0.01}{0.99\times0.01 + 0.01\times0.99} = 0.5\]

The decision tree is easier for a human to understand, but if there are a large number of conditions, it becomes much harder to use. For a computer on the other hand, the Bayes solution is straightforward to code and it's expandable for a large number of conditions.
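Here's a minimal sketch of that in Python (the function name and arguments are mine, not a standard API):

    def posterior(prior, sensitivity, specificity):
        """P(D+ | T+) for a binary test, straight from Bayes' theorem."""
        p_pos_given_disease = sensitivity            # P(T+ | D+)
        p_pos_given_healthy = 1 - specificity        # P(T+ | D-)
        numerator = p_pos_given_disease * prior
        return numerator / (numerator + p_pos_given_healthy * (1 - prior))

    print(posterior(prior=0.01, sensitivity=0.99, specificity=0.99))   # 0.5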

Predicting US presidential election results

I've blogged a lot about this, but not about using Bayesian methods. The basic concepts are fairly simple.

  • To predict a winner, you need to model the electoral college, which implies a state-by-state forecast.
  • For each state, you know who won last time, so you have a prior in the Bayesian sense.
  • In competitive states, there are a number of opinion polls that provide evidence of voter intention; these polls are the data (the evidence) in Bayes-speak.

In practice, you start with a state-by-state prior based on previous elections or fundamentals, or something else. As opinion polls are published, you calculate a posterior probability for each of the parties to win the state election. Of course, you do this with Bayes theorem. As more polls come in, you update your model and the influence of your prior becomes less and less. In some versions of this type of modeling work, models take into account national polling trends too.
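To give a flavor of the mechanics (this is a toy illustration, not Linzer's model), here's a minimal sketch in Python (SciPy assumed; the prior and poll numbers are invented) of updating a prior for one state's two-party vote share with a new poll:

    from scipy.stats import beta

    # Toy prior for one state: the candidate won about 52% of the two-party vote last time.
    # Beta(52, 48) encodes that belief with the weight of roughly 100 'pseudo-voters'.
    prior_a, prior_b = 52, 48

    # A new poll arrives: 480 of 1,000 two-party respondents back the candidate.
    poll_for, poll_against = 480, 520

    posterior = beta(prior_a + poll_for, prior_b + poll_against)
    print("posterior mean vote share:   ", posterior.mean())
    print("probability of winning state:", posterior.sf(0.5))

    # As more polls arrive, they are added in the same way and the prior matters less and less.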

The landmark paper describing this type of modeling is by Linzer.

Using Bayes' theorem to prove the existence of God

Over history, there have been many attempts to prove the existence of God using scientific or mathematical methods. All of them have floundered for one reason or another. Interestingly, one of the first uses of Bayes' theorem was to try and prove the existence of God by proving miracles can happen. The argument was put forward by Richard Price himself. I'm going to repeat his analysis using modern notation, based on an explanation from Cornell University.

Price's argument is based on tides. We expect tides to happen every day, but if a tide doesn't happen, that would be a miracle. If T is the consistency of tides, and M is a miracle (no tide), then we can use Bayes theorem as:

\[P(M | T) = \frac{P(T | M) P(M)}{P(T | M) P(M) + P(T | \bar M) P(\bar M)}\]

Price assumed the probability of miracles existing was the same as the probability of miracles not existing (!), so \(P(M) = P(\bar M)\). If we plug this into the equation above and simplify, we get:

\[P(M | T) = \frac{P(T | M)}{P(T | M) + P(T | \bar M)}\]

He further assumed that if miracles exist, they would be very rare (or we would see them all the time), so:

\[P(T | \bar M) >> P(T | M)\]

He further assumed that \(P(T | M) = 10^{-6}\) - in other words, if a miracle exists, it would happen 1 time in 1 million. He also assumed that if there were no miracles, tides would always happen, so \(P(T | \bar M) = 1\). The upshot of all this is that:

\[P(M | T) = 0.000001\]

or, there's a 1 in a million chance of a miracle happening.

There are more holes in this argument than in a teabag, but it is an interesting use of Bayes' theorem and does give you some indication of how it might be used to solve other problems.

Monty Hall and Bayes

The Monty Hall problem has tripped people up for decades (see my previous post on the problem). Using Bayes' theorem, we can rigorously solve it.

Here's the problem. You're on a game show hosted by Monty Hall and your goal is to win the car. He shows you three doors and asks you to choose one. Behind two of the doors are goats and behind one of the doors is a car. Once you've chosen your door, Monty opens one of the other doors to show you what's behind it. He always chooses a door with a goat behind it. Next, he asks you the key question: "do you want to change doors?". Should you change doors and why?

I'm going to use the diachronic interpretation of Bayes theorem to figure out what happens if we don't change:

\[P(H | D) = \frac{P(D | H) P(H)}{P(D)} = \frac{P(D | H) P(H)}{P(D | H)P(H) + P(D | \bar H) P( \bar H)}\]
  • \(P(H)\) is the probability our initial choice of door has a car behind it, which is \(\frac{1}{3}\).
  • \( P( \bar H) = 1- P(H) = \frac{2}{3} \)
  • \(P(D | H) = 1\): this is the probability Monty will show me a door with a goat given that I have chosen the door with the car - it's always 1 because Monty always shows me a door with a goat.
  • \(P(D | \bar H) = 1\): this is the probability Monty will show me a door with a goat given that I have chosen a door with a goat - again, it's always 1 because Monty always shows me a door with a goat.

Plugging these numbers in:

\[P(H | D) = \frac{1 \times \frac{1}{3}}{1 \times \frac{1}{3} + 1 \times \frac{2}{3}} = \frac{1}{3}\]

If we don't change, the probability of winning is \(\frac{1}{3}\) - the same as if Monty hadn't opened the other door. But only two doors remain closed, and \(P(\bar H) + P(H) = 1\), so the probability of winning if we switch is \(\frac{2}{3}\). Our best strategy is to switch.
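If you'd rather check the result by brute force, here's a minimal simulation sketch in Python (standard library only; the trial count and seed are arbitrary):

    import random

    random.seed(6)
    trials = 100_000
    wins_stick = wins_switch = 0

    for _ in range(trials):
        car = random.randint(0, 2)                  # door hiding the car
        choice = random.randint(0, 2)               # our initial choice
        # Monty opens a door that is neither our choice nor the car.
        monty = next(d for d in range(3) if d != choice and d != car)
        switched = next(d for d in range(3) if d != choice and d != monty)
        wins_stick += (choice == car)
        wins_switch += (switched == car)

    print("P(win | stick): ", wins_stick / trials)    # close to 1/3
    print("P(win | switch):", wins_switch / trials)   # close to 2/3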

Searching for crashed planes and shipwrecks

On 1st June 2009, Air France Flight AF 447 crashed into the Atlantic. Although the flight had been tracked, the underwater search for the plane was complex. The initial search used Bayesian inference to try and locate where on the ocean floor the plane might be. It used data from previous crashes that assumed the underwater locator beacon was working. Sadly, the initial search didn't find the plane.

In 2011, a new team re-examined the data, with two crucial differences. Firstly, they had data from the first search, and secondly, they assumed the underwater locator beacon had failed. Again using Bayesian inference, they pointed to an area of ocean that had already been searched. The ocean was searched again (with the assumption the underwater beacon had failed), and this time the plane was found.

You can read more about this story in the MIT Technology Review and for more in-depth details, you can read the paper by the team that did the analysis.

It turns out, there's quite a long history of analysts using Bayes theorem to locate missing ships. In this 1971 paper, Richardson and Stone show how it was used to locate the wreckage of the USS Scorpion. Since then, a number of high-profile wrecks have been located using similar methods.

Sadly, even Bayes' theorem hasn't led to anyone finding flight MH370. 

Other examples of Bayes' theorem

Bayes has been applied in many, many disciplines. I'm not going to give you an exhaustive list, but I will give you some of the more 'fun' ones.

Why now?

Using Bayes' theorem can involve a lot of fairly tedious arithmetic, and if the problem requires many iterations, the calculations multiply quickly. This held up the adoption of Bayesian methods until three things happened:

  • Cheap computing.
  • The free and easy availability of mathematical computing languages.
  • A large pool of people with the skills to program in these languages.

By the late 1980s, computing power was sufficiently cheap to make Bayesian methods viable, and of course, computing has only gotten cheaper since then. Good quality mathematical languages were available by the late 1980s too (e.g. Fortran, MATLAB), but by the 2010s, Python and R had all the necessary functionality and were freely and easily available. Both Python and R usage had been growing for a while, but by the 2010s, there was a very large pool of people who were fluent in them.

As they say in murder mysteries, by the 2010s, Bayesian methods had the means, the motive, and the opportunity.

Bayes and the remaking of statistics

Traditional (non-Bayesian) statistics is usually called frequentist statistics. It has a long history and has been very successful, but it has problems. In the last 50 years, Bayesian analysis has become more and more successful and is now challenging the frequentist approach. 

I'm not going to provide an in-depth critique of frequentist statistics here, but I will give you a high-level summary of some of the problems.

  • p-values and significance levels are prone to misunderstandings - and the choice of significance levels is arbitrary
  • Much of the language surrounding statistical tests is complex and rests on convention rather than underlying theory
  • The null hypothesis test is frequently misunderstood and misinterpreted
  • Prior information is mostly ignored.

Bayesian methods help put statistics on a firmer intellectual foundation, but the price is moving away from well-understood and working frequentist methods. In my opinion, over the next twenty years, we'll see Bayesian methods filter down to the undergraduate level and gradually replace the frequentist approach. But for right now, the frequentists rule.

Conclusion

At its heart, Bayes' theorem is almost trivial, but it's come to represent a philosophy and approach to statistical analysis that modern computing has enabled; it's about updating your beliefs with new information. A welcome side-effect is that it's changing statistical practice and putting it on a firmer theoretical foundation. Widespread change to Bayesian methods will take time, however, especially because frequentist statistics are so successful.

Reading more

Monday, November 23, 2020

Chance encounters of the third kind: understanding probability

Why look back at basic probability?

Bayes' theorem lies at the heart of much of modern machine learning. Although it's relatively simple to understand, you do need some grounding in probability theory. This blog post is all about getting you up close and personal with probability theory so I can tell you all about Bayes in a later post.

(You can work out the probability aliens are on earth given that Elvis lives. Image source: Pixabay Author: Pete Linforth License: Pixabay.)

The very basics

Think of some event that might occur in the future, say winning the lottery, buying a new car, or England winning the World Cup. We can estimate the probability of these events happening; we can call the event A and the probability of the event occurring P(A). If the event is certain to occur, then P(A) =1, if it's certain not to occur, then P(A) = 0, and in all cases: 0 \(\leq \) P(A) \(\leq \) 1.

We'll consider the probability of several events I'm going to call A, B, C, etc. These can be any events at all, including aliens landing, Elvis making a comeback, or getting a pay raise at the end of the year.

The complementary rule

If the probability of an event A occurring is P(A), the probability of it not occurring is \(1 - P(A)\). This is called the complement and different authors use different notation for it:

\[1 - P(A) = P(A^c) = P(A-) = P( \bar A) = P( \raise.25ex\hbox{$\scriptstyle\sim$} A) \]

Let me give you an example using one notation. Imagine 1% of the population has a disease and 99% don't, then:

\[1 = P(D+) + P(D-) = 0.01 + 0.99\]

Independence

Independence is a huge issue in probability modeling and it can lead to big errors if not handled correctly. On the face of it, it's a simple idea, but there are subtleties.

Two events are independent if one does not affect or influence the other in any way (alternatively, one event does not give any information about the other). For example, the odds of Joe Biden winning the 2020 Presidential election do not depend on the odds of New Zealand opening its borders to international travelers. Looking at things the other way, the odds of me winning the lottery are dependent on my purchasing a ticket (I have to buy a ticket to stand any chance of winning) - these are dependent events. I'm sure you can think of many other examples.

Independent and dependent events are treated very differently mathematically, the big mistake comes when events that are not independent are considered to be independent. For example, an organization might run many opinion polls in an election. The errors in the polls will not be independent of one another because the organization may well have a systemic bias that affects all their polls. There are similar problems in epidemiology; if you and I live together, my probability of catching an infectious disease is not independent of your probability of catching an infectious disease. The most famous example of confusing independent and dependent events was the subprime mortgage scandals of 2008 onwards. The analysts who developed the subprime mortgage default models assumed that mortgage defaults were independent of one another. Unfortunately for all of us, that wasn't the case in 2008. Economic conditions led to many defaults, which in turn led to broader financial problems, which in turn led to more defaults. In 2008 and onwards, sub-prime mortgage defaults were dependent on one another.

Disjoint (mutually exclusive) events

Two events are disjoint if they're mutually exclusive; in other words, they can't both happen. For example, only one of Joe Biden or Donald Trump can win the election - they can't both be President. In notation I'll explain later: \( P(A  \ and \  B) = P(A \cap B) = 0\).

Probability A and B occurring (intersection) - the multiplication rule

What's the probability of A and B occurring (also known as their joint or conjoint probability)? Here's where we run into some notation issues. Some sources write 'and' and some use the symbol '\(\cap\)' - both mean the same thing.

Here's the rule for dependent events:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B | A)\]

Here's the rule for independent events:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]

Here's the rule for disjoint events:

\[ P(A \ and \ B) = P(A \cap B) = 0\]

The and relationship is commutative:

\[P(A \cap B) = P(B \cap A)\]

Probability of A or B occurring (union) - the addition rule

What's the probability of A or B occurring? Some sources write 'or' and some write '\(\cup\)'. Here's the rule:
\[P(A \ or \ B) = P(A  \cup B) = P(A) +  P(B) - P(A \cap B) \]

\[= P(A) + P(B) - P(A)P(B | A)\]

The or relationship is commutative:

\[P(A \cup B) = P(B \cup A)\]

For disjoint events, the addition rule simplifies to:

\[P(A \ or \ B) = P(A \cup B) = P(A) + P(B) \]

because from before we have:

\[P(A \cap B) =  0\]

Conditional probability - the conditional rule

What's the probability I have a disease given I've tested positive for the disease? We use the | symbol to mean "given that", so P(A | B) means the probability of A happening given that B has occurred.  Here are some examples from everyday life:

  • What's the probability I win the lottery given that I've bought a ticket?
  • What's the probability I will get a degree if I go to college?
  • What's the probability I will have an accident if I'm driving and if it's snowing and if it's dark?

The interesting thing about conditional probability is that it can be quite different from the 'raw' probability. For example, let's say you're from a poor family, you might only have a 10% chance of getting a degree, but if you get accepted to a college, the probability might shoot up to 50%, and if you actually go to college, the probability may get to 95%. The probability can change quite substantially depending on new information (as we'll see with Bayes' theorem).

The general rule is:

\[P(A | B) = \frac{P(B \cap A)}{P(B)}\]

If A and B are independent (A does not depend on B), then P(A | B) = P(A).

The law of total probability

There's a general form of this law and a more specific form. Because the specific form will be useful for Bayesian work later, we'll start with that.

\[P(A+) = P(A+ \cap \ B+) + P(A+ \cap \ B-)\]

In words, the probability of event A+ occurring is the probability of A+ and B+ both occurring plus the probability of A+ occurring and B+ not occurring (B-). This might be clearer if we remember that \(1 = P(B+) + P(B-)\) and we think of the probabilities using a Venn diagram.

The more general form of this law is:

\[P(A) = \sum_i{P(A \cap B_i)} = \sum_i{P(A | B_i)P(B_i)}\]
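Here's a minimal numeric sketch of the law in Python (the numbers are the 1% prevalence, 99% sensitivity, 99% specificity disease-testing example used elsewhere in this blog):

    # Law of total probability: P(T+) = P(T+ | D+) P(D+) + P(T+ | D-) P(D-)
    p_disease = 0.01
    p_pos_given_disease = 0.99
    p_pos_given_healthy = 0.01

    p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
    print(p_pos)    # 0.0198: roughly 2% of all tests come back positive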

The law of total probability and conditional probabilities

One of the most useful forms of Bayes' theorem relies on the combination of the law of total probability and conditional probability. Here's the key relationship:

\[1 = P(A | B) + P(\bar A | B)\]

Let me put this into words: if event B happens, then either A or not A happens - there are no other options - so the two probabilities must sum to 1.

What use is probability theory?

I grew up hearing about the value of 'common sense', but probability theory often gives results that seem very counterintuitive and 'common sense' can lead you wildly astray. A fun example is the Monty Hall problem, but there are lots of other examples in the real world where the probability of something happening is not what it appears to be at first - and they're not so fun. The counter-intuitive example you find most often on the internet is the probability that you have a disease given a positive test result; it's mostly not what you think.

Bayes' theorem takes us into the world of the counter-intuitive and I'll talk about Bayes in a future blog post.

Monday, November 16, 2020

Geese or enemy aircraft? Receiver Operating Characteristic curves in machine learning

In a strange quirk of history, one of the ways of evaluating machine learning algorithms has its roots in World War II and was subsequently used in a range of disciplines, including psychiatry. Only much later was it used in machine learning, but it kept its original name: receiver operating characteristic (ROC). I'm going to look at the history of this technique and explain what it is and why it's so important.

Is it geese or is it enemy planes?

In 1940, the situation in Britain was dire; the country was engaged in a desperate stand against Hitler. To weaken the country and break the will of the people, Nazi aircraft heavily bombed British cities - the infamous Blitz. I've seen estimates of over 43,000 people killed, and of course, there was huge damage to Britain's industrial and cultural infrastructure. Newsreel pictures and propaganda of the time give a view of the devastation. Britain stood alone against the Nazi threat; the Battle of Britain was an existential one.

(Office workers in London going to work through bomb damage. Image source: Wikimedia Commons, License: Public Domain.)

It was vital therefore to detect enemy aircraft as quickly as possible, so the British government used a new technology called radar. Radar receivers had a number of settings, for example, you could turn the gain (amplification) up, but what should the correct settings be? Obviously, you want to correctly identify enemy aircraft, but you don't want to identify a flock of geese as aircraft. If you divert limited resources to chasing wild geese, those resources aren't available to pursue the real threat. This is where the receiver operating characteristic curve comes in. It was a way of deciding the best operating point and/or deciding the best receiver.

Ways of being right and wrong

I've covered this before in a previous blog post about the confusion matrix, so I'll just briefly recap here. There are two ways to be right and two ways to be wrong if we're doing a binary classification (geese/enemy aircraft).


                              Actual
                              enemy aircraft     geese
  Prediction  enemy aircraft  True Positive      False Positive
              geese           False Negative     True Negative

From the counts of the True Positives, False Negatives, etc. we can define two quantities:

\[TPR = \frac{TP}{TP + FN} = 1 - FNR = True \ Positive \ Rate, \ sensitivity, \ recall, \ hit \ rate\]
\[FPR = \frac{FP}{FP + TN} = False \ Positive \ Rate, \ fall \ out\]

There are an overly large number of other quantities we can define to help us evaluate classification. But these quantities and numbers are points: they allow us to evaluate an algorithm at a point, or under a single operating condition.

A picture is worth a thousand words

The receiver operating characteristic is a plot of the True Positive Rate vs. the False Positive Rate for different settings. Generically, it looks something like this. 

We get a curve by varying a parameter and measuring TPR and FPR at each of the parameter values. In the case of our World War II radar receiver, the parameter could be gain; increasing the gain changes the trade-off between TPR and FPR. 

Let's imagine a receiver that was just a random selector - choosing geese or enemy aircraft based on the toss of a coin. We would expect it to give us a straight line at \(45^\circ\). Over time, the random selector would tend to the 50-50 point on the straight line. A real receiver has to do better than chance, so it has to be above the random line. In the chart below, the chance line is the black dotted line.

An ideal receiver has very different properties from the chance line. I've indicated an ideal operating curve in red on the chart below - it always gives a 100% True Positive Rate.

The ROC chart allows us to compare the behavior of different algorithms or different receivers. We could draw out the ROC curve for two receivers for example and choose the best one (the highest line). Here's a graphical representation.

A more mathematical way of doing the same thing is to use the ROC curves, but work out an area under the curve (AUC). An ideal receiver has an AUC of 1 (the red line), but obviously, the higher the AUC, the better.
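Here's a minimal sketch of computing an ROC curve and AUC in Python (scikit-learn and NumPy assumed; the labels and scores are synthetic stand-ins for a real classifier's output):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.default_rng(7)

    # Synthetic ground truth (1 = enemy aircraft, 0 = geese) and classifier scores.
    y_true = rng.integers(0, 2, size=1_000)
    y_score = 0.3 * y_true + rng.uniform(0, 1, size=1_000)   # noisy, but better than chance

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC:", roc_auc_score(y_true, y_score))
    # Plotting tpr against fpr gives an ROC curve like the ones described above.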

Machine learning

Classifiers enable us to make categorical decisions based on input data. For example, if a user types 'evening wear' into a shopping site, do you show them cocktail dresses or tuxedos? A machine learning algorithm might use the users' browsing behavior to make a guess about male or female clothing. But how correct is the algorithm? This is where ROC curves can be used to understand the degree of correctness and the appropriate algorithmic settings to use.

Uses of ROC curves outside of machine learning

ROC curves are used in a wide range of disciplines.

More tongue-in-cheek, a group of medical researchers in Sydney, Australia used a ROC to find the optimal walking speed for men over 70 to avoid death. If you're interested, the optimal speed is 0.82m/s. 

Limitations of ROC curves - precision-recall

In a previous blog post, I looked at the confusion matrix and talked about prevalence. The idea is simple: an imbalanced data set can give you a false sense of how accurate your algorithm is. If your data is imbalanced, a precision-recall plot may be more appropriate.

Going back to the confusion matrix, here's how we define precision and recall.

\[Precision = \frac{TP}{TP + FP}\]
\[Recall = \frac{TP}{TP + FN}\]

Here's a typical precision-recall curve.

Because precision gives us an indication of how relevant the results are, precision-recall curves are often used to evaluate information retrieval algorithms.
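Here's a minimal sketch of a precision-recall calculation in Python (scikit-learn and NumPy assumed; the data is synthetic and deliberately imbalanced):

    import numpy as np
    from sklearn.metrics import precision_recall_curve, average_precision_score

    rng = np.random.default_rng(8)

    # Synthetic, heavily imbalanced data: only about 5% of cases are positives.
    y_true = (rng.uniform(size=2_000) < 0.05).astype(int)
    y_score = 0.3 * y_true + rng.uniform(0, 1, size=2_000)

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    print("average precision:", average_precision_score(y_true, y_score))
    # Plotting precision against recall gives a curve like the one described above.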

Despite the long track record for receiver operating characteristic curves, precision-recall curves may be a better evaluation method. However, old habits die hard and ROC curves still reign.

Don't lose sight of the end goal

ROC and precision-recall curves are all about the same thing: figuring out how useful an algorithm is. There are lots of different ways an algorithm can be wrong, which means different ways of investigating correctness. Don't lose sight of the fact that under the hood, machine learning algorithms are probabilistic.

Reading more

https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S1481803500013336

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466.7628&rep=rep1&type=pdf

Monday, November 9, 2020

Dazed and confused: the confusion matrix and getting it right and wrong

How correct are my (machine learning) algorithms?

In machine learning, we're using algorithms to make predictions about outcomes based on input data. For example, given that a consumer at an online store views dog collars and dog leads, you might show them dog food if they search for 'pet food'. This is fairly obvious, but what if they then searched for evening wear, would you show them cocktail dresses or tuxedos? 

(The confusion matrix can be confusing. Image source: Pixabay. Author: Erika Wittlieb. License: Pixabay license)

The confusion matrix is about quantifying the correctness of algorithms, but it's not sufficient by itself. Fortunately, there are quantities we can derive from the confusion matrix that will show up certain types of error, as we'll see.

The confusion matrix

I'm going to use the example of an online store that sells pet products. Imagine an algorithm that tries to decide if a consumer has a cat or not. There are two ways the algorithm can be right and two ways the algorithm can be wrong. I'll draw it out as a matrix so you can see it a bit more easily. In reality, we might put counts of false negatives, etc. in the matrix.


                         Actual
                         cat               not cat
  Prediction  cat        True Positive     False Positive
              not cat    False Negative    True Negative

All of this sounds great. It looks like we can define some rates and be done.  Let's start with some definitions and see where we get to.

We might want to know how often we said it was a cat when it actually was a cat; in other words, when it actually was positive, how often did we say it was positive? This is called the True Positive Rate (TPR), which is defined like this (where FNR is the False Negative Rate and is similarly defined):

\[TPR = \frac{TP}{TP + FN} = 1 - FNR = sensitivity, recall, hit rate\]

On the flip side, how often did we say not cat when it really was not cat (how often did we say negative when it really was negative):

\[TNR = \frac{TN}{TN + FP} = 1 - FPR = specificity, selectivity\]

There are a whole bunch of other metrics we can similarly define and I won't belabor the point by defining them all here (it seems as if every possible combination of true/false positive/negative has a name). I'm just going to show some of them below to give you a flavor.


  • Precision (positive predictive value): \(\frac{TP}{FP + TP}\) - calculated along the 'prediction = cat' row
  • False Discovery Rate: \(\frac{FP}{FP + TP}\) - calculated along the 'prediction = cat' row
  • False Omission Rate: \(\frac{FN}{FN + TN}\) - calculated along the 'prediction = not cat' row
  • True Positive Rate (recall, sensitivity): \(\frac{TP}{TP + FN}\) - calculated down the 'actual = cat' column
  • True Negative Rate (specificity): \(\frac{TN}{TN + FP}\) - calculated down the 'actual = not cat' column
  • False Positive Rate: \(\frac{FP}{TN + FP}\) - calculated down the 'actual = not cat' column

Be careful here; it's easy to get caught up on the names and definitions. You should focus on what this means for the correctness of your results.

We can use these metrics to help decide if our algorithms are good or not - but there are other things we need to consider.

Prevalence

One of the major issues in algorithmic bias has been prevalence. It's possible to get what seems like highly accurate results but for the results to be deeply biased by the underlying data. Again, the confusion matrix can help.

We can define the accuracy of an algorithm using this formula:

\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

Let's imagine we're getting a really great accuracy. We're really good at saying it's a cat when it really is a cat. Doesn't this sound like a really great algorithm? Think about your answer before moving on.

The trouble is, it could be because almost all the underlying data is cat data. Imagine 95% of the data was cats and we said cat 100% of the time. Some of the metrics in the table would look wonderful. We'd get a 95% accuracy for example!

A version of this has happened in real life with awful consequences. Some of the human datasets that machine learning algorithms are trained on are biased: for example, they are disproportionately images of white people, or even worse, white males. In 2015, Google released a photo app that classified images. It misclassified pictures of black people as gorillas. This is just horrendous on multiple levels. The problem here might be that their training data set didn't include many pictures of non-white people. The labeling algorithms were accurate, just so long as you're white.

To test for bias in the dataset, we look at a number called prevalence which represents the fraction of the data set that's in a category. In our example, the prevalence of cats would be 0.95 and non-cats 0.05, which reveals a huge bias towards cats. This might be OK if the site was aimed at cat lovers, but not so great if the site was trying to grow non-cat sales.

If you're doing any machine learning work for public consumption, you must consider prevalence.

One number to bind them all

Precision, recall, false discovery rate... there are lots of numbers here and it gets confusing. Why don't we create one metric that binds them all together? We would like a score of 1 for this metric to represent perfection, and 0 to represent total failure. Fortunately, there is such a metric and it's called the \(F_1\) score.

I won't go into the derivation here, but I will give you the formula:

\[F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}\]

(for those of you who want a bit more, it's the harmonic mean of precision and recall). 

Even the \(F_1\) score isn't the end of it. It weighs precision and recall equally, but in reality, that might not be what we want. For example, we might consider a false positive much worse than a false negative (sending an innocent person to jail rather than setting a guilty person free for example). In these kinds of cases, there's a weighting factor \(\beta\) we can apply.

The value of \(\beta\) is a weighting we choose rather than something we calculate from the data: \(\beta > 1\) weights recall more heavily than precision, \(\beta < 1\) weights precision more heavily, and \(\beta = 1\) gives us back the \(F_1\) score. With a chosen \(\beta\), the revised F score is:

\[F_\beta =  \frac{(1 + \beta^2) TP}{(1 + \beta^2) TP + \beta^2FN + FP}\]

All this looks a bit familiar

By the way, there are very obvious parallels here to statistics, specifically, \(\alpha\), \(\beta\), Type I, and Type II errors. We're getting quite close to statistical tests with some of these processes, which probably isn't surprising. Sadly, similar things are called by different names in different disciplines, a nice way to keep barriers to entry high.

Snakes and pirates

Both Python and R have libraries you can use that will give you the confusion matrix and quantities derived from it. In Python, you should look at confusion_matrix in scikit-learn. In R, you need confusionMatrix from the caret package.
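Here's a minimal sketch of the Python route (scikit-learn assumed; the labels and predictions are made up for the cat / not-cat example):

    from sklearn.metrics import (confusion_matrix, accuracy_score,
                                 precision_score, recall_score, f1_score)

    # Made-up results for the cat / not-cat example (1 = cat, 0 = not cat).
    y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
    y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

    # Note: scikit-learn puts actual classes on the rows and predictions on the
    # columns, the transpose of the tables earlier in this post.
    print(confusion_matrix(y_true, y_pred))
    print("accuracy: ", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("F1:       ", f1_score(y_true, y_pred))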

What's next?

The confusion matrix is just the start. There are several techniques based on it that you can use to effectively evaluate algorithms. In a future blog post, I'm going to look at something called Receiver Operating Characteristic which has a very interesting history.  The thought I want to leave you with is a simple one: the confusion matrix is a means of representing different ways of being right and wrong. You can use quantities derived from the matrix to indicate bias and to indicate correctness.

Monday, November 2, 2020

The null hypothesis test

What's null hypothesis testing?

In business, as in many other fields, we have to make decisions in the face of uncertainty. Does this technology improve conversion? Is the new sales process working? Is the new machine tool improving quality? Almost never are the answers to these questions absolutely certain; there will be probabilities we have to trade off to make our decision.

(Two hypotheses battling it out for supremacy. Image source: Wikimedia Commons. Author: Pierdante Romei. License: Creative Commons.)

Null hypothesis tests are a set of techniques that enable us to reach probabilistic conclusions in an unbiased way. They provide a level playing field to decide if an effect is there or not.

Although null hypothesis tests are widely taught in statistics classes, many people who've come into data science from other disciplines aren't familiar with the core ideas. Similarly, people with business backgrounds sometimes end up evaluating A/B tests where the correct interpretation of null hypothesis tests is critical to understanding what's going on. 

I’m going to explain to you what null hypothesis testing is and some of the concepts needed to implement and understand it.

What result are you testing for?

To put it simply, a null hypothesis test is a test of whether there is an effect of a certain size present or not. The null hypothesis is that there is no effect, and the alternate hypothesis is that there is an effect. 

At its heart, the test is about probability and not certainty. We can’t say for sure if there is an effect or not; what we can say is how likely our results would be if there were no effect. But probabilities are limited and we have to make binary go/no-go decisions - so null hypothesis tests include the idea of probability thresholds for deciding whether something is there or not.

To illustrate the use of a null hypothesis test, I’m going to use a famous example, that of the lady tasting tea. 

In a research lab, there was a woman who claimed she could tell the difference between cups of tea prepared in one of two ways:

  • The milk poured into the cup first and then the tea poured in
  • The tea poured first and then the milk poured in.
(Image source: Wikimedia Commons Artist: Ian Smith License: Creative Commons)

The researcher decided to do a test of her abilities by asking her to taste multiple cups of tea and state how she thought each cup had been prepared. Of course, it’s possible she could be 100% successful by chance alone. 

We can set up a null hypothesis test using these hypotheses:

  • The null hypothesis is the most conservative option. Here it’s that she can’t taste the difference. More specifically, her success rate is indistinguishable from random chance.
  • The alternative hypothesis is that she can tell the difference. More specifically, her success rate is significantly different from random chance.
Let's define some quantities:
  • \( p_T \) - the proportion of cups of tea she identified correctly
  • \( p_C \)  - the proportion of cups of tea she would be expected to get by chance alone (by guessing)
We can write the null and alternative hypotheses as:

  • \( H_0: p_T = p_C\) 
  • \( H_1: p_T \neq p_C\) 

But – the hypotheses in this form aren't enough. Will we insist she has to be correct every single time? Is there some threshold we expect her to reach before we accept her claim?

The null hypothesis is the first step in setting up a statistical test, but to make it useful, we have to go a step further and set up thresholds. To do this, we have to understand different types of errors.

Error types

To make things easy, we’ll call 'milk first' a positive and 'milk second' a negative.

For our lady tasting tea, there are four possibilities:

  • She can say ‘milk first’ when it was 'milk first' – a true positive
  • She can say ‘milk first’ when it wasn’t 'milk first' – a false positive (also known as a Type I error)
  • She can say ‘milk second’ when it was 'milk second' – a true negative
  • She can say ‘milk second’ when it wasn’t 'milk second' – a false negative (also known as a Type II error)

This is usually laid out as a two-by-two table; here are the four cells:

If the null hypothesis is true:
  • Fail to reject - true negative, a correct inference. Probability threshold = 1 - \( \alpha \)
  • Reject - false positive, a Type I error. Probability threshold = \( \alpha \)
If the null hypothesis is false:
  • Fail to reject - false negative, a Type II error. Probability threshold = \( \beta \)
  • Reject - true positive, a correct inference. Probability threshold = power = 1 - \( \beta \)

We can assign probabilities to each of these outcomes. As you can see, two numbers matter here: \(\alpha\) and \(\beta\); in practice, though, we work with \(\alpha\) and 1-\(\beta\). \(\alpha\) is called the significance level, and 1-\(\beta\) is called the power. We set values for each of them prior to the test. By convention, \(\alpha\) is usually 0.05, and 1-\(\beta \geq \) 0.80.

Test results, test size, and p-values

Our lady could guess correctly by chance alone. We have to set up the test so a positive conclusion due to randomness is unlikely, hence the use of thresholds. The easiest way to do this is to set the test size correctly, i.e. set the number of cups of tea. Through some math I won't go into, we can use \(\alpha\), (1-\(\beta\)), and the effect size to set the sample size. The effect size, in this case, is her ability to detect how the cup of tea was prepared above and beyond what would be expected by chance. For example, we might run a test to see if she was 20% better than chance.
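As a sketch of the kind of calculation involved (the target rates, \(\alpha\), and power below are assumptions, and this uses the standard normal-approximation formula for a one-sample proportion test rather than anything specific to the original experiment):

    from math import ceil, sqrt
    from scipy.stats import norm

    def cups_needed(p_chance=0.5, p_claimed=0.7, alpha=0.05, power=0.8):
        """Approximate sample size for a two-tailed one-sample proportion test
        (normal approximation): detect a jump from p_chance to p_claimed."""
        z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
        z_power = norm.ppf(power)          # quantile corresponding to the desired power
        numerator = (z_alpha * sqrt(p_chance * (1 - p_chance)) +
                     z_power * sqrt(p_claimed * (1 - p_claimed)))
        return ceil((numerator / (p_claimed - p_chance)) ** 2)

    # With these assumptions, she'd need to taste roughly 47 cups.
    print(cups_needed())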

To evaluate the test, we calculate a p-value from the test results. The p-value is the probability of getting a result at least as extreme as the one we observed through chance alone. Because this is so important, I'm going to explain it again using other words. Let's imagine the lady tasting tea was guessing. By guessing alone, she could get anywhere between 0% and 100% correct, and we know the probability of each possible score. It's very unlikely she'll get 100% or 0% by guesswork; it's much more likely she'll get close to 50%. For the score she actually got, we can work out the probability of her getting that score (or better) through chance alone. Let's say there was a 3% chance she could have gotten her score by guessing alone. Is this proof she's not guessing?

We compare the p-value to our \( \alpha\) threshold to decide whether to reject the null hypothesis. Let’s say our p-value was 0.03 and our \( \alpha \) value was 0.05; because 0.03 < 0.05, we reject the null hypothesis. In other words, we would conclude the lady was probably not guessing.
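Here's a minimal sketch of that calculation using scipy's binomtest, with a made-up result of 9 correct cups out of 12:

    from scipy.stats import binomtest

    # Made-up result: the lady identifies 9 of 12 cups correctly.
    # Under the null hypothesis (pure guessing) her success rate is 0.5.
    result = binomtest(k=9, n=12, p=0.5, alternative='two-sided')

    alpha = 0.05
    print(f"p-value = {result.pvalue:.3f}")
    if result.pvalue < alpha:
        print("Reject the null hypothesis")
    else:
        print("Fail to reject the null hypothesis")

With these particular made-up numbers, the p-value comes out above 0.05, so we'd fail to reject the null hypothesis.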

False negatives, false positives

Using \(\alpha\) and a p-value, we can work out the chance of us saying there's an effect when there is none (a false positive). But what about a false negative? We could say there's no effect when there really is one. That might be as damaging to a business as a false positive. The quantity \(\beta\) gives us the probability of a false negative. By convention, statisticians talk about the power (1-\(\beta\)) of a test which is the probability of detecting an effect of the size you think is there.

Single tail or two-tail tests

Technically, the way the null hypothesis is set up in the case of the lady tasting tea is a two-tailed test. To ‘succeed’, she has to do a lot better than chance or she has to do a lot worse. That’s appropriate in this case because we’re trying to understand if she’s doing something other than guessing.

We could set up the test differently so she only has to be right more often than chance suggests. This would be a one-tail test. One-tail tests need fewer samples than two-tail tests to detect the same effect size, but they’re more limited.

In business, we tend to do two-tailed tests rather than one-tailed tests.

Fail to reject the null or rejecting the null

Remember, we’re talking about probabilities and not certainties. Even if we gave our lady 100 cups to taste, there’s still a possibility she gets them all right due to chance alone. So we can’t say either the null or the alternate hypothesis is true; all we can do is reject the null at some threshold, or fail to reject it. In the case of a p-value of 0.03, a statistician wouldn’t say the alternate is true (the lady can taste the difference), but they would say ‘we reject the null hypothesis’. If the p-value was 0.1, it would be higher than the \( \alpha \) value and we would ‘fail to reject the null hypothesis’. This language is complex, but statisticians are trying to capture the idea that results are about probabilities, not certainties.

Choice of significance and power

Significance and power affect test size, so maybe we should choose them to make the test short? If you want to do a valid test, you're not free to choose any values of \(\alpha\) and (1-\(\beta\)) you like. Convention dictates that you stick to these ranges:

  • \(\alpha \leq 0.05\) - anything bigger than this is usually considered a junk test.
  • (1-\(\beta) \geq 0.8\) - anything less than this is not worth doing. 

The why behind these values is the subject of another blog post.

The null hypothesis test summarized

This has been a very high-level summary of what happens in a null hypothesis test; for the sake of simplicity, I've left out several steps and greatly compressed some ideas. Here's a simple summary of the steps I've discussed.

  1. Decide if the test is one-tail or two-tail.
  2. Create a null and alternate hypothesis.
  3. Set values for \(\alpha\) and (1-\(\beta\)) prior to the test.
  4. After the test, calculate a p-value.
  5. Compare the p-value to \(\alpha\) to decide whether to reject the null hypothesis.
  6. Check \(\beta\) to figure out the probability of a false negative.

I've left out topics like the z-test and the t-test and a bunch of other important ideas. 

Your takeaway should be that this process is complex and there are no shortcuts. At its heart, hypothesis testing is about deciding what's true when the data is uncertain and you need to do it without bias.


(Justice is supposed to be blind and balanced - like a null hypothesis test. Image source: Wikimedia Commons. License:  GNU Free Documentation License.)

Problems with the null hypothesis test

Mathematically, there's controversy about the fundamentals of the procedure, but frankly, it's too complex to discuss here - in any case, the debate isn't over whether the procedures work or not.

A more serious problem is baked into the approach. At its heart, null hypothesis testing is about making a binary yes/no decision based on probabilistic data. The results are never certain. Unfortunately, test results are often taken as certain. For example, if we can't detect an effect in a test, it's often assumed there is no effect, but that's not true. This assumption that no detection = no effect has had tragic consequences in medical trials; there are high-profile cases where the negative side effects of a drug have been just below the threshold levels. Sadly, once the drugs have been released, the negative effects become well known, with disastrous consequences; a good example is Vioxx.

You must be aware that a test failure doesn't mean there isn't an effect. It could mean there's an effect hovering just below your acceptance threshold.

Using the null hypothesis in business

This is all a bit abstract, so let's bring it back to business. What are some examples of null hypothesis tests in the business world?

A/B testing

Most of the time, we choose a two-tail test because we're interested in the possibility a change might make conversion or other metrics worse. The hypothesis test we use is usually of this form:

\(H_0 : CR_B = CR_A\)

\(H_1 : CR_B \neq CR_A\)

where CR is the conversion rate (or another metric such as revenue per user or add-to-bag rate) for each branch of the test.
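As a sketch of how such a test might be evaluated in Python (the conversion counts are made up, and this uses statsmodels' two-proportion z-test as one reasonable choice):

    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    # Made-up results: conversions and visitors for branches A and B.
    conversions = np.array([410, 462])
    visitors = np.array([10000, 10000])

    # Two-tailed test of H0: CR_B = CR_A against H1: CR_B != CR_A.
    z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors,
                                        alternative='two-sided')
    print(f"z = {z_stat:.2f}, p = {p_value:.3f}")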

Manufacturing defects

Typically, these tests are one-tailed because we're only interested in an improvement. Here, the test might be:

\(H_0 : DR_B = DR_A\)

\(H_1 : DR_B < DR_A\)

where DR is the defect rate.
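The one-tailed version only changes the alternative we ask for; here's a sketch with made-up defect counts, again using statsmodels:

    from statsmodels.stats.proportion import proportions_ztest

    # Made-up defect counts: the new process (B) first, the old process (A) second.
    defects = [38, 55]
    parts = [5000, 5000]

    # One-tailed test of H0: DR_B = DR_A against H1: DR_B < DR_A.
    # 'smaller' tests whether the first proportion is less than the second.
    z_stat, p_value = proportions_ztest(count=defects, nobs=parts,
                                        alternative='smaller')
    print(f"z = {z_stat:.2f}, p = {p_value:.3f}")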

Closing thoughts

If all this seems a bit complex, arbitrary, and dependent on conventions, you're not alone. As it turns out, null hypothesis techniques are based on the shotgun marriage of two separate approaches to statistics. In a future blog post, I'll delve into this some more. 

For now, here's what you should take away:
  • You should understand that you need education and training to run these kinds of tests. A good grounding in statistics is vital.
  • The results are probabilistic and not certain. A negative test doesn't mean an effect isn't there, it might just be hovering underneath the threshold of detection.

Reading more

https://www.sagepub.com/sites/default/files/upm-binaries/40007_Chapter8.pdf