
Friday, October 10, 2025

Regression to the mean

An unfortunate phrase with unfortunate consequences

"Regression to the mean" is a simple idea that has profound consequences. It's led people astray for decades, if not centuries. I'm going to explain what it is, the consequences of not understanding it, and what you can do to protect yourself and your organization.

Let's give a simple definition for now: it's the tendency, when sampling data, for more extreme values to be followed by values closer to the mean. Here's an example: if I give the same children IQ tests over time, I'll see very high scores followed by more average scores, and some very low scores followed by more average scores. It doesn't mean the children are improving or getting worse; it's just regression to the mean. The problems occur when people attach a deeper meaning, as we'll see.

(Francis Galton, popularizer of "Regression to the mean")

What it means - simple examples

I'm going to start with an easy example that everyone should be familiar with, a simple game with a pack of cards.

  • Take a standard pack of playing cards and label the cards in each suit 1 to 13 (Ace is 1, 2 is 2, Jack is 11, etc.). The mean card value is 7.5. 
  • Draw a card at random. 
  • Imagine it's a Queen (12). Now, replace the card and draw another card. Is it likely the card will have a lower value or a higher value? 
    • The probability is 11/13 that it will have a lower value. 
  • Now imagine you drew an ace (1), replace the card and draw again. 
    • The probability of drawing another ace is 1/13.
    • The probability of drawing a 2 or higher is 12/13. 
It's obvious in this example that "extreme" value cards are very likely to be followed by more "average" value cards. This is regression to the mean at work: nothing complex, just a probability distribution.
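If you'd like to convince yourself, here's a minimal simulation of the card game: draw a card, replace it, draw again, and estimate the chance that a Queen (12) is followed by a lower-value card. This is just a sketch of the game described above, nothing more.

import random

values = list(range(1, 14))   # Ace=1 ... King=13; mean value is 7.5
trials = 200_000
queens = 0
queens_followed_by_lower = 0

for _ in range(trials):
    first = random.choice(values)
    second = random.choice(values)   # the card is replaced, so the draws are independent
    if first == 12:
        queens += 1
        if second < first:
            queens_followed_by_lower += 1

print(queens_followed_by_lower / queens)   # should be close to 11/13, roughly 0.846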

The cards example seems simple and obvious. Playing cards are very familiar and we're comfortable with randomness (in fact, almost all card games rely on randomness). The problem occurs when we have real measurements: we tend to attach explanations to the data when randomness (and regression to the mean) is all that's there.

Let's say we're measuring the average speed of cars on a freeway. Here are 100 measurements of car speeds. What would you conclude about the freeway? What pattern can you see in the data and what does it tell you about driver behavior (e.g. lower speeds following higher speeds and vice versa)? What might cause it? 

[46.7, 63.3, 80.0, 71.7, 34.2, 55.0, 67.5, 34.2, 67.5, 67.5, 59.2, 63.3, 55.0, 34.2, 63.3, 63.3, 63.3, 59.2, 75.8, 71.7, 42.5, 42.5, 34.2, 34.2, 59.2, 67.5, 59.2, 71.7, 71.7, 67.5, 50.8, 63.3, 34.2, 63.3, 30.0, 38.3, 50.8, 34.2, 75.8, 75.8, 46.7, 80.0, 55.0, 46.7, 38.3, 38.3, 75.8, 59.2, 34.2, 42.5, 71.7, 71.7, 80.0, 80.0, 71.7, 34.2, 63.3, 71.7, 46.7, 42.5, 46.7, 46.7, 63.3, 80.0, 80.0, 38.3, 38.3, 46.7, 38.3, 34.2, 46.7, 75.8, 55.0, 30.0, 55.0, 75.8, 30.0, 42.5, 67.5, 30.0, 50.8, 67.5, 67.5, 71.7, 67.5, 67.5, 42.5, 75.8, 75.8, 34.2, 55.0, 50.8, 38.3, 71.7, 46.7, 71.7, 50.8, 71.7, 42.5, 42.5]

Let's imagine the authorities introduced a speed camera at the measurement I've indicated in red. What might you conclude about the effect of the speed camera?

You shouldn't conclude anything at all from this data. It's entirely random. In fact, it has the same probability distribution as the pack of cards example: I've used 13 different average speeds, each with the same probability of occurrence. What you're seeing is the result of me drawing cards from a pack and relabeling them with values like 71.7 instead of 9. The speed camera had no effect in this case. The data set shows regression to the mean and nothing more.
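Here's a sketch of how a dataset like this can be produced: 13 equally likely "speeds", one per card value. The even spacing from 30 to 80 mph is my reconstruction of the mapping, not something taken from the original data-generation code.

import random

# 13 equally likely speeds, one per card value; spacing is my assumption
speeds = [round(30 + (k - 1) * 50 / 12, 1) for k in range(1, 14)]  # 30.0, 34.2, ..., 80.0
samples = [random.choice(speeds) for _ in range(100)]
print(samples)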

The pack of cards and the vehicle speeds are exactly the same example. In the pack of cards case, we understand the randomness and we can intuitively see what regression to the mean actually means. Once we have a real-world problem, like the cars on the freeway, our tendency is to look for explanations that aren't there and to discount randomness. Looking for meaning in random data has had bad consequences, as we'll see.

Schools example

In the last few decades in the US, several states have introduced standardized testing to measure school performance. Students in the same year group take the same test and, based on the results, the state draws conclusions about the relative standing of schools; it may intervene in low performing schools. The question is, how do we measure the success of these interventions? Surely, we would expect to see an improvement in test scores taken the next year? In reality, it's not so simple.

The average test result for a group of students will obviously depend on things like teaching, prior attainment etc. But there are also random factors at work. Individual students might perform better or worse than expected due to sickness, or family issues, or a host of other random issues. Of course, different year groups in the same school might have a different mix of abilities. All of which means that regression to the mean should show up in consecutive tests. In other words, low performing schools might show an improvement and high performing schools might show a degradation entirely due to random factors.

This isn't a theoretical example: regression to the mean has been clearly shown in school scores in Massachusetts, California and in other states (see Haney, Smith & Smith). Sadly, state politicians and civil servants have intervened based on scores and drawn conclusions where they shouldn't.

Children's education evokes a lot of emotion and political interest, which is not a good mix. It's important to understand concepts like regression to the mean so we can better understand what's really going on.

Heights example

"Regression to the mean" was originally called "regression to mediocrity", and was based on the study of human heights. If regression to mediocrity sounds very disturbing, it should do. It's closely tied to eugenics through Francis Galton. I'm not going to dwell on the links between statistics and eugenics here, but you should know the origins of statistics aren't sin free.

In 1880s England, Galton studied the heights of parents and their children. I've reproduced some of his results below. He found that parents who were above average height tended to have children closer to the average height, and that parents below average height also tended to have children closer to the average height. This is the classic regression to the mean example. 

Think for a moment about the possible different outcomes of a study like this. If taller parents had taller children, and shorter parents had shorter children, then we might expect to see two population groups emerging (short people and tall people) and maybe the start of speciation. Conversely, if tall parents had short children, and short parents had tall children, this would be very noticeable and much commented on. Regression to the mean turns out to be a good explanation of what we observe in nature.

Galton's height study was very influential for both the study of genetics and the creation of statistics as a discipline.

New sports players

Let's take a cohort of baseball players in their first season. Obviously, talent makes a difference, but there are random factors at play too. We might expect some players to do extremely well, others to do well, some to do OK, some to do poorly, and some to do very poorly. Regression to the mean tells us that some standout players may well perform worse the next year. Other, lower-ranked players will perform better for the same reason. The phenomenon of outstanding new players performing worse in their second year is often called the "sophomore slump" and a lot has been written about it, but in reality, it can mostly be explained by regression to the mean.


Business books

Popular business books often fall into the regression to the mean trap. Here's what happens. A couple of authors analyze top-performing businesses, usually measured by stock price, and find some commonalities. They develop these commonalities into a framework and write a best-selling business book whose thesis is that if you follow the framework, you'll be successful. They follow this with another book that's not quite as good. Then they write a third book that only the true believers read.

Unfortunately, the companies they selected as winners don't do as well over a decade or more, and the longer the timescale, the worse the performance. Over the long run, the authors' promise that they've found the elixir of success turns out to be untrue. Their books go from the best-seller list to the remainder bucket.

A company's stock price is determined by many factors, for example, its competitors, the state of the market, and so on. Only some of them are under the control of the company. Conditions change over time in unpredictable ways. Regression to the mean suggests that great stock price performers now might not be in the future, and low performers may do better. Regression to the mean neatly explains why picking winners today does not mean the same companies will be winners in the years to come. In other words, basic statistics makes a mockery of many business books.

Reading more:

  • The Halo Effect: . . . and the Eight Other Business Delusions That Deceive Managers - Phil Rosenzweig 

My experience

I've seen regression to the mean pop up in all kinds of business data sets and I've seen people make the classic mistake of trying to derive meaning from randomness. Here are some examples.

Sales data has a lot of random fluctuation, and of course, the smaller the sample, the greater the fluctuations. I've seen salespeople have a standout year followed by a very average year and vice versa. I've seen the same pattern at the regional and country level too. Unfortunately, I've also seen analysts tie themselves in knots trying to explain these patterns. Even worse, they've made foolish predictions based on small sample sets and just a few years' worth of data.

I've seen very educated people get very excited by changes in company assessment data. They thought they'd spotted something significant because companies that performed well one year tended to perform a bit worse the next year, and so on. Regression to the mean explained all of the data.

How not to be fooled

Regression to the mean is hidden in lots of data sets and can lead you into making poor decisions. If you're analyzing a dataset, here are some questions to ask:

  • Is your data the result of some kind of sampling process? 
  • Does randomness play a part in your collection process or in the data?
  • Are there unknowns that might influence your data?

If the answer to any of these questions is yes, you should assume you'll find regression to the mean in your dataset. Be careful about your analysis and especially careful about explaining trends. Of course, the smaller your data set, the more vulnerable you are.

You can estimate the effect of regression to the mean on your data using a variety of methods. I'm not going to go into them in depth here. In the literature, you'll see references to running a randomized controlled trial (RCT), also known as an A/B test. That's great in theory, but in reality it's not appropriate for most business situations. In practice, you'll have to run simulations or do some straightforward estimation of the fractional regression to the mean.
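Here's a minimal simulation sketch of that kind of estimation, under a simple assumed model: each unit (salesperson, school, region) has a fixed true level plus independent noise each period. We pick the top 10% in period 1 and see how much of their apparent advantage survives into period 2. The model and the numbers are mine, purely for illustration.

import random

random.seed(42)
n = 10_000
true_level = [random.gauss(100, 10) for _ in range(n)]
period1 = [t + random.gauss(0, 10) for t in true_level]   # observed score = level + noise
period2 = [t + random.gauss(0, 10) for t in true_level]

cutoff = sorted(period1)[int(0.9 * n)]                    # top 10% in period 1
top = [i for i in range(n) if period1[i] >= cutoff]

mean_all_1 = sum(period1) / n
mean_top_1 = sum(period1[i] for i in top) / len(top)
mean_top_2 = sum(period2[i] for i in top) / len(top)

# fraction of the period-1 advantage that persists into period 2;
# roughly 0.5 here because half the observed variance is noise
print((mean_top_2 - mean_all_1) / (mean_top_1 - mean_all_1))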

Monday, March 10, 2025

Everything you wanted to know about the normal distribution but were afraid to ask

Normal is all around you, and so is not-normal

The normal distribution is the most important statistical distribution. In this blog post, I'm going to talk about its properties, where it occurs, and why it's so very important. I'm also going to talk about how using the normal distribution when you shouldn't can lead to disaster and what you can do about it.

(Ainali, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons)

A rose by any other name

The normal distribution has a number of different names in different disciplines:

  • Normal distribution. This is the name used by statisticians and data scientists.
  • Gaussian distribution. This is what physicists call it.
  • The bell curve. The name used by social scientists and by people who don't understand statistics.

I'm going to call it the normal distribution in this blog post, and I'd advise you to call it this too. Even if you're not a data scientist, using the most appropriate name helps with communication.

What it is and what it looks like

When we're measuring things in the real world, we see different values. For example, if we measure the heights of 10 year old boys in a town, we'd see some tall boys, some short boys, and most boys around the "average" height. We can work out what fraction of boys are a certain height and plot a chart of frequency on the y axis and height on the x axis. This gives us a probability or frequency distribution. There are many, many different types of probability distribution, but the normal distribution is the most important.

(As an aside, you may remember making histograms at school. These are "sort-of" probability distributions. For example, you might have recorded the height of all the children in a class, grouped them into height ranges, counted the number of children in each height range, and plotted the chart. The y axis would have been a count of how many children in that height range. To turn this into a probability distribution, the y axis would become the fraction of all children in that height range. )

Here's what a normal probability distribution looks like. Yes, it's the classic bell curve shape which is exactly symmetrical.


The formula describing the curve is quite complex, but all you need to know for now is that it's described by two numbers: the mean (often written \(\mu\)) and a standard deviation (often written \(\sigma\)). The mean tells you where the peak is and the standard deviation gives you a measure of the width of the curve. 

To greatly summarize: values near the mean are the most likely to occur and the further you go from the mean, the less likely they are. This lines up with our boys' heights example: there aren't many very short or very tall boys and most boys are around the mean height.

Obviously, if you change the mean or the standard deviation, you change the curve, for example, you can change the location of the mean or you can make the curve wider or narrower. It turns out that changing the mean and standard deviation just scales the curve because of its underlying mathematical properties. Most distributions don't behave like this; changing parameters can greatly change the entire shape of the distribution (for example, the beta distribution wildly changes shape if you change its parameters). The normal scaling property has some profound consequences, but for now, I'll just focus on one: we can map any normal distribution onto the single standard normal distribution. Because the properties of the standard normal are known, we can easily do math on it. To put it another way, it greatly simplifies the calculations we need to do.
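Here's a small illustration of that scaling property: any normal can be mapped onto the standard normal with \(z = (x - \mu) / \sigma\), so one set of tables or functions covers them all. The height numbers are made up for the example.

from scipy.stats import norm

mu, sigma = 140.0, 5.0                   # illustrative: boys' heights in cm
x = 150.0
z = (x - mu) / sigma                     # z = 2.0: two standard deviations above the mean

# both calculations give the same probability of seeing a value below x
print(norm.cdf(x, mu, sigma))            # ~0.977, using the original distribution
print(norm.cdf(z))                       # ~0.977, using the standard normal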

Why the normal distribution is so important

Here are some normal distribution examples from the real world.

Let's say you're producing precision bolts. You need to supply 1,000 bolts of a precise specification to a customer. Your production process has some variability. How many bolts do you need to manufacture to get 1,000 good ones? If you can describe the variability using a normal distribution (which is the case for many manufacturing processes), you can work out how many you need to produce.
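Here's a sketch of the bolt calculation, with made-up numbers: the process produces bolt diameters that are normally distributed, and only bolts inside the spec limits are good. The mean, standard deviation, and spec limits below are illustrative assumptions, not figures from the post.

from scipy.stats import norm
import math

mu, sigma = 10.00, 0.02        # process mean and standard deviation in mm (illustrative)
lower, upper = 9.97, 10.03     # customer's spec limits (illustrative)

fraction_good = norm.cdf(upper, mu, sigma) - norm.cdf(lower, mu, sigma)
needed = math.ceil(1000 / fraction_good)
print(fraction_good, needed)   # ~0.866 of bolts are good, so manufacture ~1,155 bolts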

Imagine you're outfitting an army and you're buying boots. You want to buy the minimum number of boots while still fitting everyone. You know that many body dimensions follow the normal distribution (most famously, chest circumference), so you can make a good estimate of how many boots of different sizes to buy.

Finally, let's say you've bought some random stocks. What might the daily change in value be? Under usual conditions, the change in value follows a normal distribution, so you can estimate what your portfolio might be worth tomorrow.

It's not just these three examples, many phenomena in different disciplines are well described by the normal distribution.

The normal distribution is also common because of something called the central limit theorem (CLT). Let's say I'm taking measurement samples from a population, e.g. measuring the speed of cars on a freeway. The CLT says that the distribution of the sample means will follow a normal distribution regardless of the underlying distribution. In the car speed example, I don't know how the speeds are distributed, but I can calculate a mean and know how certain I am that the sample mean is close to the true (population) mean. This sounds a bit abstract, but it has profound consequences in statistics and means that the normal distribution comes up time and time again.

Finally, it's important because it's so well-known. The math to describe and use the normal distribution has been known for centuries. It's been written about in hundreds of textbooks in different languages. More importantly, it's very widely taught; almost all numerate degrees will cover it and how to use it. 

Let's summarize why it's important:

  • It comes up in nature, in finance, in manufacturing etc.
  • It comes up because of the CLT.
  • The math to use it is standardized and well-known.

What useful things can I do with the normal distribution?

Let's take an example from the insurance world. Imagine an insurance company insures house contents and cars. Now imagine the claim distribution for cars follows a normal distribution and the claims distribution for house contents also follows a normal distribution. Let's say in a typical year the claims distributions look something like this (cars on the left, houses on the right).

(The two charts look identical except for the numbers on the x and y axis. That's expected. I said before that all normal distributions are just scaled versions of the standard normal. Another way of saying this is, all normal distribution plots look the same.)

What does the distribution look like for cars plus houses?

The long-winded answer is to use convolution (or even Monte Carlo). But because the house and car distributions are normal, we can just do:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

So we can calculate the combined distribution in a heartbeat. The combined distribution looks like this (another normal distribution, just with a different mean and standard deviation).

To be clear: this only works because the two distributions were normal.
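Here's a quick check of the shortcut: add the means, add the variances, and compare with a brute-force Monte Carlo sum of the two claim distributions. The claim figures are illustrative numbers I've made up, not the values from the charts.

import numpy as np

rng = np.random.default_rng(0)
mu_cars, sigma_cars = 2_000_000, 250_000        # annual car claims (illustrative)
mu_houses, sigma_houses = 5_000_000, 600_000    # annual house claims (illustrative)

combined = rng.normal(mu_cars, sigma_cars, 1_000_000) + \
           rng.normal(mu_houses, sigma_houses, 1_000_000)

print(combined.mean(), mu_cars + mu_houses)                        # means agree
print(combined.std(), (sigma_cars**2 + sigma_houses**2) ** 0.5)    # standard deviations agree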

It's not just adding distributions together. The normal distribution allows for shortcuts if we're multiplying or dividing etc. The normal distribution makes things that would otherwise be hard very fast and very easy.

Some properties of the normal distribution

I'm not going to dig into the math here, but I am going to point out a few things about the distribution you should be aware of.

The "standard normal" distribution goes from \(-\infty\) to \(+\infty\). The further away you get from the mean, the lower the probability, and once you go several standard deviations away, the probability is quite small, but never-the-less, it's still present. Of course, you can't show \(\infty\) on a chart, so most people cut off the x-axis at some convenient point. This might give the misleading impression that there's an upper or lower x-value; there isn't. If your data has upper or lower cut-off values, be very careful modeling it using a normal distribution. In this case, you should investigate other distributions like the truncated normal.

The normal distribution models continuous variables, e.g. variables like speed or height that can have any number of decimal places (but see my previous paragraph on \(\infty\)). However, it's often used to model discrete variables (e.g. number of sheep, number of runs scored, etc.). In practice, this is mostly OK, but again, I suggest caution.

Abuses of the normal distribution and what you can do

Because it's so widely known and so simple to use, people have used it where they really shouldn't. There's a temptation to assume the normal when you really don't know what the underlying distribution is. That can lead to disaster.

In the financial markets, people have used the normal distribution to predict day-to-day variability. The normal distribution predicts that large changes will occur with very low probability; when they do occur, they're often called "black swan events". However, if the distribution isn't normal, "black swan events" can occur far more frequently than the normal distribution would predict. The reality is, financial market distributions are often not normal. This creates opportunities and risks. The assumption of normality has led to bankruptcies.

Assuming normality can lead to models making weird or impossible predictions. Let's say I assume the number of units sold for a product is normally distributed. Using previous years' sales, I forecast unit sales next year to be 1,000 units with a standard deviation of 500 units. I then create a Monte Carlo model to forecast next year's profits. Can you see what can go wrong here? Monte Carlo modeling uses random numbers. In the sales forecast example, there's a 2.28% chance the model will select a negative sales number, which is clearly impossible. Given that Monte Carlo models often use tens of thousands of simulations, it's extremely likely the final calculation will have been affected by impossible numbers. This kind of mistake is insidious and hard to spot, and even experienced analysts make it.
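You can reproduce the problem in a couple of lines: a normal forecast with mean 1,000 and standard deviation 500 generates impossible negative sales about 2.28% of the time, and a Monte Carlo model built on it will hit those values.

import numpy as np
from scipy.stats import norm

print(norm.cdf(0, loc=1000, scale=500))        # ~0.0228: the theoretical probability of negative sales

rng = np.random.default_rng(1)
simulated_sales = rng.normal(1000, 500, 50_000)
print((simulated_sales < 0).mean())            # the Monte Carlo model hits negative sales at about the same rate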

If you're a manager, you need to understand how your team has modeled data. 

  • Ask what distributions they've used to model their data. 
  • Ask them why they've used that distribution and what evidence they have that the data really is distributed that way. 
  • Ask them how they're going to check their assumptions. 
  • Most importantly, ask them if they have any detection mechanism in place to check for deviation from their expected distribution.

History - where the normal came from

Rather unsatisfactorily, there's no clear "Eureka!" moment for the discovery of the distribution, it seems to have been the accumulation of the work of a number of mathematicians. Abraham de Moivre  kicked off the process in 1733 but didn't formalize the distribution, leaving Gauss to explicitly describe it in 1801 [https://medium.com/@will.a.sundstrom/the-origins-of-the-normal-distribution-f64e1575de29].

Gauss used the normal distribution to model measurement errors and so predict the path of the asteroid Ceres [https://en.wikipedia.org/wiki/Normal_distribution#History]. This sounds a bit esoteric, but there's a point here that's still relevant. Any measurement-taking process involves some form of error. Assuming no systematic bias, these errors are well-modeled by the normal distribution. So any unbiased measurement taken today (e.g. opinion polling, measurements of particle mass, measurement of precision bolts, etc.) uses the normal distribution to calculate uncertainty.

In 1810, Laplace placed the normal distribution at the center of statistics by formulating the Central Limit Theorem. 

The math

The probability distribution function is given by:

\[f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e ^ {-\frac{1}{2} ( \frac{x - \mu}{\sigma}) ^ 2  }\]

\(\sigma\) is the standard deviation and \(\mu\) is the mean. In the normal distribution, the mean is the same as the mode is the same as the median.

This formula is almost impossible to work with directly, but you don't need to. There are extensive libraries that will do all the calculations for you.
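As an example of leaning on a library instead of the formula, scipy.stats.norm gives you the pdf, the cdf, and random samples for any mean and standard deviation. The numbers below are purely illustrative.

from scipy.stats import norm

mu, sigma = 7.5, 3.0                                 # illustrative values

print(norm.pdf(7.5, mu, sigma))                      # height of the curve at the mean
print(norm.cdf(10.5, mu, sigma))                     # probability of a value below mu + sigma, ~0.841
print(norm.rvs(mu, sigma, size=5, random_state=0))   # five random draws from this distribution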

Adding normally distributed parameters is easy:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

Wikipedia has an article on how to combine normally distributed quantities, e.g. addition, multiplication etc. see  https://en.wikipedia.org/wiki/Propagation_of_uncertainty.

Thursday, February 13, 2025

Why assuming independence can be very, very bad for business

Independence in probability

Why should I care about independence?

Many models in the finance industry and elsewhere assume events are independent. When this assumption fails, catastrophic losses can occur, as we saw in 2008 and 1992. The problem is, developers and data scientists assume independence because it greatly simplifies problems, but the executive team often don't know this has happened, or even worse, don't understand what it means. As a result, the company ends up being badly caught out when circumstances change and independence no longer applies.

(Sergio Boscaino from Busseto, Italy, CC BY 2.0 , via Wikimedia Commons)

In this post, I'm going to explain what independence is, why people assume it, and how it can go spectacularly wrong. I'll provide some guidance for managers so they know the right questions to ask to avoid disaster. I've pushed the math to the end, so if math isn't your thing, you can leave early and still get the benefit.

What is independence?

Two events are independent if the outcome of one doesn't affect the other in any way. For example, if I throw two dice, the probability of me throwing a six on the second die isn't affected in any way by what I throw on the first die. 

Here are some examples of independent events:

  • Throwing a coin and getting a head, throwing a die and getting a two.
  • Drawing a king from a deck of cards, winning the lottery having bought a ticket.
  • Stopping at at least one red light on my way to the store, rain falling two months from now.
By contrast, some events are not independent (they're dependent):
  • Raining today and raining tomorrow. Rain today increases the chances of rain tomorrow.
  • Heavy snow today and a football match being played. Heavy snow will cause the match to be postponed.
  • Drawing a king from a deck of cards, then without replacing the card, drawing a king on the second draw.

Why people assume independence

People assume independence because the math is a lot, lot simpler. If two events are dependent, the analyst has to figure out the relationship between them, something that can be very challenging and time consuming to do. Other than knowing there's a relationship, the analyst might have no idea what it is and there may be no literature to guide them. If you have no idea what the relationship is, it's easier to assume there's none.

Sometimes, analysts assume independence because they don't know any better. If they're not savvy about probability theory, they may do a simple internet search on combining probabilities that suggests all they have to do is multiply probabilities, and they assume they can use this process all the time. This is assuming independence through ignorance. I believe people make this mistake in practice because I've interviewed candidates with MS degrees in statistics who've made this kind of blunder.

Money and fear can also drive the choice to assume independence. Imagine you're an analyst. Your manager is on your back to deliver a model as soon as possible. If you assume independence, your model will be available on time and you'll get your bonus, if you don't, you won't hit your deadline and you won't get your bonus. Now imagine the bad consequences of assuming independence won't be visible for a while. What would you do?

Harder examples

Do you think the following are independent?

  • Two unrelated people in different towns defaulting on their mortgage at the same time
  • Houses in different towns suffering catastrophic damage (e.g. fire, flood, etc.)

Most of the time, these events will be independent. For example, a house burning down because of poor wiring doesn't tell you anything about the risk of a house burning down in a different town (assuming a different electrician!). But there are circumstances when the independence assumption fails:

  • A hurricane hits multiple towns at once causing widespread catastrophic damage in different insurance categories (e.g. hurricane Andrew in 1992).
  • A recession hits, causing widespread lay-offs and mortgage defaults, especially for sub-prime mortgages (2008).

Why independence fails

Prior to 1992, the insurance industry had relatively simple risk models. They assumed independence; an assumption that worked well for some time. In an average year, they knew roughly how many claims there would be for houses, cars etc. Car insurance claims were independent of house insurance claims that in turn were independent of municipal and corporate insurance claims and so on. They were independent, at least in part, because there was no common causal mechanism.

When Hurricane Andrew hit Florida in 1992, it destroyed houses, cars, schools, hospitals, etc. across multiple towns. The assumption of independence just wasn't true in this case. The insurance claims were sky high and bankrupted several companies. 

(Hurricane Andrew, houses destroyed in Dade County, Miami. Image from FEMA. Source: https://commons.wikimedia.org/wiki/File:Hurricane_andrew_fema_2563.jpg)

To put it simply, the insurance computer models didn't adequately model the risk because they had independence baked in.  

Roll forward 15 years and something similar happened in the financial markets. Sub-prime mortgage lending was built on a set of assumptions, including default rates. The assumption was that mortgage defaults were independent of one another. Unfortunately, as the 2008 financial crisis hit, this was no longer valid. As more people were laid off, the economy went down, so more people were laid off. This was often called contagion, but perhaps a better analogy is the reverse of a well-known saying: "a rising tide floats all boats".


Financial Crisis Newspaper
(Image credit: Secret London 123, CC BY-SA 2.0, via Wikimedia Commons)

The assumption of independence simplified the analysis of sub-prime mortgages and gave the results that people wanted. The incentives weren't there to price in risk. Imagine your company was making money hand over fist and you stood up and warned people of the risks of assuming independence. Would you put your bonus and your future on the line to do so?

What to do - recommendations

Let's live in the real world and accept that assuming independence gets us to results that are usable by others quickly.

If you're a developer or a data scientist, you must understand the consequences of assuming independence and you must recognize that you're making that assumption.  You must also make it clear what you've done to your management.

If you're a manager, you must be aware that assuming independence can be dangerous but that it gets results quickly. You need to ask your development team about the assumptions they're making and when those assumptions fail. It also means accepting your role as a risk manager; that means investing in development to remove the independence assumption.

To get results quickly, it may well be necessary for an analyst to assume independence.  Once they've built the initial model (a proof of concept) and the money is coming in, then the task is to remove the independence assumption piece-by-piece. The mistake is to stop development.

The math

Let's say we have two events, A and B, with probabilities of occurring P(A) and P(B). 

If the events are independent, then the probability of them both occurring is:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]

This equation serves as both a definition of independence and test of independence as we'll see next.

Let's take two cases and see if they're independent:

  1. Rolling a die and getting a 1 and a 2
  2. Rolling a die and getting a (1 or 2) and (2, 4, or 6)

For case 1, here are the probabilities:
  • \(P(A) = 1/6\)
  • \(P(B) = 1/6\)
  • \(P(A \cap B) = 0\); it's not possible to get 1 and 2 at the same time
  • \(P(A)P(B) = (1/6) \times (1/6) = 1/36\)
So the equation \(P(A \ and \ B) = P(A \cap B) = P(A)P(B)\) isn't true, therefore the events are not independent.

For case 2, here are the probabilities:
  • \(P(A) = 1/3\)
  • \(P(B) = 1/2\)
  • \(P(A \cap B) = 1/6\)
  • \(P(A)P(B) = (1/3) \times (1/2) = 1/6\)
So the equation is true, therefore the events are independent.
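If you'd rather not do the arithmetic by hand, here's a brute-force check of the two cases: enumerate the six faces of a single roll and compare P(A and B) with P(A)P(B).

from fractions import Fraction

faces = range(1, 7)

def prob(event):
    hits = sum(1 for f in faces if event(f))
    return Fraction(hits, 6)

# Case 1: A = "roll a 1", B = "roll a 2", on the same single roll
A1 = lambda f: f == 1
B1 = lambda f: f == 2
p_both1 = prob(lambda f: A1(f) and B1(f))
print(p_both1, prob(A1) * prob(B1), p_both1 == prob(A1) * prob(B1))   # 0 vs 1/36: not independent

# Case 2: A = "roll a 1 or 2", B = "roll a 2, 4, or 6"
A2 = lambda f: f in (1, 2)
B2 = lambda f: f in (2, 4, 6)
p_both2 = prob(lambda f: A2(f) and B2(f))
print(p_both2, prob(A2) * prob(B2), p_both2 == prob(A2) * prob(B2))   # 1/6 == 1/6: independent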

Dependence uses conditional probability, so we have this kind of relationship:
\[P(A \ and \ B) = P(A  \cap B) = P(A | B) P(B)\]
The expression \(P(A | B)\) means the probability of A given that B has occurred (e.g. the probability the game is canceled given that it's snowed). There are a number of ways to approach finding \(P(A | B)\); the most popular over the last few years has been Bayes' Theorem, which states:
\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]
There's a whole methodology that goes with the Bayesian approach and I'm not going to go into it here, except to say that it's often iterative; we make an initial guess and progressively refine it in the light of new evidence. The bottom line is, this process is much, much harder and much more expensive than assuming independence. 

Monday, July 31, 2023

Essential business knowledge: the Central Limit Theorem

Knowing the Central Limit Theorem means avoiding costly mistakes

I've spoken to well-meaning analysts who've made significant mistakes because they don't understand the implications of one of the core principles of statistics: the Central Limit Theorem (CLT). These errors weren't trivial either; they affected salesperson compensation and the analysis of A/B tests. More personally, I've interviewed experienced candidates who made fundamental blunders because they didn't understand what this theorem implies.

The CLT is why the mean and standard deviation work pretty much all the time but it's also why they only work when the sample size is "big enough". It's why when you're estimating the population mean it's important to have as large a sample size as you can. It's why we use the Student's t-test for small sample sizes and why other tests are appropriate for large sample sizes. 

In this blog post, I'm going to explain what the CLT is, some of the theory behind it (at a simple level), and how it drives key business statistics. Because I'm trying to communicate some fundamental ideas, I'm going to be imprecise in my language at first and add more precision as I develop the core ideas. As a bonus, I'll throw in a different version of the CLT that has some lesser-known consequences.

How we use a few numbers to represent a lot of numbers

In all areas of life, we use one or two numbers to represent lots of numbers. For example, we talk about the average value of sales, the average number of goals scored per match, average salaries, average life expectancy, and so on. Usually, but not always, we get these numbers through some form of sampling, for example, we might run a salary survey asking thousands of people their salary, and from that data work out a mean (a sampling mean). Technically, the average is something mathematicians call a "measure of central tendency" which we'll come back to later.

We know not everyone will earn the mean salary and that in reality, salaries are spread out. We express the spread of data using the standard deviation. More technically, we use something called a confidence interval which is based on the standard deviation. The standard deviation (or confidence interval) is a measure of how close we think our sampling mean is to the true (population) mean (how confident we are).

In practice, we use standard formulas for the mean and standard deviation. These are available as standard functions in spreadsheets and programming languages. Mathematically, this is how they're expressed.

\[sample\; mean\; \bar{x}= \frac{1}{N}\sum_{i=1}^{N}x_i\]

\[sample\; standard\; deviation\; s_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} {\left ( x_i - \bar{x} \right )} ^ 2 } \]

All of this seems like standard stuff, but there's a reason why it's standard, and that's the central limit theorem (CLT).

The CLT

Let's look at three different data sets with different distributions: uniform, Poisson, and power law as shown in the charts below.

These data sets are very, very different. Surely we have to have different averaging and standard deviation processes for different distributions? Because of the CLT, the answer is no. 

In the real world, we sample from populations and take an average (for example, using a salary survey), so let's do that here. To get going, let's take 100 samples from each distribution and work out a sample mean. We'll do this 10,000 times so we get some kind of estimate for how spread out our sample means are.

The top charts show the original population distribution and the bottom charts show the result of this sampling means process. What do you notice?

The distribution of the sampling means is a normal distribution regardless of the underlying distribution.

This is a very, very simplified version of the CLT and it has some profound consequences, the most important of which is that we can use the same averaging and standard deviation functions all the time.
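Here's a sketch of the experiment described above: take 100-item samples from three very different distributions, compute each sample's mean, repeat 10,000 times, and look at how the sample means are distributed. Using a Pareto distribution as the power-law example is my choice for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_samples, sample_size = 10_000, 100

populations = {
    "uniform":   lambda size: rng.uniform(0, 1, size),
    "poisson":   lambda size: rng.poisson(3, size),
    "power law": lambda size: rng.pareto(3, size),
}

for name, draw in populations.items():
    means = [draw(sample_size).mean() for _ in range(n_samples)]
    # a histogram of `means` is close to a normal distribution for every population
    print(name, np.mean(means), np.std(means))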

Some gentle theory

Proving the CLT is very advanced and I'm not going to do that here. I am going to show you through some charts what happens as we increase the sample size.

Imagine I start with a uniform random distribution like the one below. 

I want to know the mean value, so I take some samples and work out a mean for my samples. I do this lots of times and work out a distribution for my mean. Here's what the results look like for a sample size of 2, 3,...10,...20,...30,...40. 

As the sample size gets bigger, the distribution of the means gets closer to a normal distribution. It's important to note that the width of the curve gets narrower with increasing sample size. Once the distribution is "close enough" to the normal distribution (typically, around a sample size of 30), you can use normal distribution methods like the mean and standard deviation.

The standard deviation is a measure of the width of the normal distribution. For small sample sizes, the standard deviation underestimates the width of the distribution, which has important consequences.

Of course, you can do this with almost any underlying distribution; I'm just using a uniform distribution because it's easier to show the results.

Implications for averages

The charts above show how the distribution of the means changes with sample size. At low sample sizes, there are a lot more "extreme" values as the difference between the sample sizes of 2 and 40 shows.  Bear in mind, the width of the distribution is an estimate of the uncertainty in our measurement of the mean.

For small sample sizes, the mean is a poor estimator of the "average" value; it's extremely prone to outliers as the shape of the charts above indicates. There are two choices to fix the problem: either increase the sample size to about 30 or more (which often isn't possible) or use the median instead (the median is much less prone to outliers, but it's harder to calculate).

The standard deviation (and the related confidence interval) is a measure of the uncertainty in the mean value. Once again, it's sensitive to outliers. For small sample sizes, the standard deviation is a poor estimator for the width of the distribution. There are two choices to fix the problem, either increase the sample size to 30 or more (which often isn't possible) or use quartiles instead (for example, the interquartile range, IQR).

If this sounds theoretical, let me bring things down to earth with an example. Imagine you're evaluating salesperson performance based on deals closed in a quarter. In B2B sales, it's rare for a rep to make 30 sales in a quarter; in fact, even half that number might be an outstanding achievement. With a small number of samples, the distribution is very much not normal, and as we've seen in the charts above, it's prone to outliers. So an analysis based on the mean with a standard deviation isn't a good idea; sales data is notorious for outliers. A much better analysis is the median and IQR. This very much matters if you're using the analysis to compare rep performance.
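Here's a small illustration of why the median and IQR are safer for small, outlier-prone samples: one rep with a single huge deal drags the mean and standard deviation around, but barely moves the median and IQR. The deal values are made up.

import numpy as np

deals = np.array([12_000, 15_000, 9_000, 11_000, 14_000, 250_000])   # one outlier deal

print(deals.mean(), deals.std(ddof=1))     # mean ~51,800: dominated by the outlier
print(np.median(deals))                    # median ~13,000: a better "typical" deal size
q1, q3 = np.percentile(deals, [25, 75])
print(q3 - q1)                             # interquartile range: a robust measure of spread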

Implications for statistical tests

A hundred years ago, there were very few large-scale tests, for example, medical tests typically involved small numbers of people. As I showed above, for small sample sizes the CLT doesn't apply. That's why Gosset developed the Student's t-distribution: the sample sizes were too small for the CLT to kick in, so he needed a rigorous analysis procedure to account for the wider-than-normal distributions. The point is, the Student's t-distribution applies when sample sizes are below about 30.

Roll forward 100 years and we're now doing retail A/B testing with tens of thousands of samples or more. In large-scale A/B tests, the z-test is a more appropriate test. Let me put this bluntly: why would you use a test specifically designed for small sample sizes when you have tens of thousands of samples?

It's not exactly wrong to use the Student's t-test for large sample sizes, it's just dumb. The special features of the Student's t-test that enable it to work with small sample sizes become irrelevant. It's a bit like using a spanner as a hammer; if you were paying someone to do construction work on your house and they were using the wrong tool for something simple, would you trust them with something complex?

I've asked about statistical tests at interview and I've been surprised at the response. Many candidates have immediately said Student's t as a knee-jerk response (which is forgivable). Many candidates didn't even know why Student's t was developed and its limitations (not forgivable for senior analytical roles). One or two even insisted that Student's t would still be a good choice even for sample sizes into the hundreds of thousands. It's very hard to progress candidates who insist on using the wrong approach even after it's been pointed out to them.

As a practical matter, you need to know what statistical tools you have available and their limitations.

Implications for sample sizes

I've blithely said that the CLT applies above a sample size of 30. For "most" distributions, a sample size of about 30 is a reasonable rule-of-thumb, but there's no theory behind it. There are cases where a sample size of 30 is insufficient. 

At the time of writing, there's a discussion on the internet about precisely this point. There's a popular article on LessWrong that illustrates how quickly convergence to the normal can happen: https://www.lesswrong.com/posts/YM6Qgiz9RT7EmeFpp/how-long-does-it-take-to-become-gaussian but there's also a counter article that talks about cases where convergence can take much longer: https://two-wrongs.com/it-takes-long-to-become-gaussian

The takeaway from this discussion is straightforward. Most of the time, using a sample size of 30 is good enough for the CLT to kick-in, but occasionally you need larger sample sizes. A good way to test this is to use larger sample sizes and see if there's any trend in the data. 

General implications

The CLT is a double-edged sword: it enables us to use the same averaging processes regardless of the underlying distribution, but it also lulls us into a false sense of security and analysts have made blunders as a result.

Any data that's been through an averaging process will tend to follow a normal distribution. For example, if you were analyzing average school test scores you should expect them to follow a normal distribution, similarly for transaction values by retail stores, and so on. I've seen data scientists claim brilliant data insights by announcing their data is normally distributed, but they got it through an averaging process, so of course it was normally distributed. 

The CLT is one of the reasons why the normal distribution is so prevalent, but it's not the only reason and of course, not all data is normally distributed. I've seen junior analysts make mistakes because they've assumed their data is normally distributed when it wasn't. 

A little more rigor

I've been deliberately loose in my description of the CLT so far so I can explain the general idea. Let's get more rigorous so we can dig into this a bit more. Let's deal with some terminology first.

Central tendency

In statistics, there's something called a "central tendency" which is a measurement that summarizes a set of data by giving a middle or central value. This central value is often called the average. More formally, there are three common measures of central tendency:

  • The mode. This is the value that occurs most often.
  • The median. Rank order the data and this is the middle value.
  • The mean. Sum up all the data and divide by the number of values.

These three measures of central tendency have different properties, different advantages, and different disadvantages. As an analyst, you should know what they are.

(Depending on where you were educated, there might be some language issues here. My American friends tell me that in the US, the term "average" is always a synonym for the mean, in Britain, the term "average" can be the mean, median, or mode but is most often the mean.)

For symmetrical distributions, like the normal distribution, the mean, median, and mode are the same, but that's not the case for non-symmetrical distributions. 

The term "central" in the central limit theorem is referring to the central or "average" value.

iid

If you were taught about the Central Limit Theorem, you were probably taught that it only applies to iid data, which means independent and identically distributed. Here's what iid means. 

  • Each sample in the data is independent of the other samples. This means selecting or removing a sample does not affect the value of another sample.
  • All the samples come from the same probability distribution.
Strictly speaking, the "identically distributed" part can be relaxed: versions of the CLT apply even if the samples come from different distributions (subject to some technical conditions). However, the independence requirement still holds.

When the CLT doesn't apply

Fortunately for us, the CLT applies to almost all distributions an analyst might come across, but there are exceptions. The underlying distribution must have a finite variance, which rules out using it with distributions like the Cauchy distribution. The samples must be iid as I said before.

A re-statement of the CLT

Given data that's distributed with a finite variance and is iid, if we take n samples, then:

  • as \( n \to \infty \), the sample mean converges to the population mean
  • as \( n \to \infty \), the distribution of the sample means approximates a normal distribution

Note this formulation is in terms of the mean. This version of the CLT also applies to sums because the mean is just the sum divided by a constant (the number of samples).

A different version of the CLT

There's another version of the CLT that's not well-known but does come up from time to time in more advanced analysis. The usual version of the CLT is expressed in terms of means (which is the sum divided by a constant). If instead of taking the sum of the samples, we take their product, then instead of the products tending to a normal distribution they tend to a log-normal distribution. In other words, where we have a quantity created from the product of samples then we should expect it to follow a log-normal distribution. 

What should I take away from all this?

Because of the CLT, the mean and standard deviation mostly work regardless of the underlying distribution. In other words, you don't have to know how your data is distributed to do basic analysis on it. BUT the CLT only kicks in above a certain sample size (which can vary with the underlying distribution but is usually around 30) and there are cases when it doesn't apply. 

You should know what to do when you have a small sample size and know what to watch out for when you're relying on the CLT.

You should also understand that any process that sums (or multiplies) data will lead to a normal (or log-normal) distribution.

Monday, November 28, 2022

Is this coin biased?

Tossing and turning

A few months ago, someone commented on one of my blog posts and asked how you work out if a coin is biased or not. I've been thinking about the problem since then. It's not a difficult one, but it does bring up some core notions in probability theory and statistics which are very relevant to understanding how A/B testing works, or indeed any kind of statistical test. I'm going to talk you through how you figure out if a coin is biased, including an explanation of some of the basic ideas of statistical tests.

The trial

A single coin toss is an example of something called a Bernoulli trial, which is any kind of binary decision you can express as a success or failure (e.g. heads or tails). For some reason, most probability texts refer to heads as a success. 

We can work out what the probability is of getting different numbers of heads from a number of tosses, or more formally, what's the probability \(P(k)\) of getting \(k\) heads from \(n\) tosses, where \(0 \leq k \leq n\)? By hand, we can do it for a few tosses (three in this case):

 Number of heads (k)   Combinations     Count   Probability
 0                     TTT              1       1/8
 1                     HTT, THT, TTH    3       3/8
 2                     THH, HTH, HHT    3       3/8
 3                     HHH              1       1/8

But what about 1,000 or 1,000,000 tosses - we can't do this many by hand, so what can we do? As you might expect, there's a formula you can use: 
\[P(k) = \frac{n!} {k!(n-k)!} p^k (1-p)^{n-k}\]
\(p\) is the probability of success in any trial, for example, getting a head. For an unbiased coin \(p=0.5\); for a coin that's biased 70% heads \(p=0.7\). 

If we plot this function for an unbiased coin (\(p=0.5\)), where \(n=100\) and \(0 \leq k \leq n\), we see this probability distribution:

This is called a binomial distribution and it looks a lot like the normal distribution for large (\(> 30\)) values of \(n\). 

I'm going to re-label the x-axis as a score equal to the fraction of heads: 0 means all tails, 0.5 means \(\frac{1}{2}\) heads, and 1 means all heads. With this slight change, we can more easily compare the shape of the distribution for different values of \(n\). 

I've created two charts below for an unbiased coin (\(p=0.5\)), one with \(n=20\) and one with \(n=40\). Obviously, the \(n=40\) chart is narrower, which is easier to see using the score as the x-axis. 

As an illustration of what these charts mean, I've colored all scores of 0.7 and higher in red. You can see the red area is bigger for \(n=20\) than for \(n=40\). Bear in mind, the red area represents the probability of a score of 0.7 or higher. In other words, if you toss a fair coin 20 times, you have a 0.058 chance of seeing a score of 0.7 or more, but if you toss a fair coin 40 times, the probability of a score of 0.7 or more drops to 0.008.
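You can check those numbers directly from the binomial distribution: the probability of a score of 0.7 or more (i.e. at least 70% heads) for a fair coin, for 20 and for 40 tosses.

from scipy.stats import binom

for n in (20, 40):
    k = int(0.7 * n)                      # 14 heads out of 20, 28 heads out of 40
    print(n, binom.sf(k - 1, n, 0.5))     # P(X >= k): ~0.058 for n=20, ~0.008 for n=40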


These charts tell us something useful: as we increase the number of tosses, the curve gets narrower, meaning the probability of getting results further away from \(0.5\) gets smaller. If we saw a score of 0.7 for 20 tosses, we might not be able to say the coin was biased, but if we got a score of 0.7 after 40 tosses, we know this score is very unlikely so the coin is more likely to be biased.

Thresholds

Let me re-state some facts:

  • For any coin (biased or unbiased) any score from 0 to 1 is possible for any number of tosses.
  • Some results are less likely than others; e.g. for an unbiased coin and 40 tosses, there's only a 0.008 chance of seeing a score of 0.7.

We can use probability thresholds to decide between biased and non-biased coins.  We're going to use a threshold (mostly called confidence) of 95% to decide if the coin is biased or not. In the chart below, the red areas represent 5% probability, and the blue areas 95% probability.

Here's the idea to work out if the coin is biased. Set a confidence value, usually at 0.05. Throw the coin \(n\) times, record the number of heads and work out a score. Draw the theoretical probability chart for the number of throws (like the one I've drawn above) and color in 95% of the probabilities blue and 5% red. If the experimental score lands in the red zones, we'll consider the coin to be biased, if it lands in the blue zone, we'll consider it unbiased.

This is probabilistic decision-making. Using a confidence threshold of 0.05 means we'll wrongly say a fair coin is biased 5% of the time. Can we make the threshold stricter, say 0.01? Yes, we could, but the cost is increasing the number of trials.

As you might expect, there are shortcuts and we don't actually have to draw out the chart. In Python, you can use the binom_test function in the scipy.stats package (in newer versions of SciPy it's been replaced by binomtest). 

To simplify, binom_test has three arguments:

  • x - the number of successes
  • n - the number of samples
  • p - the hypothesized probability of success
It returns a p-value which we can use to make a decision.

Let's see how this works with a confidence of 0.05. Let's take the case where we have 200 coin tosses and 140 (70%) of them come up heads. We're hypothesizing that the coin is fair, so \(p=0.5\).

from scipy import stats
print(stats.binom_test(x=140, n=200, p=0.5))

The p-value we get is 1.5070615573524992e-08, which is way less than our confidence threshold of 0.05 (we're in the red area of the chart above). We would then reject the idea that the coin is fair.

What if we got 115 heads instead?

from scipy import stats
print(stats.binom_test(x=115, n=200, p=0.5))

This time, the p-value is 0.10363903843786755, which is greater than our confidence threshold of 0.05 (we're in the blue area of the chart), so the result is consistent with a fair coin (we fail to reject the null).

What if my results are not significant? How many tosses?

Let's imagine you have reason to believe the coin is biased. You throw it 200 times and you see 115 heads. binom_test tells you you can't conclude the coin is biased. So what do you do next?

The answer is simple, toss the coin more times.

The formula for the sample size, \(n\), is:

\[n = \frac{p(1-p)} {\sigma^2}\]

where \(\sigma\) is the standard error. 

Here's how this works in practice. Let's assume we think our coin is just a little biased, to 0.55, and we want the standard error to be \(\pm 0.04\). Here's how many tosses we would need: 154. What if we want more certainty, say \(\pm 0.005\), then the number of tosses goes up to 9,900. In general, the bigger the bias, the fewer tosses we need, and the more certainty we want the more tosses we need. 
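The formula above is simple enough to wrap in a small helper: p is the bias you want to be able to detect and se is the standard error you're willing to accept.

def tosses_needed(p, se):
    # n = p(1-p) / sigma^2, where sigma is the standard error
    return p * (1 - p) / se ** 2

print(tosses_needed(0.55, 0.04))    # ~154.7: the roughly 154 tosses quoted above
print(tosses_needed(0.55, 0.005))   # 9,900 tosses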

If I think my coin is biased, what's my best estimate of the bias?

Let's imagine I toss the coin 1,000 times and see 550 heads. binom_test tells me the result is significant and it's likely my coin is biased, but what's my estimate of the bias? This is simple, it's actually just the mean, so 0.55. Using the statistics of proportions, I can actually put a 95% confidence interval around my estimate of the bias of the coin. Through math I won't show here, using the data we have, I can estimate the coin is biased 0.55 ± 0.03.
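Here's a sketch of that confidence interval calculation, using the standard 95% normal-approximation interval for a proportion.

import math

heads, n = 550, 1000
p_hat = heads / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se

print(p_hat, margin)   # 0.55 ± ~0.031, matching the 0.55 ± 0.03 quoted above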

Is my coin biased?

This is a nice theoretical discussion, but how might you go about deciding if a coin is biased? Here's a step-by-step process.

  1. Decide on the level of certainty you want in your results. 95% is a good measure.
  2. Decide the minimum level of bias you want to detect. If the coin should return heads 50% of the time, what level of bias can you live with? If it's biased to 60%, is this OK? What about biased to 55% or 50.5%?
  3. Calculate the number of tosses you need.
  4. Toss your coin.
  5. Use binom_test to figure out if the coin deviates significantly from 0.5.


Friday, January 7, 2022

Prediction, distinction, and interpretation: the three parts of data science

What does data science boil down to?

Data science is a relatively new discipline that means different things to different people (most notably, to different employers). Some organizations focus solely on machine learning, while others lean on interpretation, and yet others get close to data engineering. In my view, all of these are part of the data science role. 

I would argue data science generally is about three distinct areas:

  • Prediction. The ability to accurately extrapolate from existing data sets to make forecasts about future behavior. This is the famous machine learning aspect and includes solutions like recommender systems.
  • Distinction. The key question here is: "are these numbers different?". This includes the use of statistical techniques to decide if there's a difference or not, for example, specifying an A/B test and explaining its results. 
  • Interpretation. What are the factors that are driving the system? This is obviously related to prediction but has similarities to distinction too.

(A similar view of data science to mine: Calvin.Andrus, CC BY-SA 3.0, via Wikimedia Commons)

I'm going to talk through these areas and list the skills I think a data scientist needs. In my view, to be effective, you need all three areas. The real skill is to understand what type of problem you face and to use the correct approach.

Distinction - are these numbers different?

This is perhaps the oldest area and the one you might disagree with me on. Distinction is firmly in the realm of statistics. It's not just about A/B tests or quasi-experimental tests, it's also about evaluating models too.

Here's what you need to know:

  • Confidence intervals.
  • Sample size calculations. This is crucial and often overlooked by experienced data scientists. If your data set is too small, you're going to get junk results, so you need to know what too small is. In the real world, increasing the sample size is often not an option and you need to know why.
  • Hypothesis testing. You should know the difference between a t-test and a z-test and when a z-test is appropriate (hint: sample size).
  • α, β, and power. Many data scientists have no idea what statistical power is. If you're doing any kind of statistical testing, you need to have a firm grasp of power.
  • The requirements for running a randomized control trial (RCT). Some experienced data scientists have told me they were analyzing results from an RCT, but their test just wasn't an RCT - they didn't really understand what an RCT was.
  • Quasi-experimental methods. Sometimes, you just can't run an RCT, but there are other methods you can use including difference-in-difference, instrumental variables, and regression discontinuity.  You need to know which method is appropriate and when. 
  • Regression to the mean. This is why you almost always need a control group. I've seen experienced data scientists present results that could almost entirely be explained by regression to the mean. Don't be caught out by one of the fundamentals of statistics.

Prediction - what will happen next?

This is the piece of data science that gets all the attention, so I won't go into too much detail.

Here's what you need to know:

  • The basics of machine learning models, including:
    • Generalized linear modeling
    • Random forests (including knowing why they are often frowned upon)
    • k-nearest neighbors/k-means clustering
    • Support Vector Machines
    • Gradient boosting.
  • Cross-validation, regularization, and their limitations.
  • Variable importance and principal component analysis.
  • Loss functions, including RMSE.
  • The confusion matrix, accuracy, sensitivity, specificity, precision-recall and ROC curves.

There's one topic that's not on any machine learning course or in any machine learning book that I've ever read, but it's crucially important: knowing when machine learning fails and when to stop a project.  Machine learning doesn't work all the time.

Interpretation - what's going on?

The main techniques here are often data visualization. Statistical summaries are great, but they can often mislead. Charts give a fuller picture. 

Here are some techniques all data scientists should know:

  • Heatmaps
  • Violin plots
  • Scatter plots and curve fitting
  • Bar charts
  • Regression and curve fitting.

They should also know why pie charts in all their forms are bad. 

A good knowledge of how charts work is very helpful too (the psychology of visualization).

What about SQL and R and Python...?

You need to be able to manipulate data to do data science, which means SQL, Python, or R. But plenty of people use these languages without being data scientists. In my view, despite their importance, they're table stakes.

Book knowledge vs. street knowledge

People new to data science tend to focus almost exclusively on machine learning (prediction in my terminology) which leaves them very weak on data analysis and data exploration; even worse, their lack of statistical knowledge sometimes leads them to make blunders on sample size and loss functions. No amount of cross-validation, regularization, or computing power will save you from poor modeling choices. Even worse, not knowing statistics can lead people to produce excellent models of regression to the mean.

Practical experience is hugely important; way more important than courses. Obviously, a combination of both is best, which is why PhDs are highly sought after; they've learned from experience and have the theoretical firepower to back up their practical knowledge.