
Friday, January 9, 2026

I've got the power: what statistical power means

Important, but overlooked

Power is a crucial number to understand for hypothesis tests, but sadly, many courses omit it, and when it is covered it's often poorly understood. To be clear: if you're doing any kind of A/B testing, you have to understand power.

In this blog post, I'm going to teach you all about power.

Hypothesis testing

All A/B tests, all randomized controlled trials (RCTs), and many other forms of testing are ultimately hypothesis tests; I've blogged about what this means before. To briefly summarize and simplify, we make a statement and measure the evidence in favor of or against the statement, using thresholds to make our decision.

With any hypothesis test, there are four possible outcomes (using simplified language):

  • The null hypothesis is actually true (there is no effect)
    • We say there is no effect (true negative)
    • We say there is an effect (false positive)
  • The null hypothesis is actually false (there is an effect)
    • We say there is no effect (false negative)
    • We say there is an effect (true positive)

I've summarized the possibilities in the table below.

Decision about null hypothesis | Null hypothesis is true | Null hypothesis is false
Fail to reject | True negative (correct inference); probability threshold = 1 - \( \alpha \) | False negative (Type II error); probability threshold = \( \beta \)
Reject | False positive (Type I error); probability threshold = \( \alpha \) | True positive (correct inference); probability threshold = power = 1 - \( \beta \)

A lot of attention goes on \(\alpha\), called the significance level, which tells us the probability of a false positive. By contrast, power is the probability of detecting an effect if it's really there (a true positive); sadly, it doesn't get nearly the same level of focus.

By the way, there's some needless complexity here. It would seem more sensible to quote the two thresholds as \( \alpha \) and \( \beta \) because they're defined very similarly (false positive and false negative probabilities). Unfortunately, statisticians tend to quote power rather than \( \beta \).

In pictures

To get a visual sense of what power is, let's look at how a null hypothesis test works in pictures. Firstly, we assume the null is true and we draw out acceptance and rejection regions on the probability distribution (first chart). To reject the null, our test results have to land in the red rejection regions in the top chart.

Now we assume the alternate hypothesis is true (second chart). We want to land in the blue region in the second chart, and we want a certain probability (power), or more, of landing in the blue region.

To be confident there is an effect, we want the power to be as high as possible.

Calculating power - before and after

Before we run a test, we calculate the sample size we need based on a couple of factors, including the power we want the test to have. For reasons I'll explain later, 80% or 0.8 is a common choice. 

Once we've run the test and we have the test results, we then calculate the actual power based on the data we've recorded in our test. It's very common for the actual power to be different from what we specified in our test design. If the actual power is too low, that may mean we have to continue the test or redesign it.

Unfortunately, power is hard to calculate; there's no convenient closed-form formula, and to make matters worse, some of the websites that offer power and sample size calculations give incorrect results. The G*Power package is probably the easiest tool for most people to use, though there are convenient libraries in R and Python that will calculate power for you. If you're going to understand power, you really do need to understand statistics.

To make all this understandable, let me walk you through a sample size calculation for a conversion rate A/B test for a website. 

  • A/B tests are typically large with thousands of samples, which means we're in z-test territory rather than t-test. 
  • We also need to decide what we're testing for. A one-sided test tests for a difference in one direction only (either greater than or less than); a two-sided test tests for a difference in either direction. Two-sided tests are more common because they're more informative. Some authors use the terms one-tailed and two-tailed instead of one-sided and two-sided. 
  • Now we need to define the thresholds for our test, which are \( \alpha \)  and power. Common values are 0.05 and 0.8.  
  • Next up, we need to look at the effect. In the conversion test example, we might have a conversion rate of 2% on one branch and an expected conversion rate of 2.2% on the other branch. 
We can put all this into G*Power and here's what we get.

Test type | Tail(s) | \( \alpha \) | Power | Proportion 1 | Proportion 2 | Sample size
z-test | Two-tailed | 0.05 | 0.8 | 0.02 | 0.022 | 161,364
z-test | Two-tailed | 0.05 | 0.95 | 0.02 | 0.022 | 267,154

The first row of the table shows a power of 80%, which leads to a sample size of 161,364. Increasing the power to 95% gives a sample size of 267,154, a big increase, and that's a problem. Power varies non-linearly with sample size, as I've shown in the screenshot below for this data (from G*Power).
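If you don't have G*Power to hand, here's a minimal sketch of the same sample size calculation in Python, using the textbook pooled two-proportion z-test formula (small differences from G*Power's numbers come from rounding and the exact options chosen):

from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    # Per-branch sample size for a two-sided, two-proportion z-test
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the rejection region
    z_beta = norm.ppf(power)            # quantile corresponding to the desired power
    p_bar = (p1 + p2) / 2               # pooled proportion under the null
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

n = sample_size_two_proportions(0.02, 0.022, alpha=0.05, power=0.8)
print(round(n), "per branch, roughly", round(2 * n), "in total")   # close to the 161,364 above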

Conversion rates of 2% are typical for many retail sites. It's very rare that any technology will increase the conversion rate greatly. A 10% increase from 2% to 2.2% would be wonderful for a retailer and they'd be celebrating. Because of these numbers, you need a lot of traffic to make A/B tests work in retail, which means A/B tests can really only be used by large retailers.

Why not just reduce the power and so reduce the sample size? Because that makes the results of the test less reliable; at some point, you might as well just flip a coin instead of running a test. A lot of A/B tests are run when a retailer is testing new ideas or new paid-for technologies. An A/B test is there to provide a data-oriented view of whether the new thing works or not. The thresholds are there to give you a known confidence in the test results. 

After a test is done, or even partway through the test, we can calculate the observed power. Let's use G*Power and the numbers from the first row of the table above, but assume a sample size of 120,000. This gives a power of 0.67, way below what's useful and too close to a 50-50 split. Of course, it's possible that we observe a smaller effect than expected, and you can experiment with G*Power to vary the effect size and see the effect on power.
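Here's a sketch of the same calculation in the other direction: given a total sample size, what power do we get? (Same textbook two-proportion z-test approximation as before; G*Power's exact answer may differ slightly.)

from scipy.stats import norm

def power_two_proportions(p1, p2, n_total, alpha=0.05):
    # Approximate power of a two-sided, two-proportion z-test with n_total split evenly
    n = n_total / 2
    p_bar = (p1 + p2) / 2
    se_null = (2 * p_bar * (1 - p_bar) / n) ** 0.5           # standard error assuming no effect
    se_alt = ((p1 * (1 - p1) + p2 * (1 - p2)) / n) ** 0.5    # standard error assuming the effect is real
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf((abs(p1 - p2) - z_alpha * se_null) / se_alt)

print(round(power_two_proportions(0.02, 0.022, 120_000), 3))   # about 0.68, in line with the 0.67 from G*Power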

A nightmare scenario

Let's imagine you're an analyst at a large retail company. There's a new technology which costs $500,000 a year to implement. You've been asked to evaluate the technology using an A/B test. Your conversion rate is 2% and the new technology promises a conversion rate of 2.2%. You set \(\alpha\) to 0.05, and power to 0.8 and calculate a sample size (which also gives you a test duration). The null hypothesis is that there is no effect (conversion rate of 2%) and the alternate hypothesis is that the conversion rate is 2.2%.

Your boss will ask you "how sure are you of these results?". If you say there's no effect, they will ask you "how sure are you there's no effect?", if you say there is an effect, they will ask you "how sure are you there is an effect"? Think for a moment how you'd ideally like to answer these questions (100% sure is off the cards). The level of surety you can offer depends on your website traffic and the test.

When the test is over, you calculate a p-value of 0.01, which is less than your \(\alpha\), so you reject the null hypothesis. In other words, you think there's an effect. Next you calculate power. Let's say you get a 0.75. Your threshold for accepting a conversion rate of 2.2% is 0.8. What's next?

It's quite possible that the technology works, but just doesn't increase the conversion rate to 2.2%. It might increase conversion to 2.05% or 2.1%, for example. These kinds of conversion rate lifts might not justify the cost of the technology.

What do you do?

You have four choices, each with positives and negatives.

  1. Reject the new technology because it didn't pass the test. This is a fast decision, but you run the risk of foregoing technology that would have helped the business.
  2. Carry on with the test until it reaches your desired power. Technically, the best, but it may take more time than you have available.
  3. Accept the technology with the lower power. This is a risky bet, and it's very dangerous to do regularly (lower thresholds mean you make more mistakes).
  4. Try a test with a lower lift, say an alternate hypothesis that the conversion rate is 2.1%.

None of these options are great. You need strong statistics to decide on the right way forward for your business.

(A/B testing was painted as an easy-to-use wonder technique. The reality is, it just isn't.)

What's a good value?

The "industry standard" power is 80%, but where does this come from? It's actually a quote from Michael Cohen in his 1988 book "Statistical Power Analysis for the Behavioral Sciences", he said if you're stuck and can't figure out what the power should be, use 80% as a last result. Somehow the value of last resort has become an unthinking industry standard. But what value should you chose?

Let's go back to the definitions of \( \alpha \) and \( \beta \) (remember, \( \beta \) is 1 - power). \( \alpha \) corresponds to the probability of a false positive, \( \beta \) corresponds to the probability of a false negative. How do you balance these two false results? Do you think a false positive is equally as bad as a false negative, or do you think it's better or worse? The industry standard choices for \( \alpha \) and \( \beta \) are 0.05 and 0.20 (1 - 0.8), which means we're treating a false positive as four times worse than a false negative. Is that what you intended? Is that ratio appropriate for your business?

In retail, including new technologies on a website comes with a cost, but there's also the risk of forgoing revenue if you get a false negative. I'm tempted to advise you to choose the same \( \alpha \) and \( \beta \) value of 0.05 (which gives a power of 95%). This does increase the sample size and may take it beyond the reach of some websites. If you're bumping up against the limits of your traffic when designing tests, it's probably better to use something other than an A/B test.

Why is power so misunderstood?

Conceptually, it's quite simple (the probability of a true positive), but it's wrapped up with the procedure for defining and using a null hypothesis test. Frankly, the whole null hypothesis setup is complex and unsatisfactory (Bayesian statistics may offer a better approach). My gut feeling is that \( \alpha \) is easy to understand, but once you get into the full language of null hypothesis testing, people get left behind, which means they don't understand power.

Not understanding power leaves you prone to making bad mistakes, like under-powering tests. An underpowered test might mean you reject technologies that could increase conversion rate. Conversely, under-powered tests can lead you to claim a bigger effect than is really there. Overall, it leaves you vulnerable to making the wrong decision.

Thursday, December 18, 2025

The Skellam distribution

Distributions, distributions everywhere

There are a ton of distributions out there; SciPy alone has well over a hundred and that's nowhere near a complete set. I'm going to talk about one of the lesser known distributions, the Skellam distribution, and what it's useful for. My point is a simple one: it's not enough for data scientists to know the main distributions, they must be aware that other distributions exist and have real-world uses.

Overview of the Skellam distribution

It's easy to define the Skellam distribution: it's the difference between two Poisson distributions, or more formally, the difference between two Poisson distributed random variables. 

So we don't get lost in the math, here's a picture of a Skellam distribution.

If you really must know, here's how the PMF is defined mathematically:

\[ P(Z = k; \mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_k(2\sqrt{\mu_1 \mu_2}) \] where \(I_k(x)\) is the modified Bessel function of the first kind: \[ I_k(x) = \sum_{j=0}^{\infty} \frac{1}{j!(j+|k|)!} \left(\frac{x}{2}\right)^{2j+|k|} \]

This all looks very complicated, but by now (2025) it's easy to code up. Here's the SciPy code to calculate the PMF:

probabilities = stats.skellam.pmf(k=k_values, mu1=mu1, mu2=mu2)
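To make that snippet self-contained, here's a minimal, runnable sketch (the two means and the range of k values are just illustrative choices):

import numpy as np
from scipy import stats

mu1, mu2 = 5.0, 3.0                    # means of the two underlying Poisson variables (illustrative)
k_values = np.arange(-10, 21)          # range of differences to evaluate
probabilities = stats.skellam.pmf(k=k_values, mu1=mu1, mu2=mu2)
print(probabilities.sum())             # close to 1 over a wide enough range of k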

What use is it?

Here are just a few uses I found:

  • Finance: modeling price changes between trades.
  • Medicine: modeling the change in the number of beds in an ICU, epileptic seizure counts during drug trials, differences in reported AIDS cases, and so on.
  • Sports: differences in home and away team football or hockey scores.
  • Technology: modeling sensor noise in cameras.

Where did it come from?

Skellam published the original paper on this distribution in 1946. There isn't a lot of background on why he did the work and, as far as I can tell, it wasn't related to World War II research work in any way. It's only really been discussed more widely since people discovered its use for modeling sports scores. It's been available as an off-the-shelf distribution in SciPy for over a decade now.

As an analyst, what difference does this make to you?

I worked in a place where the data we analyzed wasn't normally distributed (which isn't uncommon; a lot of data sets aren't normally distributed), so it was important that everyone knew at least something about non-normal statistics. I interviewed job candidates for some senior positions and asked them how they would analyze some obviously non-normal data. Far too many of them suggested using methods only suitable for normally distributed data. Some candidates had Master's degrees in relevant areas and told me they had never been taught how to analyze non-normal data, and even worse, they never looked into it themselves. This was a major warning sign for us when recruiting.

Let's imagine you're given a new data set in a new area and you want to model it. It's obviously not normal, so what do you do? In these cases, you need to have an understanding of what other distributions are out there and their general shape and properties. You should just be able to look at data and guess a number of distributions that could work. You don't need to have an encyclopedic knowledge of them all, you just need to know they exist and you should know how to use a few of them. 

Friday, October 10, 2025

Regression to the mean

An unfortunate phrase with unfortunate consequences

"Regression to the mean" is a simple idea that has profound consequences. It's led people astray for decades, if not centuries. I'm going to explain what it is, the consequences of not understanding it, and what you can do to protect yourself and your organization.

Let's give a simple definition for now: it's the tendency, when sampling data, for more extreme values to be followed by values closer to the mean. Here's an example: if I give the same children IQ tests over time, I'll see very high scores followed by more average scores, and some very low scores followed by more average scores. It doesn't mean the children are improving or getting worse, it's just regression to the mean. The problems occur when people attach a deeper meaning, as we'll see.

(Francis Galton, popularizer of "Regression to the mean")

What it means - simple examples

I'm going to start with an easy example that everyone should be familiar with, a simple game with a pack of cards.

  • Take a standard pack of playing cards and label the cards in each suit 1 to 13 (Ace is 1, 2 is 2, Jack is 11, etc.). The mean card value is 7.5. 
  • Draw a card at random. 
  • Imagine it's a Queen (12). Now, replace the card and draw another card. Is it likely the card will have a lower value or a higher value? 
    • The probability is 11/13 that it will have a lower value. 
  • Now imagine you drew an ace (1), replace the card and draw again. 
    • The probability of drawing another ace is 1/13.
    • The probability of drawing a 2 or higher is 12/13. 
It's obvious in this example that "extreme" value cards are very likely to be followed by more "average" value cards. This is regression to the mean at work. It's nothing complex, just a probability distribution at work.

The cards example seems simple and obvious. Playing cards are very familiar and we're comfortable with randomness (in fact, almost all card games rely on randomness). The problem occurs when we have real measurements: we tend to attach explanations to the data when randomness (and regression to the mean) is all that's there.

Let's say we're measuring the average speed of cars on a freeway. Here are 100 measurements of car speeds. What would you conclude about the freeway? What pattern can you see in the data and what does it tell you about driver behavior (e.g. lower speeds following higher speeds and vice versa)? What might cause it? 

['46.7', '63.3', '80.0', '71.7', '34.2', '55.0', '67.5', '34.2', '67.5', '67.5', '59.2', '63.3', '55.0', '34.2', '63.3', '63.3', '63.3', '59.2', '75.8', '71.7', '42.5', '42.5', '34.2', '34.2', '59.2', '67.5', '59.2', '71.7', '71.7', '67.5', '50.8', '63.3', '34.2', '63.3', '30.0', '38.3', '50.8', '34.2', '75.8', '75.8', '46.7', '80.0', '55.0', '46.7', '38.3', '38.3', '75.8', '59.2', '34.2', '42.5', '71.7', '71.7', '80.0', '80.0', '71.7', '34.2', '63.3', '71.7', '46.7', '42.5', '46.7', '46.7', '63.3', '80.0', '80.0', '38.3', '38.3', '46.7', '38.3', '34.2', '46.7', '75.8', '55.0', '30.0', '55.0', '75.8', '30.0', '42.5', '67.5', '30.0', '50.8', '67.5', '67.5', '71.7', '67.5', '67.5', '42.5', '75.8', '75.8', '34.2', '55.0', '50.8', '38.3', '71.7', '46.7', '71.7', '50.8', '71.7', '42.5', '42.5']

Let's imagine the authorities introduced a speed camera at the measurement I've indicated in red. What might you conclude about the effect of the speed camera?

You shouldn't conclude anything at all from this data. It's entirely random. In fact, it has the same probability distribution as the pack of cards example. I've used 13 different average speeds, each with the same probability of occurrence. What you're seeing is the result of me drawing cards from a pack and giving them floating point numbers like 71.7 instead of a number like 9. The speed camera had no effect in this case. The data set shows regression to the mean and nothing more.
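If you want to generate data like this yourself, here's a minimal sketch; the 13 equally spaced speeds reproduce the same mapping of card values to speeds (your random sequence will differ, of course):

import numpy as np

rng = np.random.default_rng()
speeds = np.linspace(30.0, 80.0, 13)     # 13 equally likely "average speeds", one per card value
samples = rng.choice(speeds, size=100)   # 100 independent draws with replacement
print(np.round(samples, 1))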

The pack of cards and the vehicles example are exactly the same example. In the pack of cards case, we understand randomness and we can intuitively see what regression to the mean actually means. Once we have a real world problem, like the cars on the freeway, our tendency is to look for explanations that aren't there and we discount randomness. Looking for meaning in random data has had bad consequences, as we'll see.

Schools example

In the last few decades in the US, several states have introduced standardized testing to measure school performance. Students in the same year group take the same test and, based on the results, the state draws conclusions about the relative standing of schools; it may intervene in low performing schools. The question is, how do we measure the success of these interventions? Surely, we would expect to see an improvement in test scores taken the next year? In reality, it's not so simple.

The average test result for a group of students will obviously depend on things like teaching, prior attainment etc. But there are also random factors at work. Individual students might perform better or worse than expected due to sickness, or family issues, or a host of other random issues. Of course, different year groups in the same school might have a different mix of abilities. All of which means that regression to the mean should show up in consecutive tests. In other words, low performing schools might show an improvement and high performing schools might show a degradation entirely due to random factors.

This isn't a theoretical example: regression to the mean has been clearly shown in school scores in Massachusetts, California and in other states (see Haney, Smith & Smith). Sadly, state politicians and civil servants have intervened based on scores and drawn conclusions where they shouldn't.

Children's education evokes a lot of emotion and political interest, which is not a good mix. It's important to understand concepts like regression to the mean so we can better understand what's really going on.

Heights example

"Regression to the mean" was originally called "regression to mediocrity", and was based on the study of human heights. If regression to mediocrity sounds very disturbing, it should do. It's closely tied to eugenics through Francis Galton. I'm not going to dwell on the links between statistics and eugenics here, but you should know the origins of statistics aren't sin free.

In 1880s England, Galton studied the heights of parents and their children. I've reproduced some of his results below. He found that parents who were above average height tended to have children closer to the average height, and that parents below average height also tended to have children closer to the average height. This is the classic regression to the mean example. 

Think for a moment about possible different outcomes of a study like this. If taller parents had taller children, and shorter parents had shorter children, then we might expect to see two population groups emerging (short people and tall people) and maybe the start of speciation. Conversely, if tall parents had short children, and short parents had tall children, this would be very noticeable and commented on. Regression to the mean turns out to be a good explanation of what we observe in nature.

Galton's height study was very influential for both the study of genetics and the creation of statistics as a discipline.

New sports players

Let's take a cohort of baseball players in their first season. Obviously, talent makes a difference, but there are random factors at play too. We might expect some players to do extremely well, others to do well, some to do OK, some to do poorly, and some to do very poorly. Regression to the mean tells us that some standout players may well perform worse the next year. Other, lower-ranked players will perform better for the same reason. The phenomenon of new outstanding players performing worse in their second year is often called the "sophomore slump" and a lot has been written about it, but in reality, it can mostly be explained by regression to the mean.


Business books

Popular business books often fall into the regression to the mean trap. Here's what happens. A couple of authors do an analysis of top performing businesses, usually measured by stock price, and find some commonalities. They develop these commonalities into a framework and write a best-selling business book whose thesis is, if you follow the framework, you'll be successful. They follow this with another book that's not quite as good. Then they write a third book that only the true believers read.

Unfortunately, the companies they select as winners don't do as well over a decade or more, and the longer the timescale, the worse the performance. Over the long run, the authors' promise that they've found the elixir of success is shown to be untrue. Their books go from the best seller list to the remainder bucket.

A company's stock price is determined by many factors, for example, its competitors, the state of the market, and so on. Only some of them are under the control of the company. Conditions change over time in unpredictable ways. Regression to the mean suggests that great stock price performers now might not be in the future, and low performers may do better. Regression to the mean neatly explains why picking winners today does not mean the same companies will be winners in the years to come. In other words, basic statistics makes a mockery of many business books.

Reading more:

  • The Halo Effect: . . . and the Eight Other Business Delusions That Deceive Managers - Phil Rosenzweig 

My experience

I've seen regression to the mean pop up in all kinds of business data sets and I've seen people make the classic mistake of trying to derive meaning from randomness. Here are some examples.

Sales data has a lot of random fluctuations, and of course, the smaller the sample, the greater the fluctuations. I've seen sales people have a stand out year followed by a very average year and vice versa. I've seen the same pattern at a regional and country level too. Unfortunately, I've also seen analysts tie themselves in knots trying to explain these patterns. Even worse, they've made foolish predictions based on small sample sets and just a few years' worth of data.

I've seen very educated people get very excited by changes in company assessment data. They think they've spotted something significant because companies that performed well one year tended to perform a bit worse the next etc. Regression to the mean explained all the data.

How not to be fooled

Regression to the mean is hidden in lots of data sets and can lead you into making poor decisions. If you're analyzing a dataset, here are some questions to ask:

  • Is your data the result of some kind of sampling process? 
  • Does randomness play a part in your collection process or in the data?
  • Are there unknowns that might influence your data?

If the answer to any of these questions is yes, you should assume you'll find regression to the mean in your dataset. Be careful about your analysis and especially careful about explaining trends. Of course, the smaller your data set, the more vulnerable you are.

You can estimate the effect of regression to the mean on your data using a variety of methods. I'm not going to go into them in depth here because I don't want to make this blog post too long. In the literature, you'll see references to running a randomized controlled trial (RCT), also known as an A/B test. That's great in theory, but the reality is that it's not appropriate for most business situations. In practice, you'll have to run simulations or do some straightforward estimation of the fractional regression to the mean.
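As an example of the simulation approach, here's a minimal sketch: each score is a stable skill plus random luck, we pick the top 10% in year one, and their average falls back toward the overall mean in year two. (All the numbers are made up.)

import numpy as np

rng = np.random.default_rng(42)
n = 10_000
skill = rng.normal(100, 10, n)            # stable underlying ability
year1 = skill + rng.normal(0, 10, n)      # observed score = skill + luck
year2 = skill + rng.normal(0, 10, n)      # fresh luck the following year

top = year1 >= np.percentile(year1, 90)   # the "standout performers" in year one
print(round(year1[top].mean(), 1), round(year2[top].mean(), 1))   # the year two mean regresses toward 100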

Monday, March 10, 2025

Everything you wanted to know about the normal distribution but were afraid to ask

Normal is all around you, and so is not-normal

The normal distribution is the most important statistical distribution. In this blog post, I'm going to talk about its properties, where it occurs, and why it's so very important. I'm also going to talk about how using the normal distribution when you shouldn't can lead to disaster and what you can do about it.

(Ainali, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons)

A rose by any other name

The normal distribution has a number of different names in different disciplines:

  • Normal distribution. This is the name used by statisticians and data scientists.
  • Gaussian distribution. This is what physicists call it.
  • The bell curve. This is the name used by social scientists and by people who don't understand statistics.

I'm going to call it the normal distribution in this blog post, and I'd advise you to call it this too. Even if you're not a data scientist, using the most appropriate name helps with communication.

What it is and what it looks like

When we're measuring things in the real world, we see different values. For example, if we measure the heights of 10 year old boys in a town, we'd see some tall boys, some short boys, and most boys around the "average" height. We can work out what fraction of boys are a certain height and plot a chart of frequency on the y axis and height on the x axis. This gives us a probability or frequency distribution. There are many, many different types of probability distribution, but the normal distribution is the most important.

(As an aside, you may remember making histograms at school. These are "sort-of" probability distributions. For example, you might have recorded the height of all the children in a class, grouped them into height ranges, counted the number of children in each height range, and plotted the chart. The y axis would have been a count of how many children in that height range. To turn this into a probability distribution, the y axis would become the fraction of all children in that height range. )

Here's what a normal probability distribution looks like. Yes, it's the classic bell curve shape which is exactly symmetrical.


The formula describing the curve is quite complex, but all you need to know for now is that it's described by two numbers: the mean (often written \(\mu\)) and a standard deviation (often written \(\sigma\)). The mean tells you where the peak is and the standard deviation gives you a measure of the width of the curve. 

To greatly summarize: values near the mean are the most likely to occur and the further you go from the mean, the less likely they are. This lines up with our boys' heights example: there aren't many very short or very tall boys and most boys are around the mean height.

Obviously, if you change the mean or the standard deviation, you change the curve; for example, you can change the location of the peak or make the curve wider or narrower. It turns out that changing the mean and standard deviation just shifts and scales the curve because of its underlying mathematical properties. Most distributions don't behave like this; changing their parameters can greatly change the entire shape of the distribution (for example, the beta distribution wildly changes shape if you change its parameters). The normal scaling property has some profound consequences, but for now, I'll just focus on one: we can easily map all normal distributions to one standard normal distribution. Because the properties of the standard normal are known, we can easily do math on the standard normal. To put it another way, it greatly speeds up what we need to do.
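Here's a small sketch of that mapping in SciPy: converting a value to a z-score and using the standard normal gives the same probability as using the original distribution (the mean and standard deviation below are arbitrary):

from scipy.stats import norm

mu, sigma = 12.0, 3.0                      # an arbitrary normal distribution
x = 15.0
z = (x - mu) / sigma                       # map the value onto the standard normal
print(norm.cdf(x, loc=mu, scale=sigma))    # probability using the original distribution
print(norm.cdf(z))                         # same probability using the standard normal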

Why the normal distribution is so important

Here are some normal distribution examples from the real world.

Let's say you're producing precision bolts. You need to supply 1,000 bolts of a precise specification to a customer. Your production process has some variability. How many bolts do you need to manufacture to get 1,000 good ones? If you can describe the variability using a normal distribution (which is the case for many manufacturing processes), you can work out how many you need to produce.
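Here's a rough sketch of the bolts calculation (every number here is made up for illustration):

from scipy.stats import norm

mu, sigma = 10.0, 0.02       # bolt diameter: process mean and standard deviation in mm
lo, hi = 9.95, 10.05         # the customer's specification limits
p_good = norm.cdf(hi, loc=mu, scale=sigma) - norm.cdf(lo, loc=mu, scale=sigma)   # fraction of bolts in spec
print(round(1000 / p_good))  # expected production run needed to end up with 1,000 good bolts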

Imagine you're outfitting an army and you're buying boots. You want to buy the minimum number of boots while still fitting everyone. You know that many body dimensions follow the normal distribution (most famously, chest circumference), so you can make a good estimate of how many boots of different sizes to buy.

Finally, let's say you've bought some random stocks. What might the daily change in value be? Under usual conditions, the change in value follows a normal distribution, so you can estimate what your portfolio might be worth tomorrow.

It's not just these three examples, many phenomena in different disciplines are well described by the normal distribution.

The normal distribution is also common because of something called the central limit theorem (CLT). Let's say I'm taking measurement samples from a population, e.g. measuring the speed of cars on a freeway. The CLT says that, for a large enough sample size, the distribution of the sample means will follow a normal distribution regardless of the underlying distribution. In the car speed example, I don't know how the speeds are distributed, but I can calculate a mean and know how certain I am that the mean value is the true (population) mean. This sounds a bit abstract, but it has profound consequences in statistics and means that the normal distribution comes up time and time again.

Finally, it's important because it's so well-known. The math to describe and use the normal distribution has been known for centuries. It's been written about in hundreds of textbooks in different languages. More importantly, it's very widely taught; almost all numerate degrees will cover it and how to use it. 

Let's summarize why it's important:

  • It comes up in nature, in finance, in manufacturing etc.
  • It comes up because of the CLT.
  • The math to use it is standardized and well-known.

What useful things can I do with the normal distribution?

Let's take an example from the insurance world. Imagine an insurance company insures house contents and cars. Now imagine the claim distribution for cars follows a normal distribution and the claims distribution for house contents also follows a normal distribution. Let's say in a typical year the claims distributions look something like this (cars on the left, houses on the right).

(The two charts look identical except for the numbers on the x and y axis. That's expected. I said before that all normal distributions are just scaled versions of the standard normal. Another way of saying this is, all normal distribution plots look the same.)

What does the distribution look like for cars plus houses?

The long-winded answer is to use convolution (or even Monte Carlo). But because the house and car distributions are normal, we can just do:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

So we can calculate the combined distribution in a heartbeat. The combined distribution looks like this (another normal distribution, just with a different mean and standard deviation).

To be clear: this only works because the two distributions were normal.

It's not just adding distributions together. The normal distribution allows for shortcuts if we're multiplying or dividing etc. The normal distribution makes things that would otherwise be hard very fast and very easy.
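Here's a sketch that checks those addition formulas by simulation (the claim distributions are made up, and the draws are assumed to be independent):

import numpy as np

rng = np.random.default_rng(0)
cars = rng.normal(2_000, 400, 1_000_000)     # illustrative car claim distribution
houses = rng.normal(5_000, 900, 1_000_000)   # illustrative house claim distribution
combined = cars + houses                     # assumes the two claim types are independent

print(round(combined.mean()))                # close to 2,000 + 5,000 = 7,000
print(round(combined.std()))                 # close to sqrt(400**2 + 900**2), about 985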

Some properties of the normal distribution

I'm not going to dig into the math here, but I am going to point out a few things about the distribution you should be aware of.

The "standard normal" distribution goes from \(-\infty\) to \(+\infty\). The further away you get from the mean, the lower the probability, and once you go several standard deviations away, the probability is quite small, but never-the-less, it's still present. Of course, you can't show \(\infty\) on a chart, so most people cut off the x-axis at some convenient point. This might give the misleading impression that there's an upper or lower x-value; there isn't. If your data has upper or lower cut-off values, be very careful modeling it using a normal distribution. In this case, you should investigate other distributions like the truncated normal.

The normal distribution models continuous variables, e.g. variables like speed or height that can have any number of decimal places (but see the my previous paragraph on \(\infty\)). However, it's often used to model discrete variables (e.g. number of sheep, number of runs scored, etc.). In practice, this is mostly OK, but again, I suggest caution.

Abuses of the normal distribution and what you can do

Because it's so widely known and so simple to use, people have used it where they really shouldn't. There's a temptation to assume the normal when you really don't know what the underlying distribution is. That can lead to disaster.

In the financial markets, people have used the normal distribution to predict day-to-day variability. The normal distribution predicts that large changes will occur with very low probability; these are often called "black swan events". However, if the distribution isn't normal, "black swan events" can occur far more frequently than the normal distribution would predict. The reality is, financial market distributions are often not normal. This creates opportunities and risks. The assumption of normality has led to bankruptcies.

Assuming normality can lead to models making weird or impossible predictions. Let's say I assume the number of units sold for a product is normally distributed. Using previous years' sales, I forecast unit sales next year to be 1,000 units with a standard deviation of 500 units. I then create a Monte Carlo model to forecast next year's profits. Can you see what can go wrong here? Monte Carlo modeling uses random numbers. In the sales forecast example, there's a 2.28% chance the model will select a negative sales number, which is clearly impossible. Given that Monte Carlo models often use tens of thousands of simulations, it's extremely likely the final calculation will have been affected by impossible numbers. This kind of mistake is insidious and hard to spot, and even experienced analysts make it.
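Here's a minimal sketch of the problem, using the forecast numbers from the example above:

import numpy as np

rng = np.random.default_rng(1)
unit_sales = rng.normal(1_000, 500, 100_000)   # naive normal model of next year's unit sales
print(round((unit_sales < 0).mean(), 4))       # roughly 0.0228: impossible negative sales numbers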

If you're a manager, you need to understand how your team has modeled data. 

  • Ask what distributions they've used to model their data. 
  • Ask them why they've used that distribution and what evidence they have that the data really is distributed that way. 
  • Ask them how they're going to check their assumptions. 
  • Most importantly, ask them if they have any detection mechanism in place to check for deviation from their expected distribution.

History - where the normal came from

Rather unsatisfactorily, there's no clear "Eureka!" moment for the discovery of the distribution, it seems to have been the accumulation of the work of a number of mathematicians. Abraham de Moivre  kicked off the process in 1733 but didn't formalize the distribution, leaving Gauss to explicitly describe it in 1801 [https://medium.com/@will.a.sundstrom/the-origins-of-the-normal-distribution-f64e1575de29].

Gauss used the normal distribution to model measurement errors and so predict the path of the asteroid Ceres [https://en.wikipedia.org/wiki/Normal_distribution#History]. This sounds a bit esoteric, but there's a point here that's still relevant. Any measurement taking process involves some form of error. Assuming no systemic bias, these errors are well-modeled by the normal distribution. So any unbiased measurement taking today (e.g opinion polling, measurements of particle mass, measurement of precision bolts, etc.) uses the normal distribution to calculate uncertainty.

In 1810, Laplace placed the normal distribution at the center of statistics by formulating the Central Limit Theorem. 

The math

The probability distribution function is given by:

\[f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e ^ {-\frac{1}{2} ( \frac{x - \mu}{\sigma}) ^ 2  }\]

\(\sigma\) is the standard deviation and \(\mu\) is the mean. In the normal distribution, the mean is the same as the mode is the same as the median.

This formula is almost impossible to work with directly, but you don't need to. There are extensive libraries that will do all the calculations for you.

Adding normally distributed parameters is easy:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

Wikipedia has an article on how to combine normally distributed quantities, e.g. addition, multiplication etc. see  https://en.wikipedia.org/wiki/Propagation_of_uncertainty.

Thursday, February 13, 2025

Why assuming independence can be very, very bad for business

Independence in probability

Why should I care about independence?

Many models in the finance industry and elsewhere assume events are independent. When this assumption fails, catastrophic losses can occur, as we saw in 2008 and 1992. The problem is that developers and data scientists assume independence because it greatly simplifies problems, but the executive team often doesn't know this has happened, or even worse, doesn't understand what it means. As a result, the company ends up being badly caught out when circumstances change and independence no longer applies.

(Sergio Boscaino from Busseto, Italy, CC BY 2.0 , via Wikimedia Commons)

In this post, I'm going to explain what independence is, why people assume it, and how it can go spectacularly wrong. I'll provide some guidance for managers so they know the right questions to ask to avoid disaster. I've pushed the math to the end, so if math isn't your thing, you can leave early and still get benefit.

What is independence?

Two events are independent if the outcome of one doesn't affect the other in any way. For example, if I throw two dice, the probability of me throwing a six on the second die isn't affected in any way by what I throw on the first die. 

Here are some examples of independent events:

  • Throwing a coin and getting a head, throwing a dice and getting a two.
  • Drawing a king from a deck of cards, winning the lottery having bought a ticket.
  • Stopping at at least one red light on my way to the store, rain falling two months from now.
By contrast, some events are not independent (they're dependent):
  • Raining today and raining tomorrow. Rain today increases the chances of rain tomorrow.
  • Heavy snow today and a football match being played. Heavy snow will cause the match to be postponed.
  • Drawing a king from a deck of cards, then without replacing the card, drawing a king on the second draw.

Why people assume independence

People assume independence because the math is a lot, lot simpler. If two events are dependent, the analyst has to figure out the relationship between them, something that can be very challenging and time consuming to do. Other than knowing there's a relationship, the analyst might have no idea what it is and there may be no literature to guide them. If you have no idea what the relationship is, it's easier to assume there's none.

Sometimes, analysts assume independence because they don't know any better. If they're not savvy about probability theory, they may do a simple internet search on combining probabilities that will suggest all they have to do is multiply probabilities, and they assume they can use this process all the time. This is assuming independence through ignorance. I believe people are making this mistake in practice because I've interviewed candidates with MS degrees in statistics who've made this kind of blunder.

Money and fear can also drive the choice to assume independence. Imagine you're an analyst. Your manager is on your back to deliver a model as soon as possible. If you assume independence, your model will be available on time and you'll get your bonus, if you don't, you won't hit your deadline and you won't get your bonus. Now imagine the bad consequences of assuming independence won't be visible for a while. What would you do?

Harder examples

Do you think the following are independent?

  • Two unrelated people in different towns defaulting on their mortgage at the same time
  • Houses in different towns suffering catastrophic damage (e.g. fire, flood, etc.)

Most of the time, these events will be independent. For example, a house burning down because of poor wiring doesn't tell you anything about the risk of a house burning down in a different town (assuming a different electrician!). But there are circumstances when the independence assumption fails:

  • A hurricane hits multiple towns at once causing widespread catastrophic damage in different insurance categories (e.g. hurricane Andrew in 1992).
  • A recession hits, causing widespread lay-offs and mortgage defaults, especially for sub-prime mortgages (2008).

Why independence fails

Prior to 1992, the insurance industry had relatively simple risk models. They assumed independence; an assumption that worked well for some time. In an average year, they knew roughly how many claims there would be for houses, cars etc. Car insurance claims were independent of house insurance claims that in turn were independent of municipal and corporate insurance claims and so on. They were independent, at least in part, because there was no common causal mechanism.

When hurricane Andrew hit Florida in 1992, it  destroyed houses, cars, schools, hospitals etc. across multiple towns. The assumption of independence just wasn't true in this case. The insurance claims were sky high and bankrupted several companies. 

(Hurricane Andrew, houses destroyed in Dade County, Miami. Image from FEMA. Source: https://commons.wikimedia.org/wiki/File:Hurricane_andrew_fema_2563.jpg)

To put it simply, the insurance computer models didn't adequately model the risk because they had independence baked in.  

Roll forward 15 years and something similar happened in the financial markets. Sub-prime mortgage lending was built on a set of assumptions, including default rates. The assumption was that mortgage defaults were independent of one another. Unfortunately, as the 2008 financial crisis hit, this was no longer valid. As more people were laid off, the economy went down, so more people were laid off. This was often called contagion, but perhaps a better analogy is the reverse of a well-known saying: "a rising tide floats all boats".


Financial Crisis Newspaper
(Image credit: Secret London 123, CC BY-SA 2.0, via Wikimedia Commons)

The assumption of independence simplified the analysis of sub-prime mortgages and gave the results that people wanted. The incentives weren't there to price in risk. Imagine your company was making money hand over fist and you stood up and warned people of the risks of assuming independence. Would you put your bonus and your future on the line to do so?

What to do - recommendations

Let's live in the real world and accept that assuming independence gets us to results that are usable by others quickly.

If you're a developer or a data scientist, you must understand the consequences of assuming independence and you must recognize that you're making that assumption.  You must also make it clear what you've done to your management.

If you're a manager, you must be aware that assuming independence can be dangerous but that it gets results quickly. You need to ask your development team about the assumptions they're making and when those assumptions fail. It also means accepting your role as a risk manager; that means investing in development to remove the independence assumption.

To get results quickly, it may well be necessary for an analyst to assume independence.  Once they've built the initial model (a proof of concept) and the money is coming in, then the task is to remove the independence assumption piece-by-piece. The mistake is to stop development.

The math

Let's say we have two events, A and B, with probabilities of occurring P(A) and P(B). 

If the events are independent, then the probability of them both occurring is:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]

This equation serves as both a definition of independence and a test of independence, as we'll see next.

Let's take two cases and see if they're independent:

  1. Rolling a dice and getting a 1 and a 2
  2. Rolling a dice and getting a (1 or 2) and (2, 4, or 6)

For case 1, here are the probabilities:
  • \(P(A) = 1/6\)
  • \(P(B) = 1/6\)
  • \(P(A  \cap B) = 0\), it's not possible to get 1 and 2 at the same time
  • \(P(A)P(B) = (1/6) \times (1/6) = 1/36\)
So the equation \(P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\) isn't true, therefore the events are not independent.

For case 2, here are the probabilities:
  • \(P(A) = 1/3\)
  • \(P(B) = 1/2\)
  • \(P(A  \cap B) = 1/6\)
  • \(P(A)P(B) = (1/2) \times (1/3) = 1/6\)
So the equation is true, therefore the events are independent.
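Here's a small sketch that checks both cases by enumerating the outcomes of a single die roll with exact fractions:

from fractions import Fraction

outcomes = set(range(1, 7))      # one roll of a fair die

def p(event):
    # Probability of an event: favorable outcomes divided by total outcomes
    return Fraction(len(event & outcomes), len(outcomes))

A, B = {1, 2}, {2, 4, 6}                   # case 2: a 1 or 2, and an even number
print(p(A & B) == p(A) * p(B))             # True: the events are independent
print(p({1} & {2}) == p({1}) * p({2}))     # False: a 1 and a 2 on the same roll are not independent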

Dependence uses conditional probability, so we have this kind of relationship:
\[P(A \ and \ B) = P(A  \cap B) = P(A | B) P(B)\]
The expression \(P(A | B)\) means the probability of A given that B has occurred (e.g the probability the game is canceled given that it's snowed). There are a number of ways to approach finding \(P(A | B)\), the most popular over the last few years has been Bayes' Theorem which states:
\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]
There's a whole methodology that goes with the Bayesian approach and I'm not going to go into it here, except to say that it's often iterative; we make an initial guess and progressively refine it in the light of new evidence. The bottom line is, this process is much, much harder and much more expensive than assuming independence. 

Monday, July 31, 2023

Essential business knowledge: the Central Limit Theorem

Knowing the Central Limit Theorem means avoiding costly mistakes

I've spoken to well-meaning analysts who've made significant mistakes because they don't understand the implications of one of the core principles of statistics: the Central Limit Theorem (CLT). These errors weren't trivial either: they affected salesperson compensation and the analysis of A/B tests. More personally, I've interviewed experienced candidates who made fundamental blunders because they didn't understand what this theorem implies.

The CLT is why the mean and standard deviation work pretty much all the time but it's also why they only work when the sample size is "big enough". It's why when you're estimating the population mean it's important to have as large a sample size as you can. It's why we use the Student's t-test for small sample sizes and why other tests are appropriate for large sample sizes. 

In this blog post, I'm going to explain what the CLT is, some of the theory behind it (at a simple level), and how it drives key business statistics. Because I'm trying to communicate some fundamental ideas, I'm going to be imprecise in my language at first and add more precision as I develop the core ideas. As a bonus, I'll throw in a different version of the CLT that has some lesser-known consequences.

How we use a few numbers to represent a lot of numbers

In all areas of life, we use one or two numbers to represent lots of numbers. For example, we talk about the average value of sales, the average number of goals scored per match, average salaries, average life expectancy, and so on. Usually, but not always, we get these numbers through some form of sampling, for example, we might run a salary survey asking thousands of people their salary, and from that data work out a mean (a sampling mean). Technically, the average is something mathematicians call a "measure of central tendency" which we'll come back to later.

We know not everyone will earn the mean salary and that in reality, salaries are spread out. We express the spread of data using the standard deviation. More technically, we use something called a confidence interval, which is based on the standard deviation. The confidence interval is a measure of how close we think our sample mean is to the true (population) mean (how confident we are).

In practice, we use standard formula for the mean and standard deviation. These are available as standard functions in spreadsheets and programming languages. Mathematically, this is how they're expressed.

\[sample\; mean\; \bar{x}= \frac{1}{N}\sum_{i=1}^{N}x_i\]

\[sample\; standard\; deviation\; s_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} {\left ( x_i - \bar{x} \right )} ^ 2 } \]
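In Python, for example (the numbers are made up; ddof=0 matches the \(1/N\) formula above, while ddof=1 gives the more common \(1/(N-1)\) version):

import numpy as np

data = np.array([52.1, 48.7, 61.3, 55.0, 49.9, 58.4])   # illustrative salary survey values ($k)
print(data.mean())               # sample mean
print(data.std(ddof=0))          # standard deviation using the 1/N formula above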

All of this seems like standard stuff, but there's a reason why it's standard, and that's the central limit theorem (CLT).

The CLT

Let's look at three different data sets with different distributions: uniform, Poisson, and power law as shown in the charts below.

These data sets are very, very different. Surely we have to have different averaging and standard deviation processes for different distributions? Because of the CLT, the answer is no. 

In the real world, we sample from populations and take an average (for example, using a salary survey), so let's do that here. To get going, let's take 100 samples from each distribution and work out a sample mean. We'll do this 10,000 times so we get some kind of estimate for how spread out our sample means are.
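Here's a minimal sketch of that sampling procedure; plotting a histogram of the means for each population reproduces the bottom charts (the population parameters are illustrative):

import numpy as np

rng = np.random.default_rng(7)
populations = {
    "uniform": rng.uniform(0, 1, 1_000_000),
    "poisson": rng.poisson(3, 1_000_000).astype(float),
    "power law": rng.pareto(3, 1_000_000),
}

for name, population in populations.items():
    # 10,000 sample means, each calculated from 100 randomly drawn values
    means = rng.choice(population, size=(10_000, 100)).mean(axis=1)
    print(name, round(means.mean(), 3), round(means.std(), 3))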

The top charts show the original population distribution and the bottom charts show the result of this sampling means process. What do you notice?

The distribution of the sampling means is a normal distribution regardless of the underlying distribution.

This is a very, very simplified version of the CLT and it has some profound consequences, the most important of which is that we can use the same averaging and standard deviation functions all the time.

Some gentle theory

Proving the CLT is very advanced and I'm not going to do that here. I am going to show you through some charts what happens as we increase the sample size.

Imagine I start with a uniform random distribution like the one below. 

I want to know the mean value, so I take some samples and work out a mean for my samples. I do this lots of times and work out a distribution for my mean. Here's what the results look like for a sample size of 2, 3,...10,...20,...30,...40. 

As the sample size gets bigger, the distribution of the means gets closer to a normal distribution. It's important to note that the width of the curve gets narrower with increasing sample size. Once the distribution is "close enough" to the normal distribution (typically, around a sample size of 30), you can use normal distribution methods like the mean and standard deviation.

The standard deviation is a measure of the width of the normal distribution. For small sample sizes, the standard deviation underestimates the width of the distribution, which has important consequences.

Of course, you can do this with almost any underlying distribution; I'm just using a uniform distribution because it's easier to show the results.

Implications for averages

The charts above show how the distribution of the means changes with sample size. At low sample sizes, there are a lot more "extreme" values as the difference between the sample sizes of 2 and 40 shows.  Bear in mind, the width of the distribution is an estimate of the uncertainty in our measurement of the mean.

For small sample sizes, the mean is a poor estimator of the "average" value; it's extremely prone to outliers as the shape of the charts above indicates. There are two choices to fix the problem: either increase the sample size to about 30 or more (which often isn't possible) or use the median instead (the median is much less prone to outliers, but it's harder to calculate).

The standard deviation (and the related confidence interval) is a measure of the uncertainty in the mean value. Once again, it's sensitive to outliers. For small sample sizes, the standard deviation is a poor estimator for the width of the distribution. There are two choices to fix the problem, either increase the sample size to 30 or more (which often isn't possible) or use quartiles instead (for example, the interquartile range, IQR).

If this sounds theoretical, let me bring things down to earth with an example. Imagine you're evaluating salesperson performance based on deals closed in a quarter. In B2B sales, it's rare for a rep to make 30 sales in a quarter, in fact, even half that number might be an outstanding achievement. With a small number of samples, the distribution is very much not normal, and as we've seen in the charts above, it's prone to outliers. So an analysis based on mean sales with a standard deviation isn't a good idea; sales data is notorious for outliers. A much better analysis is the median and IQR. This very much matters if you're using this analysis to compare rep performance.

Implications for statistical tests

A hundred years ago, there were very few large-scale tests, for example, medical tests typically involved small numbers of people. As I showed above, for small sample sizes the CLT doesn't apply. That's why Gosset developed the Student's t-distribution: the sample sizes were too small for the CLT to kick in, so he needed a rigorous analysis procedure to account for the wider-than-normal distributions. The point is, the Student's t-distribution applies when sample sizes are below about 30.

Roll forward 100 years and we're now doing retail A/B testing with tens of thousands of samples or more. In large-scale A/B tests, the z-test is a more appropriate test. Let me put this bluntly: why would you use a test specifically designed for small sample sizes when you have tens of thousands of samples?

It's not exactly wrong to use the Student's t-test for large sample sizes, it's just dumb. The special features of the Student's t-test that enable it to work with small sample sizes become irrelevant. It's a bit like using a spanner as a hammer; if you were paying someone to do construction work on your house and they were using the wrong tool for something simple, would you trust them with something complex?

I've asked about statistical tests at interview and I've been surprised at the response. Many candidates have immediately said Student's t as a knee-jerk response (which is forgivable). Many candidates didn't even know why Student's t was developed and its limitations (not forgivable for senior analytical roles). One or two even insisted that Student's t would still be a good choice even for sample sizes into the hundreds of thousands. It's very hard to progress candidates who insist on using the wrong approach even after it's been pointed out to them.

As a practical matter, you need to know what statistical tools you have available and their limitations.

Implications for sample sizes

I've blithely said that the CLT applies above a sample size of 30. For "most" distributions, a sample size of about 30 is a reasonable rule-of-thumb, but there's no theory behind it. There are cases where a sample size of 30 is insufficient. 

At the time of writing, there's a discussion on the internet about precisely this point. There's a popular article on LessWrong that illustrates how quickly convergence to the normal can happen: https://www.lesswrong.com/posts/YM6Qgiz9RT7EmeFpp/how-long-does-it-take-to-become-gaussian but there's also a counter article that talks about cases where convergence can take much longer: https://two-wrongs.com/it-takes-long-to-become-gaussian

The takeaway from this discussion is straightforward. Most of the time, using a sample size of 30 is good enough for the CLT to kick-in, but occasionally you need larger sample sizes. A good way to test this is to use larger sample sizes and see if there's any trend in the data. 

General implications

The CLT is a double-edged sword: it enables us to use the same averaging processes regardless of the underlying distribution, but it also lulls us into a false sense of security and analysts have made blunders as a result.

Any data that's been through an averaging process will tend to follow a normal distribution. For example, if you were analyzing average school test scores you should expect them to follow a normal distribution, similarly for transaction values by retail stores, and so on. I've seen data scientists claim brilliant data insights by announcing their data is normally distributed, but they got it through an averaging process, so of course it was normally distributed. 

The CLT is one of the reasons why the normal distribution is so prevalent, but it's not the only reason and of course, not all data is normally distributed. I've seen junior analysts make mistakes because they've assumed their data is normally distributed when it wasn't. 

A little more rigor

I've been deliberately loose in my description of the CLT so far so I can explain the general idea. Let's get more rigorous so we can dig into this a bit more. Let's deal with some terminology first.

Central tendency

In statistics, there's something called a "central tendency" which is a measurement that summarizes a set of data by giving a middle or central value. This central value is often called the average. More formally, there are three common measures of central tendency:

  • The mode. This is the value that occurs most often.
  • The median. Rank order the data and this is the middle value.
  • The mean. Sum up all the data and divide by the number of values.

These three measures of central tendency have different properties, different advantages, and different disadvantages. As an analyst, you should know what they are.

(Depending on where you were educated, there might be some language issues here. My American friends tell me that in the US, the term "average" is always a synonym for the mean, in Britain, the term "average" can be the mean, median, or mode but is most often the mean.)

For symmetrical distributions, like the normal distribution, the mean, median, and mode are the same, but that's not the case for non-symmetrical distributions. 

The term "central" in the central limit theorem is referring to the central or "average" value.

iid

If you were taught about the Central Limit Theorem, you were probably taught that it only applies to iid data, which means independent and identically distributed. Here's what iid means. 

  • Each sample in the data is independent of the other samples. This means selecting or removing a sample does not affect the value of another sample.
  • All the samples come from the same probability distribution.
Actually, this isn't quite true: there are versions of the CLT that apply even when the distributions are not the same (subject to some extra conditions). However, the independence requirement still holds.

When the CLT doesn't apply

Fortunately for us, the CLT applies to almost all distributions an analyst might come across, but there are exceptions. The underlying distribution must have a finite variance, which rules out using it with distributions like the Cauchy distribution. The samples must be iid as I said before.

A re-statement of the CLT

Given data that's distributed with a finite variance and is iid, if we take n samples, then:

  • as \( n \to \infty \), the sample mean converges to the population mean
  • as \( n \to \infty \), the distribution of the sample means approximates a normal distribution

Note this formulation is in terms of the mean. This version of the CLT also applies to sums because the mean is just the sum divided by a constant (the number of samples).

A different version of the CLT

There's another version of the CLT that's not well-known but does come up from time to time in more advanced analysis. The usual version of the CLT is expressed in terms of means (which is the sum divided by a constant). If instead of taking the sum of the samples, we take their product, then instead of the products tending to a normal distribution they tend to a log-normal distribution. In other words, where we have a quantity created from the product of samples then we should expect it to follow a log-normal distribution. 
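Here's a small sketch of the product version: multiply many positive random factors together and the logs of the products come out roughly normal, which is exactly what log-normal means (the factor distribution is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(3)
factors = rng.uniform(0.9, 1.1, size=(10_000, 50))   # 50 positive random factors per trial
products = factors.prod(axis=1)                      # the product version of the CLT
logs = np.log(products)                              # if the products are log-normal, their logs are normal
print(round(logs.mean(), 3), round(logs.std(), 3))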

What should I take away from all this?

Because of the CLT, the mean and standard deviation mostly work regardless of the underlying distribution. In other words, you don't have to know how your data is distributed to do basic analysis on it. BUT the CLT only kicks in above a certain sample size (which can vary with the underlying distribution but is usually around 30) and there are cases when it doesn't apply. 

You should know what to do when you have a small sample size and know what to watch out for when you're relying on the CLT.

You should also understand that any process that sums (or multiplies) data will tend to produce a normal (or log-normal) distribution.