An unfortunate phrase with unfortunate consequences
"Regression to the mean" is a simple idea that has profound consequences. It's led people astray for decades, if not centuries. I'm going to explain what it is, the consequences of not understanding it, and what you can do to protect yourself and your organization.
Let's give a simple definition for now: it's the tendency, when sampling data, for more extreme values to be followed by values closer to the mean. Here's an example: if I give the same children IQ tests over time, I'll see very high scores followed by more average scores, and some very low scores followed by more average scores. It doesn't mean the children are improving or getting worse; it's just regression to the mean. The problems occur when people attach a deeper meaning, as we'll see.
What it means - simple examples
I'm going to start with an easy example that everyone should be familiar with: a simple game with a pack of cards.
- Take a standard pack of playing cards and label the cards in each suit 1 to 13 (Ace is 1, 2 is 2, Jack is 11, etc.). The mean card value is 7.
- Draw a card at random.
- Imagine it's a Queen (12). Now, replace the card and draw another card. Is it likely the card will have a lower value or a higher value?
- The probability it will have a lower value is 11/13.
- Now imagine you drew an ace (1), replace the card and draw again.
- The probability of drawing another ace is 1/13.
- The probability of drawing a 2 or higher is 12/13.
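If you want to check these numbers for yourself, here's a minimal simulation sketch in Python. The card labels and the draw-with-replacement rule are taken from the game above; everything else is just illustration.

```python
import random

# A standard pack: values 1 (Ace) to 13 (King), four suits of each.
pack = list(range(1, 14)) * 4

def draw():
    """Draw one card at random (we replace the card, so draws are independent)."""
    return random.choice(pack)

trials = 100_000

# After drawing a Queen (12), how often is the next card lower?
lower_than_queen = sum(draw() < 12 for _ in range(trials)) / trials
print(f"P(next card < 12) ~ {lower_than_queen:.3f} (exact: {11/13:.3f})")

# After drawing an Ace (1), how often is the next card a 2 or higher?
two_or_higher = sum(draw() >= 2 for _ in range(trials)) / trials
print(f"P(next card >= 2) ~ {two_or_higher:.3f} (exact: {12/13:.3f})")
```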
The cards example seems simple and obvious. Playing cards are very familiar and we're comfortable with randomness (in fact, almost all card games rely on randomness). The problem occurs when we have real measurements: we tend to attach explanations to the data when randomness (and regression to the mean) is all that's there.
Let's say we're measuring the average speed of cars on a freeway. Here are 100 measurements of car speeds. What would you conclude about the freeway? What pattern can you see in the data and what does it tell you about driver behavior (e.g. lower speeds following higher speeds and vice versa)? What might cause it?
['46.7', '63.3', '80.0', '71.7', '34.2', '55.0', '67.5', '34.2', '67.5', '67.5', '59.2', '63.3', '55.0', '34.2', '63.3', '63.3', '63.3', '59.2', '75.8', '71.7', '42.5', '42.5', '34.2', '34.2', '59.2', '67.5', '59.2', '71.7', '71.7', '67.5', '50.8', '63.3', '34.2', '63.3', '30.0', '38.3', '50.8', '34.2', '75.8', '75.8', '46.7', '80.0', '55.0', '46.7', '38.3', '38.3', '75.8', '59.2', '34.2', '42.5', '71.7', '71.7', '80.0', '80.0', '71.7', '34.2', '63.3', '71.7', '46.7', '42.5', '46.7', '46.7', '63.3', '80.0', '80.0', '38.3', '38.3', '46.7', '38.3', '34.2', '46.7', '75.8', '55.0', '30.0', '55.0', '75.8', '30.0', '42.5', '67.5', '30.0', '50.8', '67.5', '67.5', '71.7', '67.5', '67.5', '42.5', '75.8', '75.8', '34.2', '55.0', '50.8', '38.3', '71.7', '46.7', '71.7', '50.8', '71.7', '42.5', '42.5']
Let's imagine the authorities introduced a speed camera at the measurement I've indicated in red. What might you conclude about the effect of the speed camera?
You shouldn't conclude anything at all from this data. It's entirely random. In fact, it has the same probability distribution as the pack of cards example. I've used 13 different average speeds, each with the same probability of occurrence. What you're seeing is the result of me drawing cards from a pack and relabeling them with floating point speeds like 71.7 instead of card values like 9. The speed camera had no effect in this case. The data set shows regression to the mean and nothing more.
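For the curious, data like this is easy to generate. The sketch below is one way to do it: draw a card value from 1 to 13 and relabel it as a speed. The particular linear mapping is an assumption for illustration, but it reproduces the same 13 equally likely values as the list above.

```python
import random

def simulated_speed():
    """Draw a 'card' value from 1 to 13 and relabel it as a speed between 30 and 80 mph."""
    card = random.randint(1, 13)               # uniform, just like drawing from the pack
    return round(30 + (card - 1) * 50 / 12, 1)  # linear relabeling, one decimal place

speeds = [simulated_speed() for _ in range(100)]
print(speeds)  # uniform random noise dressed up as freeway measurements
```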
The pack of cards and the freeway example are exactly the same. In the pack of cards case, we understand the randomness and can intuitively see what regression to the mean actually means. Once we have a real-world problem, like the cars on the freeway, our tendency is to look for explanations that aren't there and to discount randomness. Looking for meaning in random data has had bad consequences, as we'll see.
Schools example
In the last few decades in the US, several states have introduced standardized testing to measure school performance. Students in the same year group take the same test and, based on the results, the state draws conclusions about the relative standing of schools; it may intervene in low performing schools. The question is, how do we measure the success of these interventions? Surely, we would expect to see an improvement in test scores taken the next year? In reality, it's not so simple.
The average test result for a group of students will obviously depend on things like teaching, prior attainment, and so on. But there are also random factors at work. Individual students might perform better or worse than expected due to sickness, family problems, or a host of other random factors. Of course, different year groups in the same school might have a different mix of abilities. All of which means that regression to the mean should show up in consecutive tests. In other words, low-performing schools might show an improvement and high-performing schools might show a degradation entirely due to random factors.
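To make this concrete, here's a small simulation sketch. The model and the numbers are invented for illustration: each school has a fixed underlying quality, each year's score adds independent random noise, and we flag the bottom 10% of schools after year one.

```python
import random

random.seed(1)
n_schools = 1000

# Each school has a stable underlying quality; each year's score adds random noise.
quality = [random.gauss(500, 30) for _ in range(n_schools)]
year1 = [q + random.gauss(0, 40) for q in quality]
year2 = [q + random.gauss(0, 40) for q in quality]

# "Intervene" in the 10% of schools with the lowest year-one scores.
cutoff = sorted(year1)[n_schools // 10]
flagged = [i for i in range(n_schools) if year1[i] <= cutoff]

avg_change = sum(year2[i] - year1[i] for i in flagged) / len(flagged)
print(f"Average year-on-year 'improvement' of flagged schools: {avg_change:+.1f} points")
# The flagged schools improve on average even though nothing about them changed.
```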
This isn't a theoretical example: regression to the mean has been clearly shown in school scores in Massachusetts, California and in other states (see Haney, Smith & Smith). Sadly, state politicians and civil servants have intervened based on scores and drawn conclusions where they shouldn't.
Children's education evokes a lot of emotion and political interest, which is not a good mix. It's important to understand concepts like regression to the mean so we can better understand what's really going on.
Heights example
"Regression to the mean" was originally called "regression to mediocrity", and was based on the study of human heights. If regression to mediocrity sounds very disturbing, it should do. It's closely tied to eugenics through Francis Galton. I'm not going to dwell on the links between statistics and eugenics here, but you should know the origins of statistics aren't sin free.
In 1880s England, Galton studied the heights of parents and their children. He found that parents who were above average height tended to have children closer to the average height, and that parents below average height also tended to have children closer to the average height. This is the classic regression to the mean example.
Think for a moment about the possible outcomes of a study like this. If taller parents had taller children, and shorter parents had shorter children, then we might expect to see two population groups emerging (short people and tall people) and maybe the start of speciation. Conversely, if tall parents had short children, and short parents had tall children, this would be very noticeable and widely commented on. Regression to the mean turns out to be a good explanation of what we observe in nature.
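If you want to see the mechanism at work, here's a hedged simulation sketch. It treats a child's height as partly inherited and partly random; the numbers (and the simplification of using a single parental height rather than Galton's "mid-parent" height) are assumptions for illustration only.

```python
import random

random.seed(2)
mean_height, spread = 68.0, 2.5   # inches; illustrative values only
pass_on = 0.6                     # fraction of the parents' deviation inherited (assumed)

parents = [random.gauss(mean_height, spread) for _ in range(10_000)]
children = [mean_height + pass_on * (p - mean_height) + random.gauss(0, 2.0)
            for p in parents]

tall = [c - p for p, c in zip(parents, children) if p > mean_height + spread]
short = [c - p for p, c in zip(parents, children) if p < mean_height - spread]
print(f"Tall parents:  children average {sum(tall)/len(tall):+.1f} inches vs. their parents")
print(f"Short parents: children average {sum(short)/len(short):+.1f} inches vs. their parents")
# Children of unusually tall parents are shorter than their parents on average,
# and children of unusually short parents are taller: regression toward the mean.
```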
Galton's height study was very influential for both the study of genetics and the creation of statistics as a discipline.
New sports players
Let's take a cohort of baseball players in their first season. Obviously, talent makes a difference, but there are random factors at play too. We might expect some players to do extremely well, others to do well, some to do OK, some to do poorly, and some to do very poorly. Regression to the mean tells us that some standout players may well perform worse the next year. Other, lower-ranked players will perform better for the same reason. The phenomenon of outstanding new players performing worse in their second year is often called the "sophomore slump" and a lot has been written about it, but in reality, it can mostly be explained by regression to the mean.
You can read more about regression to the mean in sports here:
- https://optimumsportsperformance.com/blog/regression-to-the-mean-in-sports/
- https://psycnet.apa.org/record/1995-19546-001
Business books
Popular business books often fall into the regression to the mean trap. Here's what happens. A couple of authors do an analysis of top performing businesses, usually measured by stock price, and find some commonalities. They develop these commonalities into a framework and write a best-selling business book whose thesis is, if you follow the framework, you'll be successful. They follow this with another book that's not quite as good. Then they write a third book that only the true believers read.
Unfortunately, the companies they select as winners don't do as well over a decade or more, and the longer the timescale, the worse the performance. Over the long run, the authors' promise that they've found the elixir of success turns out not to be true. Their books go from the best seller list to the remainder bucket.
A company's stock price is determined by many factors, for example, its competitors, the state of the market, and so on. Only some of them are under the control of the company. Conditions change over time in unpredictable ways. Regression to the mean suggests that great stock price performers now might not be in the future, and low performers may do better. Regression to the mean neatly explains why picking winners today does not mean the same companies will be winners in the years to come. In other words, basic statistics makes a mockery of many business books.
Reading more:
- The Halo Effect: . . . and the Eight Other Business Delusions That Deceive Managers - Phil Rosenzweig
My experience
I've seen regression to the mean pop up in all kinds of business data sets and I've seen people make the classic mistake of trying to derive meaning from randomness. Here are some examples.
Sales data has a lot of random fluctuations, and of course, the smaller the sample, the greater the fluctuations. I've seen salespeople have a standout year followed by a very average year, and vice versa. I've seen the same pattern at the regional and country level too. Unfortunately, I've also seen analysts tie themselves in knots trying to explain these patterns. Even worse, they've made foolish predictions based on small sample sets and just a few years' worth of data.
I've seen highly educated people get very excited by changes in company assessment data. They think they've spotted something significant because companies that performed well one year tended to perform a bit worse the next, and so on. Regression to the mean explained all the data.
How not to be fooled
Regression to the mean is hidden in lots of data sets and can lead you into making poor decisions. If you're analyzing a dataset, here are some questions to ask:
- Is your data the result of some kind of sampling process?
- Does randomness play a part in your collection process or in the data?
- Are there unknowns that might influence your data?
If the answer to any of these questions is yes, you should assume you'll find regression to the mean in your dataset. Be careful about your analysis and especially careful about explaining trends. Of course, the smaller your data set, the more vulnerable you are.
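One practical sanity check, sketched below (an illustration, not a complete methodology): compare the improvement you observe in your "worst performers" with the improvement you'd see if the follow-up measurements were shuffled, i.e. if regression to the mean were the only thing going on.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bottom_group_change(before, after, bottom_frac=0.2):
    """Average change for the units that were in the bottom fraction of 'before'."""
    cutoff = sorted(before)[int(len(before) * bottom_frac)]
    return mean([b - a for a, b in zip(before, after) if a <= cutoff])

# Your real paired before/after measurements would go here;
# this is just illustrative noise with no real effect at all.
random.seed(3)
before = [random.gauss(100, 15) for _ in range(500)]
after = [random.gauss(100, 15) for _ in range(500)]

observed = bottom_group_change(before, after)

# Null model: shuffle the 'after' values so any real before/after link is destroyed,
# leaving only the improvement you'd expect from regression to the mean.
null = []
for _ in range(1000):
    shuffled = after[:]
    random.shuffle(shuffled)
    null.append(bottom_group_change(before, shuffled))

print(f"Observed 'improvement' of the bottom group: {observed:+.1f}")
print(f"Improvement expected from noise alone:      {mean(null):+.1f}")
# If these two numbers look alike, regression to the mean explains your result
# and there's no need for a deeper story about the intervention.
```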