
Thursday, November 6, 2025

How to get data analysis very wrong: sample size effects

We're not reading the data right

In the real world, we’re under pressure to get results from data analysis. Sometimes, the pressure to deliver certainty means we forget some of the basics of analysis. In this blog post, I’m going to talk about one pitfall that can cause you to give wildly wrong answers. I’ll start with an example.

School size - smaller schools are better?

You’ve probably heard the statement that “small schools produce better results than large schools”. Small-school advocates point out that small schools disproportionately appear in the top-performing group in an area. It sounds like small schools are the way to go, but are they? It’s also true that small schools disproportionately appear among the worst schools in an area. So which is it: are small schools better or worse?

The answer is: both. Small schools have a higher variation in results because they have fewer students. The results are largely due to “statistical noise” [1].

We can easily see the effects of sample size “statistical noise”, more properly called variance, in a very simple example. Imagine tossing a coin and scoring heads as 1 and tails as 0. You would expect the mean over many tosses to be close to 0.5, but how many tosses do you have to do? I wrote a simple program to simulate tossing a coin and I summed up the results as I went along. The charts below show four simulations. The x axis of each chart is the number of tosses, the y axis is the running mean, the blue line is the simulation, and the red dotted line is 0.5.
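A minimal sketch of this kind of simulation, using NumPy and Matplotlib, might look something like this (the toss count and plotting details are illustrative rather than the exact code I ran):

```python
# A sketch of the coin toss simulation, assuming NumPy and Matplotlib;
# the original program isn't shown, so details here are illustrative.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
tosses = 10_000  # assumed toss count

fig, axes = plt.subplots(2, 2, figsize=(10, 6), sharey=True)
for ax in axes.flat:
    coin = rng.integers(0, 2, size=tosses)               # heads = 1, tails = 0
    running_mean = np.cumsum(coin) / np.arange(1, tosses + 1)
    ax.plot(running_mean, color="blue")                  # the simulation
    ax.axhline(0.5, color="red", linestyle=":")          # the expected value
    ax.set_xlabel("Number of tosses")
    ax.set_ylabel("Running mean")
plt.tight_layout()
plt.show()
```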

The charts clearly show higher variance at low numbers of tosses. It takes a surprisingly large number of tosses for the mean to get close to 0.5. If we want more certainty, and less variance, we need bigger sample sizes.

We can repeat the experiment, but this time with a six-sided die, and record the running mean. We’d see the same result: more variance for shorter simulations. Let’s try a more interesting example (you’ll see why in a minute). Let’s imagine a 100-sided die and run the experiment multiple times, recording the mean result after each simulation (I’ve shown a few runs here).

Let’s change the terminology a bit here. A roll of the 100-sided die is a student’s percentage test result. Each student rolls the die: if there are 100 students in a school, there are 100 die rolls; if there are 1,500 students in the school, we roll the die 1,500 times. We now have a simulation of school test results and the effect of school size.

I simulated 500 schools with 500 to 1,500 students. Here are the results.
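A minimal sketch of the simulation might look something like this, with each student's score a roll of a 100-sided die and each school's result the mean of its students' scores (the school count and size range follow the text, the rest is illustrative):

```python
# A sketch of the school simulation: each student's score is a roll of a
# 100-sided die, each school's result is the mean of its students' scores.
# School count and size range follow the text; the rest is assumed.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
n_schools = 500

sizes = rng.integers(500, 1_501, size=n_schools)               # students per school
means = [rng.integers(1, 101, size=n).mean() for n in sizes]   # mean "test score"

plt.scatter(sizes, means, s=10)
plt.xlabel("School size (number of students)")
plt.ylabel("Mean test score")
plt.show()
```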

As you can see, there’s more variance for smaller schools than for larger schools. This neatly explains why smaller schools are both the best in an area and the worst.

You might object to the simplicity of my analysis: surely real school results don't look like this. What does real-world data show? Wainer [1] did the work and got the real results (read his paper for more details). Here's a screenshot from his paper showing real-world school results. It looks a lot like my simple-minded simulation.

Sample size variation is not the full explanation for school results, but it is a factor. Any analysis has to take it into account. Problems occur because of simple (wrong) analysis and overly-simple conclusions.

The law of large numbers

The effect that variance goes down with increasing sample size is known as the law of large numbers. It’s widely taught and there’s a lot written about it online. Unfortunately, most of the discussions get lost in the weeds very quickly. These two references do a very good job of explaining what’s going on: [1] [2].

The law of large numbers has a substantial body of mathematical theory behind it. It has an informal counterpart, which is a bit easier to understand, called the law of small numbers: there’s more variance in small samples than in large ones. Problems occur because people assume that small samples behave in the same way as larger samples (for example, that small school results have the same variance as large school results).

So far, this sounds simple and obvious, but in reality, most data analysts aren’t fully aware of the effect of sample size. It doesn’t help that the language used in the real world doesn’t conform to the language used in the classroom.

Small sales territories are the best?

Let’s imagine you were given some sales data on rep performance for an American company and you were asked to find factors that led to better performance.

Most territories have about 15-20 reps, with a handful having 5 or fewer reps. The top-10 leader board for the end of the year shows you that the reps from the smaller territories are doing disproportionately well. The sales VP is considering changing her sales organization to create smaller territories, and she wants you to confirm what she’s seen in the data. Should she re-organize into smaller territories to get better results?

Obviously, I’ve prepped you with the answer, but if I hadn’t, would you have concluded smaller territories are the way to go?
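A quick simulation would have given the game away. In the hedged sketch below, every rep draws sales from the same distribution, so any territory-level difference is pure sample-size noise (the territory counts, sizes, and sales figures are made up for illustration):

```python
# A sketch of the territory example: every rep draws sales from the same
# distribution, so any territory-level difference is pure sample-size noise.
# Territory counts, sizes, and the sales distribution are assumptions.
import numpy as np

rng = np.random.default_rng()
territory_sizes = rng.integers(15, 21, size=45).tolist() + [5, 4, 5, 3, 5]

# Mean sales per rep for each territory, identical underlying distribution.
territory_means = [rng.normal(loc=100_000, scale=25_000, size=n).mean()
                   for n in territory_sizes]

# Rank territories by mean rep performance and look at the top 10.
ranked = sorted(zip(territory_means, territory_sizes), reverse=True)
print("Top 10 territories (mean sales per rep, territory size):")
for mean, size in ranked[:10]:
    print(f"  {mean:>10,.0f}  {size} reps")
```

Run it a few times: small territories keep floating to the top (and the bottom) of the leaderboard, even though no territory is genuinely better.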

Rural lives are healthier

Now imagine you’re an analyst at a health insurance company in the US. You’ve come across data on the prevalence of kidney cancer by US county. You’ve found that the lowest prevalence is in rural counties. Should you set company policy based on this data? It seems obvious that the rural lifestyle is healthier. Should health insurance premiums include a rural/urban cost difference?

I’ve taken this example from the paper by Wainer [1]. As you might have guessed, rural counties have both the lowest and the highest rates of kidney cancer because their populations are small, so the law of small numbers kicks in. I’ve reproduced Wainer’s chart here: the x axis is county population and the y-axis is cancer rate, see his paper for more about the chart. It’s a really great example of the effect of sample size on variance.

A/B test hell

Let’s take a more subtle example. You’re running an A/B test that’s inconclusive. The results are really important to the company. The CMO is telling everyone that all the company needs to do is run the test for a bit longer. You are the analyst and you’ve been asked if running more tests is the solution. What do you say?

The only time it's worth running the test a bit longer is if the test is on the verge of significance. Other than that, it's probably not worth it. Van Belle's book [3] has a nice chapter on sample size calculations, which you can access for free online [4]. The bottom line is: the smaller the effect, the larger the sample size you need for significance, and the relationship isn't linear. I've seen A/B tests that would have to run for over a year to reach significance.
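If you've never done one, here's a minimal sketch of a sample size and duration estimate for a two-proportion A/B test. It uses the standard normal-approximation formula rather than van Belle's rules of thumb, and the baseline rate, lift, and traffic figures are illustrative assumptions:

```python
# A sketch of a sample size / duration estimate for a two-proportion A/B test,
# using the standard normal-approximation formula (not van Belle's rules of
# thumb). The baseline rate, lift, and traffic numbers are assumptions.
from scipy.stats import norm

def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate visitors per arm for a two-sided two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return (z_alpha + z_beta) ** 2 * variance / effect ** 2

n = sample_size_per_arm(0.020, 0.021)   # a 5% relative lift on a 2% conversion rate
daily_visitors_per_arm = 1_000          # assumed traffic
print(f"~{n:,.0f} visitors per arm, ~{n / daily_visitors_per_arm:,.0f} days")
```

With these numbers you need roughly 315,000 visitors per arm, which at 1,000 visitors per arm per day is the best part of a year.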

Surprisingly, I've seen analysts who don't know how to do a sample size/duration estimate for an A/B test. That really isn't a good place to be when the business is relying on you for answers.

The missing math

Because I’m aiming for a more general audience, I’ve been careful here not to include equations. If you’re an analyst, you need to know:

  • What variance is and how to calculate it.
  • How sample size can affect results - you need to look for it everywhere.
  • How to estimate how much of what you're seeing is due to sample size effects and how much due to something "real".
Unfortunately, references for the law of large numbers get overly technical overly quickly. A good place to start is references that cover variance and standard deviation calculations. I like reference [5], but be aware it is technical.
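To make the first and third bullet points concrete before you dive into the references, here's a minimal sketch of the key relationship: the standard deviation of a sample mean falls as sigma/sqrt(n), which is de Moivre's equation and the subject of Wainer's paper [1]. Reusing the 100-sided die here is just a convenience; any distribution would do:

```python
# A sketch of the key relationship: the standard deviation of a sample mean
# falls as sigma / sqrt(n) (de Moivre's equation, the subject of Wainer's
# paper). The 100-sided die is reused from the school example; any
# distribution would do.
import numpy as np

rng = np.random.default_rng()
population_sd = np.sqrt(np.var(np.arange(1, 101)))   # sd of a 100-sided die

for n in (10, 100, 1_000, 10_000):
    # Draw many samples of size n and see how much their means wander.
    sample_means = [rng.integers(1, 101, size=n).mean() for _ in range(2_000)]
    print(f"n={n:>6}: sd of sample means = {np.std(sample_means):.3f}, "
          f"theory (sigma/sqrt(n)) = {population_sd / np.sqrt(n):.3f}")
```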

The bottom line

The law of large numbers can be hidden in data; the language used and the data presentation can all confuse what’s going on. You need to be acutely aware of sample size effects: you need to know how to calculate them and how they can manifest themselves in data in surprising ways.

References

[1] Howard Wainer, “The Most Dangerous Equation”, https://www.americanscientist.org/article/the-most-dangerous-equation

[2] Jeremy Orloff, Jonathan Bloom, “Central Limit Theorem and the Law of Large Numbers”, https://math.mit.edu/~dav/05.dir/class6-prep.pdf 

[3] Gerald van Belle, "Statistical Rules of Thumb", http://www.vanbelle.org/struts.htm

[4] Gerald van Belle, "Statistical Rules of Thumb", Chapter 2 - Sample Size, http://www.vanbelle.org/chapters/webchapter2.pdf

[5] Steven Miller, "The Probability Lifesaver"




Saturday, September 6, 2025

Old & experienced vs. young and energetic: mean age in English football

Which is better, youth or experience?

Professional sports are pretty much a young person's game and English football is no exception; it's rare to see players over 30. One notable example is Mark Howard, a goalkeeper for Wrexham up to 2025, who was 38 at the end of his contract. His advanced age earned him the nickname "Jurassic Mark". He carried on playing as long as he did because his experience gave him an edge.

Given all teams are youthful, is it better to have an older team (guided by experience) or a younger team (the energy of youth)? Which type of team might score more goals? I'm going to explore this issue in this blog post.


The data

I've taken the data for this blog post from Transfermarkt (https://www.transfermarkt.com/), which has data on the mean age of English football clubs at the start of each season. Obviously, transfers etc. change the mean age during the season, but it's a reasonable place to start.

The charts

Here's a chart showing total goals for, goals against, and goal difference per club per season, for each league, against mean team age at the start of the season. I've added a linear fit to the data so you can see the trends and I've included a 95% confidence band around the fit. The r² value is in the chart title, as is the p-value.
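The fit behind each chart is ordinary least squares. Here's a hedged sketch of how the r² and p-value in the chart title might be computed (the DataFrame and column names below are placeholders, not the real data set):

```python
# A sketch of the fit behind each chart: ordinary least squares of goals
# against mean age, with the r-squared and p-value used in the chart title.
# The DataFrame and its column names are placeholders, not the real data.
import pandas as pd
from scipy.stats import linregress

df = pd.DataFrame({
    "mean_age": [24.1, 25.3, 26.0, 27.2, 25.8],   # placeholder values
    "goals_for": [68, 55, 49, 61, 52],
})

fit = linregress(df["mean_age"], df["goals_for"])
print(f"Goals for vs mean age: r²={fit.rvalue ** 2:.2f}, p={fit.pvalue:.3f}")
print(f"Fitted line: goals = {fit.slope:.1f} * age + {fit.intercept:.1f}")
```

The 95% confidence band takes a little more work than linregress gives you; something like statsmodels can produce it from the same fit.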

The charts are interactive, you can:

  • Zoom in and out of the data using the menu on the left.
  • Save the charts to disk using the menu on the left.
  • See the data point values by hovering your mouse over them.
  • Select the league tier using the buttons.
  • Select the season using the slider.


What the charts show

There's some correlation between goals and mean team age, but it isn't very strong. 

For the Premier League, there is a consistent pattern over the years that younger teams do better, but it's a small effect, really something that's second-order at best.

For the lower leagues, again, there's an effect, but it's smaller and less consistent.

One thing that did surprise me was the consistency of the mean age ranges across leagues and across time. I would have thought that lower leagues might have more players towards the end of their careers (slower and cheaper) or possibly more younger players (inexperienced and cheaper) and that might skew the club mean age older or younger. That doesn't seem to be the case. It's possible lower leagues have a different club age makeup from the Premier League, but I can't get at that from this data set.

What does it mean?

A player might have ten years (ages 20-30) in the top flight if they're lucky, which suggests 25 is mid-career for most of them. At some point, they'll have an optimal balance between experience and youth, but that's unlikely to be at the beginning or end of their career. A similar argument might apply to teams as a whole. If there's any truth to this argument, then some form of triangular fit would be better than a straight linear fit. Even with the linear fit, we can see there is some relationship between goals and mean age, albeit a very weak one.
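For what it's worth, a triangular fit isn't hard to sketch: a piecewise-linear curve that rises to a peak age and falls after it. The data below is a placeholder and the functional form is just one way of expressing the idea:

```python
# A sketch of a "triangular" fit: goals rise towards a peak mean age and fall
# after it. The data here is a placeholder; the peak and slopes come out of
# the fit.
import numpy as np
from scipy.optimize import curve_fit

def triangular(age, peak_age, peak_goals, slope_up, slope_down):
    """Piecewise-linear curve that peaks at peak_age."""
    return np.where(age <= peak_age,
                    peak_goals - slope_up * (peak_age - age),
                    peak_goals - slope_down * (age - peak_age))

ages = np.array([23.5, 24.2, 24.8, 25.5, 26.1, 26.9, 27.6])   # placeholder
goals = np.array([48, 55, 61, 64, 60, 53, 47])                # placeholder

params, _ = curve_fit(triangular, ages, goals, p0=[25.5, 60.0, 5.0, 5.0])
print(f"Fitted peak at mean age {params[0]:.1f}")
```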

I'm looking for features that help predict team success. Club mean age seems like it would be a good second-order one.