Thursday, November 6, 2025

How to get data analysis very wrong: sample size effects

We're not reading the data right

In the real world, we’re under pressure to get results from data analysis. Sometimes, the pressure to deliver certainty means we forget some of the basics of analysis. In this blog post, I’m going to talk about one pitfall that can lead you to give wildly wrong answers. I’ll start with an example.

School size - smaller schools are better?

You’ve probably heard the statement that “small schools produce better results than large schools”. Small-school advocates point out that small schools appear disproportionately often among the top-performing schools in an area. It sounds like small schools are the way to go. Or are they? It’s also true that small schools appear disproportionately often among the worst schools in an area. So which is it: are small schools better or worse?

The answer is: both. Small schools have a higher variation in results because they have fewer students. The results are largely due to “statistical noise” [1].

We can easily see the effects of sample size on “statistical noise”, more properly called variance, in a very simple example. Imagine tossing a coin and scoring heads as 1 and tails as 0. You would expect the mean over many tosses to be close to 0.5, but how many tosses do you have to do? I wrote a simple program to simulate tossing a coin, tracking the running mean as I went along. The charts below show four simulations. The x-axis of each chart is the number of tosses, the y-axis is the running mean, the blue line is the simulation, and the red dotted line is 0.5.
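If you want to reproduce the experiment, something like the following sketch will do it; my original code isn't shown here, so treat the run length and plotting details as arbitrary choices.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
n_tosses = 2000  # arbitrary run length

fig, axes = plt.subplots(2, 2, figsize=(10, 6), sharey=True)
for ax in axes.flat:
    tosses = rng.integers(0, 2, size=n_tosses)                     # heads = 1, tails = 0
    running_mean = np.cumsum(tosses) / np.arange(1, n_tosses + 1)  # mean so far after each toss
    ax.plot(running_mean, color="blue")                            # the simulation
    ax.axhline(0.5, color="red", linestyle=":")                    # the expected value
    ax.set_xlabel("number of tosses")
    ax.set_ylabel("running mean")
plt.tight_layout()
plt.show()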

The charts clearly show higher variance at low numbers of tosses. It takes a surprisingly large number of tosses for the mean to get close to 0.5. If we want more certainty, and less variance, we need bigger sample sizes.

We can repeat the experiment, but this time with a six-sided die, and record the running mean. We’d see the same result: more variance for shorter simulations. Let’s try a more interesting example (you’ll see why in a minute). Let’s imagine a 100-sided die and run the experiment multiple times, recording the mean result after each simulation (I’ve shown a few runs here).

Let’s change the terminology a bit here. A roll of the 100-sided die is a percentage test score, and each student rolls the die once. If there are 100 students in a school, there are 100 die rolls; if there are 1,500 students in the school, we roll the die 1,500 times. We now have a simulation of school test results and the effect of school size.

I simulated 500 schools with 500 to 1,500 students. Here are the results.
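Here’s a minimal sketch of that simulation; the plotting details are illustrative choices and may differ from the code that produced my chart.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
n_schools = 500

# Each school gets a random size between 500 and 1,500 students;
# each student's score is one roll of a 100-sided die (1-100).
sizes = rng.integers(500, 1501, size=n_schools)
mean_scores = [rng.integers(1, 101, size=n).mean() for n in sizes]

plt.scatter(sizes, mean_scores, s=10)
plt.axhline(50.5, color="red", linestyle=":")  # expected mean score of a 1-100 die
plt.xlabel("school size (number of students)")
plt.ylabel("mean test score")
plt.show()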

As you can see, there’s more variance for smaller schools than for larger schools. This neatly explains why smaller schools are both the best in an area and the worst.

You might object to the simplicity of my analysis: surely real school results don't look like this. What does real-world data show? Wainer [1] did the work and got the real results (read his paper for more details). Here's a screenshot from his paper showing real-world school results. It looks a lot like my simple-minded simulation.

Sample size variation is not the full explanation for school results, but it is a factor, and any analysis has to take it into account. Problems occur when a simplistic (and wrong) analysis leads to an overly simple conclusion.

The law of large numbers

The fact that the variance of the sample mean shrinks as the sample size increases, so the mean settles toward its expected value, is known as the law of large numbers. It’s widely taught and there’s a lot written about it online. Unfortunately, most of the discussions get lost in the weeds very quickly. These two references do a very good job of explaining what’s going on: [1] [2].

The law of large numbers has a substantial body of mathematical theory behind it. It has an informal counterpart that's a bit easier to understand, the law of small numbers, which says that there’s more variance in small samples than in large ones. Problems occur because people assume that small samples behave in the same way as large samples (for example, that small-school results have the same variance as large-school results).

So far, this sounds simple and obvious, but in reality, many data analysts aren’t fully aware of the effects of sample size. It doesn’t help that the language used in the real world doesn’t match the language used in the classroom.

Small sales territories are the best?

Let’s imagine you were given some sales data on rep performance for an American company and you were asked to find factors that led to better performance.

Most territories have about 15-20 reps, with a handful having five or fewer reps. The top-10 leaderboard for the end of the year shows you that the reps from the smaller territories are doing disproportionately well. The sales VP is considering changing her sales organization to create smaller territories, and she wants you to confirm what she’s seen in the data. Should she re-organize into smaller territories to get better results?

Obviously, I’ve prepped you with the answer, but if I hadn’t, would you have concluded that smaller territories are the way to go?

Rural lives are healthier

Now imagine you’re an analyst at a health insurance company in the US. You’ve come across data on the prevalence of kidney cancer by US county. You’ve found that the lowest prevalence is in rural counties. Should you set company policy based on this data? It seems obvious that the rural lifestyle is healthier. Should health insurance premiums include a rural/urban cost difference?

I’ve taken this example from the paper by Wainer [1]. As you might have guessed, rural counties have both the lowest and the highest rates of kidney cancer because their populations are small, so the law of small numbers kicks in. I’ve reproduced Wainer’s chart here: the x axis is county population and the y-axis is cancer rate, see his paper for more about the chart. It’s a really great example of the effect of sample size on variance.
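If you want to convince yourself that small populations alone can produce this pattern, here's a toy simulation in which every county has exactly the same underlying rate; the rate and population range below are made up for illustration, not real epidemiological figures.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
n_counties = 3000
true_rate = 1e-4  # made-up underlying rate, identical for every county

populations = rng.integers(1_000, 1_000_000, size=n_counties)
cases = rng.binomial(populations, true_rate)  # case counts vary only because of chance
observed_rates = cases / populations

plt.scatter(populations, observed_rates, s=5, alpha=0.4)
plt.axhline(true_rate, color="red", linestyle=":")
plt.xscale("log")
plt.xlabel("county population (log scale)")
plt.ylabel("observed rate")
plt.show()

The smallest counties produce both the highest and the lowest observed rates, even though nothing varies except population size.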

A/B test hell

Let’s take a more subtle example. You’re running an A/B test that’s inconclusive. The results are really important to the company. The CMO is telling everyone that all the company needs to do is run the test for a bit longer. You are the analyst and you’ve been asked if running more tests is the solution. What do you say?

The only time it's worth running the test a bit longer is if the test is on the verge of significance. Other than that, it's probably not worth it. Van Belle's book [3] has a nice chapter on sample size calculations that you can access for free online [4]. The bottom line is: the smaller the effect, the larger the sample size you need for significance, and the relationship isn't linear. I've seen A/B tests that would have to run for over a year to reach significance.
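If you've never done a sample size estimate, here's a rough back-of-the-envelope version using the standard normal-approximation formula for comparing two proportions; the baseline conversion rate, lift, significance level, and power below are illustrative assumptions, not recommendations.

from scipy.stats import norm

def samples_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate number of visitors needed in each arm to detect the difference."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance
    z_beta = norm.ppf(power)           # desired power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2

# A small lift needs far more traffic than a large one, and not linearly so.
print(round(samples_per_arm(0.10, 0.11)))  # roughly 15,000 visitors per arm for a 1% absolute lift
print(round(samples_per_arm(0.10, 0.15)))  # roughly 700 visitors per arm for a 5% absolute lift

Divide the per-arm number by your daily traffic per arm and you have a duration estimate; that's usually enough to see whether "run it a bit longer" means another week or another year.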

Surprisingly, I've seen analysts who don't know how to do a sample size/duration estimate for an A/B test. That really isn't a good place to be when the business is relying on you for answers.

The missing math

Because I’m aiming for a more general audience, I’ve been careful here not to include equations. If you’re an analyst, you need to know:

  • What variance is and how to calculate it.
  • How sample size can affect results - you need to look for it everywhere.
  • How to estimate how much of what you're seeing is due to sample size effects and how much is due to something "real" (the sketch below shows the basic calculations).
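Here's a minimal sketch of those calculations on some simulated test scores; the key quantity is the standard error of the mean, which is what shrinks as the sample gets bigger.

import numpy as np

rng = np.random.default_rng()

for n_students in (100, 1_500):
    scores = rng.integers(1, 101, size=n_students)  # simulated percentage test scores
    variance = scores.var(ddof=1)                   # sample variance
    std_dev = scores.std(ddof=1)                    # sample standard deviation
    std_error = std_dev / np.sqrt(n_students)       # standard error of the mean
    print(f"n={n_students}: mean={scores.mean():.1f}, "
          f"variance={variance:.1f}, standard error of the mean={std_error:.2f}")
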
Unfortunately, references for the law of large numbers get too technical too quickly. A good place to start is with references that cover variance and standard deviation calculations. I like reference [5], but be aware that it is technical.

The bottom line

Sample size effects can be hidden in data; the language used and the way the data is presented can obscure what’s going on. You need to be acutely aware of these effects: you need to know how to calculate them and how they can manifest themselves in data in surprising ways.

References

[1] Howard Wainer, "The Most Dangerous Equation", https://www.americanscientist.org/article/the-most-dangerous-equation

[2] Jeremy Orloff, Jonathan Bloom, "Central Limit Theorem and the Law of Large Numbers", https://math.mit.edu/~dav/05.dir/class6-prep.pdf

[3] Gerald van Belle, "Statistical Rules of Thumb", http://www.vanbelle.org/struts.htm

[4] Gerald van Belle, "Statistical Rules of Thumb", Chapter 2 - Sample Size, http://www.vanbelle.org/chapters/webchapter2.pdf

[5] Steven Miller, "The Probability Lifesaver"



