How many samples of labeled data do you need?
It turns out that working out how many labeled samples you need to “correctly” build a supervised machine learning (ML) model is a hard question with no clear answer. In this blog post, I’m going to run through the issues and finish with some advice for people managing ML model building.
Why does it matter?
Sample size plays into two big related themes for ML models:
- Correctness. How accurately your model predicts results at a point in time.
- Reliability. How well your model’s performance holds up over time.
Small sample sizes tend to give models with lower correctness and worse performance over time. This is all tied up with variance and the “law of small numbers”.
Let’s say your manager comes to you and asks you to build an ML model on a data set. When do you express concern about the size of the data set? When it’s 10, 100, 1,000, 10,000, or 100,000 samples? What happens if your manager asks you to justify your concern?
For a correct, stable model, you typically need a “big enough” data set to train with, but how much is “big enough”?
What does sample size mean?
Before I dive into this some more, I should define what I mean by sample size. I mean the size of the labeled data set used for training a supervised machine learning model, excluding cross-validation and hold-out data sets. For example, if you hold out 20% of your data and use 80% of what’s left for training (keeping the rest for validation), only 0.8 × 0.8 = 0.64 of your data counts towards sample size.
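As a quick sketch of that bookkeeping (the split fractions and the total are just the example figures above, nothing more):

```python
# Effective training sample size after hold-out and validation splits.
# The 20%/80% fractions are the example figures from the text, not a standard.
total_samples = 10_000          # hypothetical labeled data set size
holdout_frac = 0.20             # reserved for the final hold-out set
train_frac_of_rest = 0.80       # share of the remainder used for training

effective_training_samples = total_samples * (1 - holdout_frac) * train_frac_of_rest
print(effective_training_samples)  # 6400.0, i.e. 0.64 of the original data
```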
Why is this a hard problem?
There’s very little in the literature, there’s almost nothing in the leading books on machine learning, and it’s only mentioned in passing on machine learning courses. It’s an area of active research, which means there’s nothing packaged for easy use.
I’ve spent hours searching for and reading papers, but I’ve not found anything immediately usable. What I did find is that the field that’s most advanced is medicine. Researchers are increasingly using ML models in clinical trials and they need to know how many patients to enroll. It seems that they’re mostly using statistical tests (see below) for sample size; however, some researchers are trying to develop robust statistical methods to estimate sample size independently. As of June 2025, there’s no consensus on the best approach.
What do other disciplines do?
In frequentist statistics, there’s a recipe for determining sample size given significance, power, and effect size for a single-comparison test (formally, a null-hypothesis test). The code exists in R and Python libraries, so all you have to do is put the numbers into a formula and out comes your minimum sample size. Everyone doing randomized controlled trials (RCTs, AKA A/B tests) works out the sample size before running a test.
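As a rough sketch, here’s how that single-comparison calculation might look in Python using statsmodels (one of the libraries I mean). The baseline proportion, relative lift, significance, and power mirror the figures quoted in the next paragraph, but the choice of test, the one- vs two-sided setting, and the per-group counting are my assumptions, so don’t expect the output to reproduce the table below exactly.

```python
# Minimum sample size for a two-proportion z-test, single comparison.
# Parameters mirror the example discussed below; the test details are
# assumptions, so the result is illustrative only.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.05                       # baseline proportion
treatment = baseline * 1.05           # 5% relative effect size
alpha = 0.05                          # significance level
power = 0.85                          # statistical power

effect = abs(proportion_effectsize(baseline, treatment))  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Minimum samples per group: {n_per_group:,.0f}")
```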
The nearest statistical equivalents to ML are multi-comparison null-hypothesis tests, which are really something different, but they do give you some idea of sample size. The math is more complex, and most people use something called the Bonferroni correction to go from single-comparison to multi-comparison testing. To give you an idea of the numbers, the table below shows the minimum sample size for a proportion z-test with a significance level of 5%, a power of 85%, a baseline proportion of 5%, and a 5% effect size, with the Bonferroni correction applied.
| Comparisons | Sample size |
|---|---|
| 1 | 272,944 |
| 2 | 409,416 |
| 3 | 545,888 |
| 4 | 682,360 |
| 5 | 818,832 |
| ... | ... |
Two things stand out here: the sample size starts at 272,944, and it goes up for each comparison you add. If this is any kind of guide, we’re looking at ML sample sizes in the high hundreds of thousands.
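To make the Bonferroni step concrete, here’s a sketch that extends the single-comparison calculation above by dividing the significance level by the number of comparisons. The test details are the same assumptions as before, so the numbers it prints won’t necessarily match the table; the point is the pattern of growth.

```python
# Sample size as the number of comparisons grows, using a Bonferroni-corrected
# significance level (alpha / m). Same illustrative assumptions as the earlier
# single-comparison sketch.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline, lift, alpha, power = 0.05, 1.05, 0.05, 0.85
effect = abs(proportion_effectsize(baseline, baseline * lift))

for m in range(1, 6):
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha / m, power=power, alternative="two-sided"
    )
    print(f"{m} comparison(s): {n_per_group:,.0f} samples per group")
```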
Notably, the sample size for a null-hypothesis test depends on the effect size; a big effect needs a smaller sample. This is why most drug trials have sample sizes in the low hundreds: the effect they’re looking for is large. Conversely, in retail, effect sizes can be small, leading to sample sizes in the high hundreds of thousands or even millions. This might be an important clue for ML sample sizes.
What rules of thumb are there?
The general consensus is that if you have n samples and f features, then n >> f. I’ve heard people talk about a 50x, 100x, or 1,000x ratio as the minimum. So, if you have 5 features, you need a minimum of 250–5,000 samples. But even this crude figure might not be enough, depending on the model.
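For what it’s worth, here’s the rule of thumb as a trivial sketch; the multipliers are just the ones quoted above, not an established standard.

```python
# Crude 'multiplier x features' rule of thumb; the multipliers are the ones
# mentioned in the text, nothing more official.
def rule_of_thumb_min_samples(n_features: int, multiplier: int) -> int:
    return n_features * multiplier

for multiplier in (50, 100, 1000):
    print(f"{multiplier}x rule, 5 features: "
          f"at least {rule_of_thumb_min_samples(5, multiplier):,} samples")
```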
What do people do in practice?
I’ve never come across a data scientist who estimates the needed sample size. People use the cost function instead: if the cost function is “good enough”, this suggests the sample size is good enough too. There are variations on this, with people using confusion matrices, precision-recall, and so on as “proxies”; if the metric is good enough, the sample size is good enough.
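One slightly more honest version of that metric-based check is to look at how the metric changes as the model sees more data: if the validation score is still climbing at your full sample size, the data set probably isn’t big enough yet. Here’s a sketch using scikit-learn’s learning_curve; the data set and classifier are placeholders, not anything from the post.

```python
# Sketch: use a learning curve to see whether the validation metric has
# plateaued at the full training set size. Synthetic data and a simple
# classifier are stand-ins for whatever you're actually modeling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2_000, n_features=5, random_state=0)

train_sizes, _, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy"
)

for size, scores in zip(train_sizes, valid_scores):
    print(f"{size:>5} training samples: mean CV accuracy {scores.mean():.3f}")
```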
But relying on the cost function or metrics alone isn’t enough. I’ve seen people develop models using under a hundred samples with over five features. The cost function results were OK, but as you might expect, the model wasn’t very robust and gave poor results some of the time.
Let me draw a comparison with an RCT to evaluate a new drug. All trials have an initial estimate of the sample size needed, but let’s say they didn’t and relied on metrics alone (e.g., fraction of patients cured). Do you think this would be OK? Would you take the drug? Would you take it if the sample size was 10, 100, or 1,000 patients? Or would you prefer there to be a robust estimate of the sample size?
My recommendations
The situation isn’t very satisfactory. Frequentist statistics suggests hundreds of thousands of samples, which looks very different from the 50x–1,000x rule of thumb, and even that rule of thumb gives a huge range of answers. Using the cost function or metrics alone doesn’t feel very safe either.
I’m not in a position to give a robust statistical recipe to calculate sample size. All I can do is offer some advice. Take it for what it's worth.
- Ideally, have a sample size of at least 100,000, but make sure you have at least 1,000x as much data as you have features. If you really have to model with fewer than 100,000 samples, recognize you're on very slippery ground.
- Run a feature importance analysis (see the sketch after this list). If you have many features each with a small contribution, that’s a warning sign; you should consider increasing your sample size.
- Regularly check your model’s performance and have pre-determined thresholds for taking action.
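On the feature importance point, here’s a minimal sketch of the kind of check I mean, using scikit-learn’s permutation importance; the data, model, and split are placeholders, and your own validation set and model go in their place.

```python
# Sketch: permutation importance as a quick feature-importance check.
# Many features each contributing only a little is the warning sign
# mentioned above. Data and model here are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: importance {importance:.3f}")
```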