
Monday, April 14, 2025

Why a lot of confidence intervals are wrong

Lots of things are proportions

In statistics, a proportion is a number that can vary from 0 to 1. Proportions come up all the time in business; here are just a few examples.

  • Conversion rates on websites (fraction of visitors who buy something).
  • Opinion poll results (e.g. fraction of businesses who think the economy will improve in the next six months).
  • Market share.
If you can show something meaningful on a pie chart, it's probably a proportion.

(Amousey, CC0, via Wikimedia Commons)

Often, these proportions are quoted with a confidence interval or margin of error, so you hear statements like "42% said they would vote for X and 44% for Y. The survey had a 3% margin of error". In this blog post, I'm going to show you why the confidence interval, or margin of error, can be very wrong in some cases.

We're going to deal with estimates of the actual mean. In many cases, we don't know the true (population) mean; we're estimating it from a sample. The mean of our sample is our best guess at the population mean, and the confidence interval gives us an indication of how confident we are in our estimate. But as we'll see, the usual confidence interval calculation can go very wrong.

We're going to start with some textbook math, then I'm going to show you when it goes badly astray, then we're going to look at a more meaningful way forward.

Estimating the mean and the confidence interval

Estimating the population mean is straightforward and very intuitive. Let's take a simple example to help explain the math. Imagine a town with 38,000 residents who will vote on whether the town government should build a new fire station. We'll call the actual vote result (the proportion in favor of the fire station) the population mean. You want to forecast the result of the vote, so you run a survey; the proportion you get from the survey is a sample mean. Let's say you survey 500 people (the sample size) and 350 say yes (the number of successes). Assuming the survey is unbiased, our best estimate of the population mean is given by the sample mean:

\(\hat{p} = \dfrac{m}{n} = \dfrac{350}{500} = 0.7\)

But how certain are we of this number? If we had surveyed all 38,000 residents, we'd get a very, very accurate number, but the cost of a survey goes up with the number of respondents. On the other hand, if we asked only 10 residents, our results wouldn't be very accurate. So how many people do we need to ask? Another way of saying this is: how certain are we that our sample mean is close to the population mean?

The textbook approach to answering this question is to use a confidence interval. To greatly simplify, the confidence interval is two numbers (an upper and lower number) between which we think there's a 95% probability the population mean lies. The probability doesn't have to be 95%, but that's the usual choice. The other usual choice is to express the confidence interval relative to the sample mean, so the lower bound is the sample mean minus a value, and the upper bound is the sample mean plus the same value. For our fire station example, we might say something like \(0.7 \pm 0.04\), which is a 4% margin of error.

Here's the formula:

\(\hat{p} \pm z_{\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)

You sometimes hear people call this the Wald interval, named after Abraham Wald. The symbol \(z_{\frac{\alpha}{2}}\) comes from the normal distribution, and for a 95% confidence interval, it's close to 1.96. This formula is an approximation. It's been used for decades because it's easy to use and cheap to calculate, which was important when computations were expensive.

Let's plug some numbers into the Wald formula as an example. Going back to our fire station opinion poll, we can put the numbers in and get a 95% confidence interval. Here's how it works out:

\(0.7 \pm 1.96 \sqrt{\dfrac{0.7(1-0.7)}{500}} = 0.7 \pm 0.04\)

So we think our survey is pretty accurate: we're 95% sure the real mean is between 0.66 and 0.74. This is exactly the calculation people use for opinion polls; in our case, the margin of error is 4%.
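Here's a minimal Python sketch of the Wald calculation (the function name wald_interval is mine, and z defaults to 1.96 for a 95% interval):

import math

def wald_interval(m, n, z=1.96):
    """Wald interval for m successes out of n observations."""
    p_hat = m / n  # the sample mean
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Fire station example: 350 yes answers from 500 respondents
lower, upper = wald_interval(350, 500)
print(f"95% Wald interval: {lower:.3f} to {upper:.3f}")  # about 0.660 to 0.740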

So far so good, but there are problems...

(The actual meaning of the confidence interval is more nuanced and more complicated. If we were to repeat the survey an infinite number of times and generate an infinite number of confidence intervals, then 95% of the confidence intervals would contain the population mean. This definition gets us into the deeper meaning of statistics and is harder to understand, so I've given the usual 'simpler' explanation above. Just be aware that this stuff gets complicated and language matters a lot.) 

It all goes wrong at the extremes - and the extremes happen a lot

What most of the textbooks don't tell you is that the formula for the confidence interval is an approximation and that it breaks down:

  • when \(\hat{p}\) is close to 0 or 1.
  • when n is small.

Unfortunately, in business, we often run into these cases. Let's take a look at a conversion rate example. Imagine we run a very short test and find that of 100 website visitors, only 2 converted. We can express our conversion rate as:

\(0.02 \pm 1.96 \sqrt{\dfrac{0.02(1-0.02)}{100}} = 0.02 \pm 0.027\)

Before we go on, stop and look at this result. Can you spot the problem?

The confidence interval goes from -0.007 to 0.047. In other words, we're saying there's a probability the conversion rate is negative. This is plainly absurd.

Let's take another example. Imagine we want to know the proportion of dog lovers in a town of cat lovers. We ask 25 people whether they love cats or dogs, and all 25 say cats. Here's our estimate of the proportions of cat lovers and dog lovers:

Dog lovers = \(0 \pm 1.96 \sqrt{\dfrac{0.0(1-0)}{25}} = 0 \pm 0\)

Cat lovers = \(1 \pm 1.96 \sqrt{\dfrac{1(1-1)}{25}} = 1 \pm 0\)

These results suggest we're 100% sure everyone is a cat lover and no one is a dog lover. Does this really seem sensible to you? Instead of cats and dogs, imagine it's politicians. Even in areas that vote heavily for one party, there are some supporters of other parties. Intuitively, our confidence interval shouldn't be zero.
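You can see both failures directly by plugging these cases into the wald_interval sketch from earlier:

# Reusing the wald_interval sketch from the fire station example
print(wald_interval(2, 100))   # about (-0.007, 0.047): a negative lower bound
print(wald_interval(0, 25))    # (0.0, 0.0): zero-width interval for dog lovers
print(wald_interval(25, 25))   # (1.0, 1.0): zero-width interval for cat lovers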

The Wald interval breaks down because it's based on an approximation. When the approximation no longer holds, you get nonsense results. 

In the next section, I'll explain how you can do better.

(I've seen "analysts" with several years' experience argue that these types of results are perfectly fine. They didn't understand the math, but they were willing to defend obviously wrong results because they came out of a formula they knew. This is really bad for business; Amazon would never make these kinds of mistakes and neither should your business.)

A better alternative #1: Wilson score intervals

The Wilson score interval makes a different set of approximations than the Wald interval, making it more accurate but more complicated to calculate. I'm going to avoid the theory for now and jump straight into the formula:

\(\dfrac{\hat{p} + \dfrac{z^2_{\frac{\alpha}{2}}}{2n} \pm z_{\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2_{\frac{\alpha}{2}}}{4n^2}}}{1 + \dfrac{z^2_{\frac{\alpha}{2}}}{n}}\)

This is a scary-looking formula and it's much harder to implement than the Wald interval, but the good news is, there are several implementations in Python. I'll show you two: the first uses statsmodels and the second uses scipy.

from scipy import stats
from statsmodels.stats.proportion import proportion_confint

# Sample data
n = 100  # number of observations
k = 2    # number of successes

# Calculate the Wilson score interval using statsmodels
wilson_ci = proportion_confint(k, n, alpha=0.05, method='wilson')
print("Wilson Score Interval (statsmodels):")
print(f"Lower bound: {wilson_ci[0]:.4f}")
print(f"Upper bound: {wilson_ci[1]:.4f}")

# Calculate the Wilson score interval using scipy's binomtest
# (proportion_ci defaults to a 95% confidence level)
wilson_ci_scipy = stats.binomtest(k, n).proportion_ci(method='wilson')
print("\nWilson Score Interval (scipy):")
print(f"Lower bound: {wilson_ci_scipy.low:.4f}")
print(f"Upper bound: {wilson_ci_scipy.high:.4f}")

As you might expect, the two methods give the same results. 
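If you'd rather check the formula itself, here's a from-scratch sketch of the Wilson calculation (the function name is mine; z defaults to 1.96 for a 95% interval):

import math

def wilson_interval(m, n, z=1.96):
    """Wilson score interval for m successes out of n observations."""
    p_hat = m / n
    denominator = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denominator
    return center - half_width, center + half_width

print(wilson_interval(2, 100))  # about (0.0055, 0.0700), matching the libraries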

For the conversion rate example (100 visitors, 2 purchases), we get a lower bound of 0.0055 and an upper bound of 0.0700, which is an improvement because the lower bound is above zero. The Wilson score interval makes sense.

For the cats and dogs example, we get lower=0 and upper=0.1332 for dogs, and lower=0.8668 and upper=1 for cats. This seems much better too: we've allowed for the town to have dog lovers in it, which chimes with our intuition.

The Wilson score interval has several neat properties:

  • It will never go below 0.
  • It will never go above 1.
  • It gives accurate answers when n is small and when \(\hat{p}\) is close to 0 or 1.
  • The Wald interval will sometimes collapse to a single value (a zero-width interval); the Wilson score interval always gives you a range (which is what you want).
  • The Wilson score interval is close to the Wald interval for large n when \(\hat{p}\) is near 0.5.

You can read more about the Wilson score interval in this excellent blog post: https://www.econometrics.blog/post/the-wilson-confidence-interval-for-a-proportion/ Take a look at the charts; they show that the Wilson score interval gives much more accurate results for small n and when \(\hat{p}\) is close to 0 or 1.

This reference provides a fuller explanation of the theory: https://www.mwsug.org/proceedings/2008/pharma/MWSUG-2008-P08.pdf

A better alternative #2: Agresti-Coull

The Agresti-Coull interval is another alternative, like the Wilson score interval. Again, it's based on a different set of approximations and a very simple idea: take the data and add two success observations and two failure observations. Using the labels I gave you earlier, where m is the number of successes and n the total number of measurements, the Agresti-Coull interval uses m + 2 and n + 4. Here's what it looks like in code:

# Calculate the Agresti-Coull interval using statsmodels
ag_ci = proportion_confint(k, n, alpha=0.05, method='agresti_coull')
print("Agresti-Coull Interval (statsmodels):")
print(f"Lower bound: {ag_ci[0]:.4f}")
print(f"Upper bound: {ag_ci[1]:.4f}")

The Agresti-Coull interval is an approximation to the Wilson score interval, so unless there's a computational reason to do something different, you should use the Wilson score interval.

Other alternatives

As well as Wilson and Agresti-Coull, there are a bunch of alternatives, including Clopper-Pearson, Jeffreys (Bayesian), and more. Most libraries have a range of methods you can apply.
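As a quick sketch (reusing the proportion_confint import and the k and n values from the Wilson example), you can compare several of statsmodels' methods side by side. Note that 'beta' is statsmodels' name for Clopper-Pearson and 'normal' is the Wald interval:

# Compare several confidence interval methods on the same data (k=2, n=100)
for method in ['normal', 'wilson', 'agresti_coull', 'beta', 'jeffreys']:
    low, high = proportion_confint(k, n, alpha=0.05, method=method)
    print(f"{method:>14}: {low:.4f} to {high:.4f}")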

What to do

Generally speaking, be sure to know the limitations of all the statistical methods you use and select the right methods for your data. Don't assume that something is safe to use because "everyone" is using it. Occasionally, the methods you use will flag up junk results (e.g. implying a negative conversion rate). If this happens to you, it should be a sign that your algorithms have broken down and that it's time to go back to theory.

For proportions, if your sample proportion is "close" to 0.5 and your sample size is large (say, over 100), the Wald interval is fine. Otherwise, use the Wilson score interval. If you have to use one and only one method, use the Wilson score interval.
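If you want to encode that rule of thumb, here's one minimal sketch; the exact thresholds for "close to 0.5" and "large" are judgment calls of mine, not a standard:

def choose_ci_method(p_hat, n):
    # Wald ('normal' in statsmodels) only for large samples with p close to 0.5;
    # the Wilson score interval everywhere else
    if n > 100 and 0.3 < p_hat < 0.7:
        return 'normal'
    return 'wilson'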