Lots of things are proportions
In statistics, a proportion is a number that can vary from 0 to 1. Proportions come up all the time in business; here are just a few examples.
- Conversion rates on websites (fraction of visitors who buy something).
- Opinion poll results (e.g. fraction of businesses who think the economy will improve in the next six months).
- Market share.
Estimating the mean and the confidence interval
Estimating the population mean is straightforward. Let's take a simple example to help explain the math. Imagine a town with 38,000 residents who will vote on whether the town government should build a new fire station. We'll call the actual vote result (the proportion in favor of the fire station) the population mean. You want to forecast the result of the vote, so you run a survey; the proportion you get from the survey is the sample mean. Let's say you survey 500 people (the sample size) and 350 say yes (the number of successes). Assuming the survey is unbiased, our best estimate of the population mean is given by the sample mean:
\(\hat{p} = \dfrac{m}{n} = \dfrac{350}{500} = 0.7\)
But how certain are we of this number? If we had surveyed all 38,000 residents, we'd probably get a very, very accurate number, but the cost of the survey goes up with the number of respondents. On the other hand, if we asked 10 residents, our results aren't likely to be accurate. So how many people do we need to ask? Another way of saying this is, how certain are we that our sample mean is close to the population mean?
The textbook approach to answering this question is to use a confidence interval. To greatly simplify, the confidence interval is two numbers (a lower and an upper bound) between which we think there's a 95% probability the population mean lies. The probability doesn't have to be 95%, but that's the usual choice. It's also usual to express the confidence interval relative to the sample mean, so the lower bound is the sample mean minus a value, and the upper bound is the sample mean plus the same value. For our fire station example, we might say something like \(0.7 \pm 0.04\), which is a 4% margin of error.
Here's the formula:
\(\hat{p} \pm z_{\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)
You sometimes hear people call this the Wald interval, named after Abraham Wald. The symbol \(z_{\frac{\alpha}{2}}\) comes from the normal distribution, and for a 95% confidence interval, it's close to 1.96. This formula is an approximation. It's been used for decades because it's easy to use and cheap to calculate, which was important when computations were expensive.
Let's plug some numbers into the Wald formula as an example. Going back to our fire station opinion poll, we can put the numbers in and get a 95% confidence interval. Here's how it works out:
\(0.7 \pm 1.96 \sqrt{\dfrac{0.7(1-0.7)}{500}} = 0.7 \pm 0.04\)
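To make the arithmetic concrete, here's the same Wald calculation as a few lines of Python (standard library only; the function name is my own):

```python
from math import sqrt

def wald_interval(m, n, z=1.96):
    """Wald confidence interval for a proportion: m successes out of n trials."""
    p_hat = m / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Fire station poll: 350 "yes" answers from 500 respondents.
low, high = wald_interval(350, 500)
print(low, high)  # roughly 0.66 and 0.74, i.e. 0.7 +/- 0.04
```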
So far so good, but there are problems...
(The actual meaning of the confidence interval is more nuanced and more complicated. If we were to repeat the survey an infinite number of times and generate an infinite number of confidence intervals, then 95% of the confidence intervals would contain the population mean. This definition gets us into the deeper meaning of statistics and is harder to understand, so I've given the usual 'simpler' explanation above. Just be aware that this stuff gets complicated and language matters a lot.)
It all goes wrong at the extremes - and the extremes happen a lot
What most of the textbooks don't tell you is that the formula for the confidence interval is an approximation and that it breaks down:
- when \(\hat{p}\) is close to 0 or 1.
- when n is small.
Unfortunately, in business, we often run into these cases. Let's take a look at a conversion rate example. Imagine we run a very short test and find that, from 100 website visitors, only 2 converted. We can express our conversion rate as:
\(0.02 \pm 1.96 \sqrt{\dfrac{0.02(1-0.02)}{100}} = 0.02 \pm 0.027\)
Before we go on, stop and look at this result. Can you spot the problem? The lower bound is \(0.02 - 0.027 = -0.007\): a negative conversion rate, which is impossible for a proportion.
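We can check the arithmetic with a couple of lines of Python:

```python
from math import sqrt

# Conversion rate example: 2 conversions from 100 visitors, Wald interval.
p_hat, n, z = 0.02, 100, 1.96
margin = z * sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - margin
print(lower)  # about -0.007: a negative conversion rate!
```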
Here's an even more extreme case. Imagine we survey 25 people, asking whether they're dog lovers or cat lovers, and all 25 say cats. The Wald formula gives:

Dog lovers = \(0 \pm 1.96 \sqrt{\dfrac{0(1-0)}{25}} = 0 \pm 0\)
Cat lovers = \(1 \pm 1.96 \sqrt{\dfrac{1(1-1)}{25}} = 1 \pm 0\)
These results suggest we're 100% sure everyone is a cat lover and no one is a dog lover. Does this really seem sensible to you? Instead of cats and dogs, imagine it's politicians. Even in areas that vote heavily for one party, there are some supporters of other parties. Intuitively, the width of our confidence interval shouldn't be zero.
The Wald interval breaks down because it's based on an approximation. When the approximation no longer holds, you get nonsense results.
In the next section, I'll explain how you can do better.
(I've seen "analysts" with several years' experience argue that these types of results are perfectly fine. They didn't understand the math, but they were willing to defend obviously wrong results because they came out of a formula they knew. This is really bad for business; Amazon would never make these kinds of mistakes and neither should your business.)
A better alternative #1: Wilson score intervals
The Wilson score interval makes a different set of approximations than the Wald interval, making it more accurate but more complicated to calculate. I'm going to avoid the theory for now and jump straight into the formula (writing \(z\) for \(z_{\frac{\alpha}{2}}\)):

\(\dfrac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} \pm \dfrac{z}{1 + \frac{z^2}{n}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}\)
This is a scary-looking formula and it's much harder to implement than the Wald interval, but the good news is there are several implementations in Python. I'll show you two: the first using statsmodels and the second using scipy.
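Here's a sketch of both, using statsmodels' `proportion_confint` and scipy's `binomtest` (note that the `proportion_ci` method needs scipy 1.7 or later):

```python
from statsmodels.stats.proportion import proportion_confint
from scipy.stats import binomtest

# Conversion rate example: 2 purchases from 100 visitors.
m, n = 2, 100

# statsmodels: method="wilson" selects the Wilson score interval.
sm_low, sm_high = proportion_confint(m, n, alpha=0.05, method="wilson")

# scipy: binomtest's result object can compute the same interval.
sp = binomtest(m, n).proportion_ci(confidence_level=0.95, method="wilson")

print(sm_low, sm_high)  # about 0.0055 and 0.0700
print(sp.low, sp.high)  # same values
```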
As you might expect, the two methods give the same results.
For the conversion rate example (100 visitors, 2 purchases), we get a lower bound of 0.0055 and an upper bound of 0.0700, which is an improvement because the lower bound is above zero. The Wilson score interval makes sense.
For the cats and dogs example, we get for dogs: lower=0, upper=0.1332; for cats: lower=0.8668, upper=1. This seems much better too. We've allowed for the town to have dog lovers in it, which chimes with our intuition.
The Wilson score interval has several neat properties:
- It will never go below 0.
- It will never go above 1.
- It gives accurate answers when n is small and when \(\hat{p}\) is close to zero or 1.
- The Wald interval will sometimes give you a single value, the Wilson score interval will always give you two (which is what you want).
- The Wilson score interval is close to the Wald interval for large n when \(\hat{p}\) is not too close to 0 or 1.
You can read more about the Wilson score interval in this excellent blog post: https://www.econometrics.blog/post/the-wilson-confidence-interval-for-a-proportion/ Take a look at the charts; they show that the Wilson score interval gives much more accurate results for small n and when \(\hat{p}\) is close to zero or 1.
This reference provides a fuller explanation of the theory: https://www.mwsug.org/proceedings/2008/pharma/MWSUG-2008-P08.pdf
A better alternative #2: Agresti-Coull
The Agresti-Coull interval is another alternative, similar in spirit to the Wilson score interval. Again, it's based on a different set of approximations and a very simple idea. The starting point is to take the data and add two success observations and two failure observations. Using the labels I gave you earlier: m is the number of successes and n the total number of measurements, so the Agresti-Coull interval uses m + 2 and n + 4. Here's what it looks like in code:
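Here's a sketch of the textbook "add 2 and 4" version, with statsmodels' built-in version for comparison (statsmodels adds \(z^2/2 \approx 1.92\) successes and \(z^2 \approx 3.84\) trials rather than exactly 2 and 4, so its numbers are very close but not identical):

```python
from math import sqrt
from statsmodels.stats.proportion import proportion_confint

def agresti_coull(m, n, z=1.96):
    """Textbook Agresti-Coull: add 2 successes and 2 failures, then apply Wald."""
    p_tilde = (m + 2) / (n + 4)  # adjusted proportion
    margin = z * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde - margin, p_tilde + margin

# Conversion rate example again: 2 purchases from 100 visitors.
ac_low, ac_high = agresti_coull(2, 100)
print(ac_low, ac_high)  # roughly 0.0015 and 0.0754

# statsmodels' slightly different adjustment gives very similar numbers.
print(proportion_confint(2, 100, alpha=0.05, method="agresti_coull"))
```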
The Agresti-Coull interval is an approximation to the Wilson score interval, so unless there's a computational reason to do something different, you should use the Wilson score interval.
Other alternatives
As well as Wilson and Agresti-Coull, there are a bunch of alternatives, including Clopper-Pearson, Jeffreys (Bayesian), and more. Most libraries have a range of methods you can apply.
What to do
Generally speaking, be sure to know the limitations of all the statistical methods you use and select the right methods for your data. Don't assume that something is safe to use because "everyone" is using it. Occasionally, the methods you use will flag up junk results (e.g. implying a negative conversion rate). If this happens to you, it's a sign that your methods have broken down and that it's time to go back to the theory.
For proportions, if your sample proportion is "close" to 0.5 and your sample size is large (say, over 100), the Wald interval is fine. Otherwise, use the Wilson score interval. If you have to use one and only one method, use the Wilson score interval.
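That rule of thumb can be sketched as a small helper built on statsmodels; the cutoffs (0.3-0.7 for "close to 0.5", 100 for "large") are my own illustrative choices, not hard rules:

```python
from statsmodels.stats.proportion import proportion_confint

def proportion_ci(m, n, alpha=0.05):
    """Pick Wald for large, well-behaved samples; Wilson otherwise.
    The cutoffs below are illustrative, not canonical."""
    p_hat = m / n
    if n > 100 and 0.3 <= p_hat <= 0.7:
        method = "normal"  # statsmodels' name for the Wald interval
    else:
        method = "wilson"
    return proportion_confint(m, n, alpha=alpha, method=method)

print(proportion_ci(350, 500))  # Wald: roughly (0.66, 0.74)
print(proportion_ci(2, 100))    # Wilson: roughly (0.0055, 0.0700)
```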