What is truth?
Statistical testing is ultimately all about probabilities and thresholds for believing an effect is there or not. These thresholds and associated ideas are crucial to the decision-making process but are widely misunderstood and misapplied. In this blog post, I'm going to talk about three testing concepts: confidence, significance, and p-values; I'll deal with the hugely important topic of statistical power in a later post.
Types of error
To simplify, there are two kinds of errors in statistical testing:
- Type I - false positive. You say there's an effect when there isn't. This is decided by a threshold \(\alpha\), usually set to 5%. \(\alpha\) is called significance.
- Type II - false negative. You say there isn't an effect but there is. This is decided by a threshold \(\beta\) but is usually expressed as the statistical power which is \(1 - \beta\).
In this blog post, I'm going to talk about the first kind of error, Type I.
Distribution and significance
Let's imagine you're running an A/B test on a website and you're measuring conversion on the A branch (\(c_A\)) and on the B branch (\(c_B\)). The null hypothesis is that there's no effect, which we can write as:
\[H_0: c_A - c_B = 0\]
This next piece is a little technical but bear with me. Tests of conversion are usually large tests (mostly, > 10,000 samples in practice). The conversion rate is the mean conversion for all website visitors. Because there are a large number of samples, and we're measuring a mean, the Central Limit Theorem (CLT) applies, which means the mean conversion rates will be normally distributed. By extension from the CLT, the quantity \( c_A - c_B\) will also be normally distributed. If we could take many measurements of \( c_A - c_B\) and the null hypothesis was true, we would theoretically expect the results to look like something like this.
Look closely at the chart. Although I've cut off the x-axis, the values go off to \(\pm \infty\). If all values of \( c_A - c_B\) are possible, how can we reject the null hypothesis and say there's an effect?
Significance - \(\alpha\)
To decide if there's an effect there, we use a threshold. This threshold is referred to as the level of significance and is called \(\alpha\). It's usually set at the 0.05 or 5% level. Confusingly, sometimes people refer to confidence instead, which is 1 - significance, so a 5% significance level corresponds to a 95% confidence level.
In the chart below, I've colored the 95% region around the mean value blue and the 5% region (2.5% at each end) red. The blue region is called the acceptance region and the red region is called the rejection region.
What we do is compare the measurement we actually make with the chart. If our measurement lands in the red zone, we decide there's an effect there (reject the null), if our measurement lands in the blue zone, we'll decide there isn't an effect there (fail to reject the null or 'accept' the null).
One-sided or two-sided tests
On the chart with the blue and the red region, there are two rejection (red) regions. This means we'll reject the null hypothesis if we get a value that's more than a threshold above or below our null value. In most tests, this is what we want; we're trying to detect if there's an effect there or not and the effect can be positive or negative. This is called a two-sided test because we can reject the null in two ways (too negative, too positive).
But sometimes, we only want to detect if the treatment effect is bigger than the control. This is called a one-sided test. Technically, the null hypothesis in this case is:
\[H_0: c_A - c_B \leq 0\]
Graphically, it looks like this:
So we'll reject the null hypothesis only if our measured value lands in the red region on the right. Because there's only one rejection region and it's on one side of the chart, we call this a one-sided test.
p-values
I've very blithely talked about measured values landing in the rejection region or not. In practice, that's not what we do; in practice, we use p-values.
Let's say we measured some value x. What's the probability we would measure this value if the null hypothesis were true (in other words, if there were no effect)? Technically, zero because the distribution is continuous, but that isn't helpful. Let's try a more helpful form of words. Assuming the null hypothesis is true, what's the probability we would see a value of x or a more extreme value? Graphically, this looks something like the green area on the chart below.
Let me say again what the p-value is. If there's no effect at all, it's the probability we would see the result (or a more extreme result) that we measured, or how likely is it that our measurement could have been due to chance alone?
The chart below shows the probability the null hypothesis is true; it shows the acceptance region (blue), the rejection region (red), and the measurement p-value (green). The p-value is in the rejection region, so we'll reject the null hypothesis in this case. If the green overlapped with the blue region, we would accept (fail to reject) the null hypothesis.
Misunderstandings
There are some common misunderstandings around testing that can have quite bad commercial effects.
- 95% confidence is too high a bar - we should drop the threshold to 90%. In effect, this means you'll accept a lot of changes that have no effect. This will reduce the overall effectiveness of your testing program (see this prior blog post for an explanation).
- One-sided tests are OK and give a smaller sample size, so we should use them. This is true, but it's often important to determine if a change is having a negative effect. In general, hypothesis testing tests a single hypothesis, but sadly, people try and read more into test results than they should and want to answer several questions with a single question.
- p-values represent the probability of an effect being present. This is just not true at all.
- A small p-value indicates a big effect. p-values do not make any indication about the size of an effect; a low p-value does not mean there's a big effect.
Practical tensions
In practice, there can be considerable tension between business and technical people over statistical tests. A lot of statistical practices (e.g. 5% significance levels, two-sided testing) are based on experience built up over a long time. Unfortunately, this all sounds very academic to the business person who needs results now and wants to take shortcuts. Sadly, in the long run, shortcuts always catch you up. There's an old saying that's very true: "there ain't no such thing as a free lunch."