The fundamental question in any experiment is if there's an effect.
In hypothesis testing, there are two kinds of errors:
- Type I - we say there's an effect when there isn't. The threshold here is α.
- Type II - we say there's no effect when there really is an effect. The threshold here is β.

This blog post is all about explaining and calculating β.
The null hypothesis
Let's say we do an A/B test to measure the effect of a change to a website. Our control branch is the A branch and the treatment branch is the B branch. We're going to measure the conversion rate of each branch. Our two hypotheses are:

- The null hypothesis (H0) - there is no difference between the branches.
- The alternate hypothesis (H1) - there is a difference between the branches.
Remember, we don't know if there really is an effect, we're using procedures to make our best guess about whether there is an effect or not, but we could be wrong. We can say there is an effect when there isn't (Type I error) or we can say there is no effect when there is (Type II error).
Mathematically, we're taking the mean of thousands of samples, so the central limit theorem (CLT) applies and we expect the quantity we're measuring - the difference in conversion rates between the branches - to be normally distributed.
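A quick simulation makes the CLT claim concrete. This is a minimal sketch with made-up numbers (a 5% conversion rate and smaller branch sizes than a real test, so it runs quickly): when both branches are identical, the difference in conversion rates clusters around zero with a spread that matches the combined standard error.

```python
import math
import random
import statistics

random.seed(1)

n = 2_000      # visitors per branch (kept small so the simulation is fast)
p = 0.05       # hypothetical true conversion rate, same for both branches
trials = 500   # number of simulated A/B tests

def conversion_rate(n: int, p: float) -> float:
    """Simulate n visitors, each converting with probability p."""
    return sum(random.random() < p for _ in range(n)) / n

# Simulate many A/B tests in which there is no real effect.
diffs = [conversion_rate(n, p) - conversion_rate(n, p) for _ in range(trials)]

# The CLT says the difference is ~normal with mean 0; each branch
# contributes a variance of p*(1-p)/n, and variances add.
predicted_se = math.sqrt(2 * p * (1 - p) / n)
print(f"mean of differences:     {statistics.mean(diffs):+.5f}")  # close to 0
print(f"std dev of differences:  {statistics.stdev(diffs):.5f}")
print(f"CLT-predicted st. error: {predicted_se:.5f}")
```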
α in a picture
Let's assume there is no effect. We can plot out our expected probability distribution and define an acceptance region (blue, 95% of the distribution) and two rejection regions (red, 5% of the distribution). If our measured value lands in a red region, we reject the null hypothesis and say there is an effect. The red area is α, the probability of a Type I error.
One way of looking at the blue area is to think of it as a confidence interval around the mean x̄:

x̄ − 1.96s ≤ x ≤ x̄ + 1.96s

In this equation, s is the standard error in our measurement. The probability of a measurement x lying within this interval is 0.95.

If we transform our measurement to the standard normal distribution using z = (x − x̄)/s, the acceptance region becomes −1.96 ≤ z ≤ 1.96.
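We can check the 95%/5% split numerically. Here's a small sketch using the standard normal CDF (written here as phi, built from the error function in the standard library):

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A measurement x is accepted if it lies within 1.96 standard errors
# of the null-hypothesis mean; equivalently, |z| <= 1.96 after the
# transform z = (x - x_bar) / s.
z_crit = 1.96
acceptance = phi(z_crit) - phi(-z_crit)   # blue region
alpha = 1.0 - acceptance                  # the two red rejection regions

print(f"acceptance region: {acceptance:.4f}")  # ~0.95
print(f"alpha:             {alpha:.4f}")       # ~0.05
```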
β in a picture
Now let's assume there is an effect. How likely is it that we'll say there's no effect when there really is an effect? This is the threshold β.
To draw this in pictures, I want to take a step back. We have two hypotheses:
- The null hypothesis (H0) - there is no difference between the branches.
- The alternate hypothesis (H1) - there is a difference between the branches.
We can draw a distribution for each of these hypotheses. Only one distribution will apply, but we don't know which one.
If the null hypothesis is true, the blue region is where our true negatives lie and the red region is where the false positives lie. The boundaries of the red/blue regions are set by α.
If the alternate hypothesis is true, the true positives will be in the green region and the false negatives will be in the orange region. The boundary of the green/orange regions is set by β.
Calculating β

To calculate β, we assume the alternate hypothesis is true and work out the probability that our measurement still falls inside the null hypothesis's acceptance region.
Let's take an example so I can show you the process step by step.
- Assuming the null hypothesis, set up the boundaries of the acceptance and rejection regions. Assuming a 95% acceptance region and an estimated mean of x̄, this gives the acceptance region as:

x̄ − 1.96s ≤ x ≤ x̄ + 1.96s

which is the mean and 95% confidence interval for the null hypothesis. Our measurement must lie between these bounds.
- Now assume the alternate hypothesis is true. If the alternate hypothesis is true, then our mean is x̄_alt.
- We're still using the acceptance region from before, but this time, our distribution is the alternate hypothesis's distribution, so:

β = P(x̄ − 1.96s ≤ x ≤ x̄ + 1.96s), with x drawn from the alternate distribution

- Transforming to the standard normal distribution using the formula z = (x − x̄_alt)/s, we can write the probability as:

β = Φ((x̄ + 1.96s − x̄_alt)/s) − Φ((x̄ − 1.96s − x̄_alt)/s)

where Φ is the standard normal CDF.
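That final formula translates directly into code. Here's a sketch of a beta function implementing it; the example numbers at the bottom are made up purely for illustration (no difference under the null, an alternate mean three standard errors away):

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def beta(x_null: float, x_alt: float, s: float, z_crit: float = 1.96) -> float:
    """P(measurement falls in the null acceptance region | alternate is true)."""
    # Acceptance region boundaries under the null hypothesis...
    lower = x_null - z_crit * s
    upper = x_null + z_crit * s
    # ...transformed to standard normal units around the alternate mean.
    return phi((upper - x_alt) / s) - phi((lower - x_alt) / s)

# Illustrative only: alternate mean 3 standard errors above the null mean.
print(f"beta = {beta(0.0, 3.0, 1.0):.3f}")
```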
This time, let's put some numbers in.
- n = 100,000 (100,000 per branch)
- x̄ = 0 - the null hypothesis (no difference between the branches)
- x̄_alt - the alternate hypothesis (the effect size we want to detect)
- s - this comes from combining the standard errors of both branches, so s = √(s_A² + s_B²), and I'm using the usual formula for the standard error of a proportion, for example, s_A = √(c_A(1 − c_A)/n_A)
Plugging them all in to the formula above gives us our value for β.
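To make the whole process concrete, here's the calculation end to end as a sketch with hypothetical inputs (a 5% control conversion rate and a 5.5% treatment rate; the actual rates in the test may differ):

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical inputs for illustration.
n = 100_000      # visitors per branch
c_a = 0.050      # control branch conversion rate
c_b = 0.055      # treatment branch conversion rate under the alternate

# Standard error of each proportion, combined in quadrature.
s_a = math.sqrt(c_a * (1 - c_a) / n)
s_b = math.sqrt(c_b * (1 - c_b) / n)
s = math.sqrt(s_a**2 + s_b**2)

x_null = 0.0        # difference in conversion rates under the null
x_alt = c_b - c_a   # difference in conversion rates under the alternate

# Acceptance region under the null, evaluated under the alternate.
lower, upper = x_null - 1.96 * s, x_null + 1.96 * s
beta = phi((upper - x_alt) / s) - phi((lower - x_alt) / s)

print(f"s     = {s:.6f}")
print(f"beta  = {beta:.4f}")
print(f"power = {1 - beta:.4f}")
```

With branches this large, a half-point lift is easy to detect, so β comes out very small (and the power, 1 − β, correspondingly high); shrink n and β grows quickly.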
This is too hard
This process is complex and involves lots of steps. In my view, it's too complex. It feels to me that there must be an easier way of constructing tests. Bayesian statistics holds out the hope for a simpler approach, but widespread adoption of Bayesian statistics is probably a generation or two away. We're stuck with an overly complex process using very difficult language.