What's null hypothesis testing?
In business, as in many other fields, we have to make decisions in the face of uncertainty. Does this technology improve conversion? Is the new sales process working? Is the new machine tool improving quality? Almost never are the answers to these questions absolutely certain; there will be probabilities we have to trade off to make our decision.
Null hypothesis tests are a set of techniques that enable us to reach probabilistic conclusions in an unbiased way. They provide a level playing field to decide if an effect is there or not.
Although null hypothesis tests are widely taught in statistics classes, many people who've come into data science from other disciplines aren't familiar with the core ideas. Similarly, people with business backgrounds sometimes end up evaluating A/B tests where the correct interpretation of null hypothesis tests is critical to understanding what's going on.
I’m going to explain to you what null hypothesis testing is and some of the concepts needed to implement and understand it.
What result are you testing for?
To put it simply, a null hypothesis test is a test of whether there is an effect of a certain size present or not. The null hypothesis is that there is no effect, and the alternate hypothesis is that there is an effect.
At its heart, the test is about probability and not certainty. We can’t say for sure if there is an effect or not; what we can say is how likely our results would be if there were no effect. But probabilities on their own don't make decisions, and we have to make binary go/no-go decisions, so null hypothesis tests include the idea of probability thresholds for deciding whether something is there or not.
To illustrate the use of a null hypothesis test, I’m going to use a famous example, that of the lady tasting tea.
In a research lab, there was a woman who claimed she could tell the difference between cups of tea prepared in one of two ways:
- The milk poured into the cup first and then the tea poured in
- The tea poured first and then the milk poured in.
The researcher decided to do a test of her abilities by asking her to taste multiple cups of tea and state how she thought each cup had been prepared. Of course, it’s possible she could be 100% successful by chance alone.
We can set up a null hypothesis test using these hypotheses:
- The null hypothesis is the most conservative option. Here it’s that she can’t taste the difference. More specifically, her success rate is indistinguishable from random chance.
- The alternative hypothesis is that she can tell the difference. More specifically, her success rate is significantly different from random chance.
In mathematical notation, we define:
- \( p_T \) - the proportion of cups of tea she correctly identified
- \( p_C \) - the proportion of cups of tea she would be expected to get right by chance alone (by guessing)
The hypotheses are then:
- \( H_0: p_T = p_C\)
- \( H_1: p_T \neq p_C\)
But – the hypotheses in this form aren't enough. Will we insist she has to be correct every single time? Is there some threshold we expect her to reach before we accept her claim?
The null hypothesis is the first step in setting up a statistical test, but to make it useful, we have to go a step further and set up thresholds. To do this, we have to understand different types of errors.
To make things easy, we’ll call 'milk first' a positive and 'milk second' a negative.
For our lady tasting tea, there are four possibilities:
- She can say ‘milk first’ when it was 'milk first' – a true positive
- She can say ‘milk first’ when it wasn’t 'milk first' – a false positive (also known as a Type I error)
- She can say ‘milk second’ when it was 'milk second' – a true negative
- She can say ‘milk second’ when it wasn’t 'milk second' – a false negative (also known as a Type II error)
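The decision we make about the null hypothesis falls into the same four categories, depending on whether the null hypothesis is really true and whether we reject it. This is usually expressed as a table like the one below.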
|Decision about null hypothesis|Null hypothesis is true|Null hypothesis is false|
|---|---|---|
|Fail to reject|True negative. Probability threshold = 1 - \( \alpha \)|Type II error. Probability threshold = \( \beta \)|
|Reject|Type I error. Probability threshold = \( \alpha \)|True positive. Probability threshold = Power = 1 - \( \beta \)|
We can assign probabilities to each of these outcomes. As you can see, there are two numbers that are important here, \(\alpha\) and \(\beta\); however, in practice, we consider \(\alpha\) and 1-\(\beta\) as the numbers of importance. \(\alpha\) is called significance, and 1-\(\beta\) is called power. We can set values for each of them prior to the test. By convention, \(\alpha\) is usually 0.05, and 1-\(\beta \geq \) 0.80.
Test results, test size, and p-values
Our lady could guess correctly by chance alone. We have to set up the test so a positive conclusion due to randomness is unlikely, hence the use of thresholds. The easiest way to do this is to set the test size correctly, i.e. set the number of cups of tea. Through some math I won't go into, we can use \(\alpha\), (1-\(\beta\)), and the effect size to set the sample size. The effect size, in this case, is her ability to detect how the cup of tea was prepared above and beyond what would be expected by chance. For example, we might run a test to see if she was 20% better than chance.
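To give a feel for the math I skipped, here's a minimal sketch of one common way to estimate the number of cups: the normal approximation for a test of a single proportion. The function name `cups_needed` is my own, and I'm assuming "20% better than chance" means 70% correct versus 50% by guessing. An exact binomial calculation would give a slightly different answer.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cups_needed(p_chance, p_true, alpha=0.05, power=0.80):
    """Approximate sample size for a two-tailed test of one proportion.

    Uses the normal approximation, so treat the result as a rough guide.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed significance threshold
    z_power = z.inv_cdf(power)          # threshold for the desired power
    sd_null = sqrt(p_chance * (1 - p_chance))  # spread if she is guessing
    sd_alt = sqrt(p_true * (1 - p_true))       # spread if the effect is real
    n = ((z_alpha * sd_null + z_power * sd_alt) / (p_true - p_chance)) ** 2
    return ceil(n)


# Assumed effect size: 70% correct versus 50% by guessing.
n_cups = cups_needed(p_chance=0.5, p_true=0.7)
print(f"Cups needed: {n_cups}")
```

Under these assumptions the answer comes out at several dozen cups. Asking for a smaller effect size, a smaller \(\alpha\), or a higher power all push the number up.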
To evaluate the test, we calculate a p-value from the test results. The p-value is the probability of getting a result at least as extreme as the one we observed if chance alone were at work (that is, if the null hypothesis were true). Because this is so important, I'm going to explain it again using other words. Let's imagine the lady tasting tea was guessing. By guessing alone, she could get between 0% and 100% correct. We know the probability for each percentage. We know it's very unlikely she'll get 100% or 0% by guesswork, but more likely she'll get around 50%. For the score she got, we can work out the probability of her getting this score (or higher) through chance alone. Let's say there was a 3% chance she could have gotten her score by guessing alone. Is this proof she's not guessing?
We compare the p-value to our \( \alpha\) threshold to decide whether to reject the null hypothesis. Let’s say our p-value was 0.03 and our \( \alpha \) value was 0.05. Because 0.03 < 0.05, we reject the null hypothesis. In other words, we would accept that the lady was not guessing.
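Here's a small sketch of how a p-value could be calculated for the lady tasting tea, using the binomial distribution. The numbers are hypothetical (15 correct out of 20 cups), and the helper name `prob_at_least` is mine. It assumes guessing means a 50% chance of being right on each cup.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more correct out of n, each correct with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_cups, n_correct, alpha = 20, 15, 0.05  # hypothetical test result

# Two-tailed p-value: a score this far from 50% in either direction.
# Doubling the upper tail works here because guessing (p = 0.5) is symmetric
# and the score is above half.
p_value = min(1.0, 2 * prob_at_least(n_correct, n_cups))

print(f"p-value = {p_value:.4f}")
print("reject the null" if p_value < alpha else "fail to reject the null")
```

With these made-up numbers the p-value is a little over 0.04, so at \(\alpha\) = 0.05 we would reject the null hypothesis.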
False negatives, false positives
The \(\alpha\) threshold sets the chance of us saying there's an effect when there is none (a false positive). But what about a false negative? We could say there's no effect when there really is one. That might be as damaging to a business as a false positive. The quantity \(\beta\) gives us the probability of a false negative. By convention, statisticians talk about the power (1-\(\beta\)) of a test, which is the probability of detecting an effect of the size you think is there.
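To make \(\beta\) and power concrete, here's a sketch that works them out exactly for a hypothetical test of 20 cups. I'm assuming the lady's true success rate is 70% and that guessing means 50%. The helper names are my own.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k correct out of n, each correct with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_tailed_p(k, n):
    """Two-tailed p-value under pure guessing (p = 0.5), using symmetry."""
    extreme = max(k, n - k)
    tail = sum(binom_pmf(i, n, 0.5) for i in range(extreme, n + 1))
    return min(1.0, 2 * tail)


n_cups, alpha, p_true = 20, 0.05, 0.7  # hypothetical test set-up

# Scores that would lead us to reject the null hypothesis.
rejection_region = [k for k in range(n_cups + 1) if two_tailed_p(k, n_cups) < alpha]

# Chance of rejecting when she is really guessing (false positive).
false_positive_rate = sum(binom_pmf(k, n_cups, 0.5) for k in rejection_region)

# Chance of rejecting when her true success rate is p_true (power).
power = sum(binom_pmf(k, n_cups, p_true) for k in rejection_region)
beta = 1 - power  # chance of a false negative

print(f"false positive rate = {false_positive_rate:.3f}")
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

In this set-up the power comes out well below the conventional 0.80, which tells us 20 cups is too few to reliably detect an effect of this size.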
Single tail or two-tail tests
Technically, the way the null hypothesis is set up in the case of the lady tasting tea is a two-tailed test. To ‘succeed’ she has to do a lot better than chance or she has to do a lot worse. That’s appropriate in this case because we’re trying to understand if she’s doing something other than guessing.
We could set up the test differently so she has to only be right more often than chance suggests. This would be a one-tail test. One-tail tests need fewer samples than two-tail tests for the same significance and power, but they’re more limited: they can't detect an effect in the opposite direction.
In business, we tend to do two-tailed tests rather than one-tailed tests.
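To show how the choice of tails can change the conclusion, here's a sketch using a hypothetical result of 20 correct out of 30 cups, again assuming guessing means a 50% chance per cup. The function name `upper_tail` is mine.

```python
from math import comb


def upper_tail(k, n):
    """Probability of k or more correct out of n by pure guessing (p = 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


n_cups, n_correct, alpha = 30, 20, 0.05  # hypothetical test result

# One-tail: only doing better than chance counts as evidence.
one_tail_p = upper_tail(n_correct, n_cups)

# Two-tail: doing much better or much worse than chance counts.
two_tail_p = min(1.0, 2 * one_tail_p)

print(f"one-tail p-value = {one_tail_p:.4f}")
print(f"two-tail p-value = {two_tail_p:.4f}")
```

With these numbers, the one-tail p-value is just under 0.05 and the two-tail p-value is roughly double that, so the same data would reject the null in a one-tail test and fail to reject it in a two-tail test. That's why the choice has to be made before the test, not after.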
Failing to reject the null or rejecting the null
Remember, we’re talking about probabilities and not certainties. Even if we gave our lady 100 cups to taste, there’s still a possibility she gets them all right due to chance alone. So we can’t say either the null or the alternate is true; all we can do is reject the null at some threshold, or fail to reject it. In the case of a p-value of 0.03, a statistician wouldn’t say the alternate is true (the lady can taste the difference), but they would say ‘we reject the null hypothesis’. If the p-value was 0.1, it would be higher than the \( \alpha \) value and we would ‘fail to reject the null hypothesis’. This language is complex, but statisticians are trying to capture the idea that results are about probabilities, not certainties.
Choice of significance and power
Significance and power affect test size, so maybe we should choose them to make the test short? If you want to do a valid test, you're not free to pick any values of \(\alpha\) and (1-\(\beta\)) you like. Convention dictates that you stick to these ranges:
- \(\alpha \leq 0.05\) (a confidence level of at least 95%) - anything looser than this is usually considered a junk test.
- \((1-\beta) \geq 0.8\) - anything less than this is not worth doing.
The why behind these values is the subject of another blog post.
The null hypothesis test summarized
This has been a very high-level summary of what happens in a null hypothesis test. For the sake of simplicity, there are several steps I've left out and I've greatly summarized some ideas. Here's a simple summary of the steps I've discussed.
- Decide if the test is one-tail or two-tail.
- Create a null and alternate hypothesis.
- Set values for \(\alpha\) and (1-\(\beta\)) prior to the test.
- After the test, calculate a p-value.
- Compare the p-value to \(\alpha\) to decide whether to reject the null hypothesis.
- Check \(\beta\) to figure out the probability of a false negative.
I've left out topics like the z-test and the t-test and a bunch of other important ideas.
Your takeaway should be that this process is complex and there are no shortcuts. At its heart, hypothesis testing is about deciding what's true when the data is uncertain and you need to do it without bias.
Problems with the null hypothesis test
Mathematically, there's controversy about the fundamentals of the procedure, but frankly, the controversy is too complex to discuss here - in any case, the controversy isn't over whether the procedures work or not.
A more serious problem is baked into the approach. At its heart, null hypothesis testing is about making a binary yes/no decision based on probabilistic data. The results are never certain. Unfortunately, test results are often taken as certain. For example, if we can't detect an effect in a test, it's often assumed there is no effect, but that's not true. This assumption that no detection = no effect has had tragic consequences in medical trials; there are high-profile cases where the negative side effects of a drug have been just below the threshold levels. Sadly, once the drugs have been released, the negative effects become well known, with disastrous consequences, a good example being Vioxx.
You must be aware that a test failure doesn't mean there isn't an effect. It could mean there's an effect hovering just below your acceptance threshold.
Using the null hypothesis in business
This is all a bit abstract, so let's bring it back to business. What are some examples of null hypothesis tests in the business world?
In A/B tests of things like conversion rate, we choose a two-tail test most of the time because we're interested in the possibility a change might make conversion or other metrics worse. The hypothesis test we use is usually of this form:
\(H_0 : CR_B = CR_A\)
\(H_1 : CR_B \neq CR_A\)
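where CR is the conversion rate (or another metric, such as revenue per user per branch or the add-to-bag rate).
For quality questions, like whether a new machine tool is improving quality, the tests are typically one-tailed because we're only interested in an improvement. Here, the test might be: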
\(H_0 : DR_B = DR_A\)
\(H_1 : DR_B < DR_A\)
where DR is the defect rate.
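Here's a sketch of how both business tests might be evaluated. I haven't covered the z-test in this post, so take this as an illustration only: it uses a pooled two-proportion z-test, which is one common choice and relies on a normal approximation that needs reasonably large samples. The function name and all the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(x_a, n_a, x_b, n_b, alternative="two-sided"):
    """p-value for H0: rate_B = rate_A, using a pooled two-proportion z-test."""
    rate_a, rate_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)  # common rate assumed under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    cdf = NormalDist().cdf
    if alternative == "two-sided":  # H1: rate_B != rate_A
        return 2 * (1 - cdf(abs(z)))
    if alternative == "less":       # H1: rate_B < rate_A
        return cdf(z)
    raise ValueError("alternative must be 'two-sided' or 'less'")


alpha = 0.05

# Hypothetical A/B test: conversions out of visitors in each branch.
conversion_p = two_proportion_p_value(200, 10_000, 250, 10_000)

# Hypothetical quality test: defects out of parts made on each machine.
defect_p = two_proportion_p_value(60, 2_000, 40, 2_000, alternative="less")

for name, p in [("conversion rate", conversion_p), ("defect rate", defect_p)]:
    verdict = "reject the null" if p < alpha else "fail to reject the null"
    print(f"{name}: p-value = {p:.4f}, {verdict}")
```

Both made-up examples give p-values of about 0.02, so both would reject the null at \(\alpha\) = 0.05. Note the p-value only makes sense if \(\alpha\), the power, and the sample size were fixed before the data came in.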
If all this seems a bit complex, arbitrary, and dependent on conventions, you're not alone. As it turns out, null hypothesis techniques are based on the shotgun marriage of two separate approaches to statistics. In a future blog post, I'll delve into this some more.
- You should understand that you need education and training to run these kinds of tests. A good grounding in statistics is vital.
- The results are probabilistic and not certain. A negative test doesn't mean an effect isn't there; it might just be hovering underneath the threshold of detection.