Monday, November 2, 2020

The null hypothesis test

What's null hypothesis testing?

In business, as in many other fields, we have to make decisions in the face of uncertainty. Does this technology improve conversion? Is the new sales process working? Is the new machine tool improving quality? Almost never are the answers to these questions absolutely certain; there will be probabilities we have to trade-off to make our decision.

(Two hypotheses battling it out for supremacy. Image source: Wikimedia Commons. Author: Pierdante Romei. License: Creative Commons.)

Null hypothesis tests are a set of techniques that enable us to reach probabilistic conclusions in an unbiased way. They provide a level playing field to decide if an effect is there or not.

Although null hypothesis tests are widely taught in statistics classes, many people who've come into data science from other disciplines aren't familiar with the core ideas. Similarly, people with business backgrounds sometimes end up evaluating A/B tests where the correct interpretation of null hypothesis tests is critical to understanding what's going on. 

I’m going to explain to you what null hypothesis testing is and some of the concepts needed to implement and understand it.

What result are you testing for?

To put it simply, a null hypothesis test is a test of whether there is an effect of a certain size present or not. The null hypothesis is that there is no effect, the alternate hypothesis is that there is an effect. 

At its heart, the test is about probability and not certainty. We can’t say for sure there is an effect or not, what we can say is the probability of there being an effect.  But probabilities are limited and we have to make binary go/no-go decisions - so null hypothesis tests include the idea of probability thresholds for deciding whether something is there or not.

To illustrate the use of a null hypothesis test, I’m going to use a famous example, that of the lady tasting tea. 

In a research lab, there was a woman who claimed she could tell the difference between cups of tea prepared in one of two ways:

  • The milk poured into the cup first and then the tea poured in
  • The tea poured first and then the milk poured in.
(Image source: Wikimedia Commons Artist: Ian Smith License: Creative Commons)

The researcher decided to do a test of her abilities by asking her to taste multiple cups of tea and state how she thought each cup had been prepared. Of course, it’s possible she could be 100% successful by chance alone. 

We can set up a null hypothesis test using these hypotheses:

  • The null hypothesis is the most conservative option. Here it’s that she can’t taste the difference. More specifically, her success rate is indistinguishable from random chance.
  • The alternative hypothesis is that she can tell the difference. More specifically, her success rate is significantly different from random chance.
Let's define some quantities:
  • \( p_T \) - the proportion of cups of tea she correctly got
  • \( p_C \)  - the proportion of cups of tea she would be expected to get by chance alone (by guessing)
We can write the null and alternative hypothesis as:

  • \( H_0: p_T = p_C\) 
  • \( H_1: p_T \neq p_C\) 

But – the hypotheses in this form aren't enough. Will we insist she has to be correct every single time? Is there some threshold we expect her to reach before we accept her claim?

The null hypothesis is the first step in setting up a statistical test, but to make it useful, we have to go a step further and set up thresholds. To do this, we have to understand types of errors.

Error types

To make things easy, we’ll call 'milk first' a positive and 'milk second' a negative.

For our lady testing tea, there are four possibilities:

  • She can say ‘milk first’ when it was 'milk first' – a true positive
  • She can say ‘milk first’ when it wasn’t 'milk first' – a false positive (also known as a Type I error)
  • She can say ‘milk second’ when it was 'milk second' – a true negative
  • She can say ‘milk second’ when it wasn’t 'milk second' – a false negative (also known as a Type II error)

This is usually expressed as a table like the one below.

    Null Hypothesis is
    True False
Decision about null hypothesis  Fail to reject True negative
Correct inference
Probability threshold= 1 - \( \alpha \)
False negative
Type II error
Probability threshold= \( \beta \)
Reject False positive
Type I error
Probability threshold = \( \alpha \)
True positive
Correct inference
Probability threshold = Power = 1 - \( \beta \)

We can assign probabilities to each of these outcomes. As you can see, there are two numbers that are important here, \(\alpha\) and \(\beta\); however, in practice, we consider \(\alpha\) and 1-\(\beta\) as the numbers of importance. \(\alpha\) is called significance, and 1-\(\beta\) is called power. We can set values for each of them prior to the test. By convention, \(\alpha\) is usually 0.05, and 1-\(\beta \geq \) 0.80.

Test results, test size, and p-values

Our lady could guess correctly by chance alone. We have to set up the test so a positive conclusion due to randomness is unlikely, hence the use of thresholds. The easiest way to do this is to set the test size correctly, i.e. set the number of cups of tea. Through some math I won't go into, we can use \(\alpha\), (1-\(\beta\)), and the effect size to set the sample size. The effect size, in this case, is her ability to detect how the cup of tea was prepared above and beyond what would be expected by chance. For example, we might run a test to see if she was 20% better than chance.

To evaluate the test, we calculate a p-value from the test results. The p-value is the probability the test result was due to chance alone. Because this is so important, I'm going to explain it again using other words. Let's imagine the lady tasting tea was guessing. By guessing alone, she could get between 0% and 100% correct. We know the probability for each percentage. We know it's very unlikely she'll get 100% or 0% by guesswork, but more likely she'll get 50%. For the score she got, we can work out a probability of her getting this score (or higher) through chance alone. Let's say there was a 3% chance she could have got her score by guessing alone. Is this proof she's not guessing?

We compare the p-value to our \( \alpha\) threshold to decide which hypothesis is wrong. Let’s say our p-value was 0.03 and our \( \alpha \) value was 0.05, because 0.03 < 0.05 we reject the null hypothesis. In other words, we would accept that the lady was not guessing.

False negatives, false positives

Using \(\alpha\) and a p-value, we can work out the chance of us saying there's an effect when there is none (a false positive). But what about a false negative? We could say there's no effect when there really is one. That might be as damaging to a business as a false positive. The quantity \(\beta\) gives us the probability of a false negative. By convention, statisticians talk about the power (1-\(\beta\)) of a test which is the probability of detecting an effect of the size you think is there.

Single tail or two-tail tests

Technically, the way the null hypothesis is set up in the case of the lady tasting tea is a two-tailed test. To ‘succeed ’ she has to do a lot better than chance or she has to do a lot worse. That’s appropriate in this case because we’re trying to understand if she’s doing something else other than guessing.

We could set up the test differently so she has to only be right more often than chance suggests. This would be a one-tail test. One-tail tests are shorter than two-tail tests, but they’re more limited. In business, we tend to do two-tailed tests rather than one-tailed tests.

Fail to reject the null or rejecting the null

Remember, we’re talking about probabilities and not certainties. Even if we gave our lady 100 cups to taste, there’s still a possibility she gets them all right due to chance alone. So we can’t say either the null or the alternate is true, all we can do is reject them at some threshold, or fail to reject them. In the case of a p-value of 0.03, a statistician wouldn’t say the alternate is true (the lady can taste the difference), but they would say ‘we reject the null hypothesis’. If the p-value was 0.1, it would be higher than the \( \alpha \) value and we would ‘fail to reject the null hypothesis’. This language is complex, but statisticians are trying to capture the idea that results are about probabilities, not certainties.

Choice of significance and power

Significance and power affect test size, so maybe we should choose them to make the test short? If you want to do a valid test, you're not free to choose any values of \(\alpha\) and (1-\(\beta\)) you choose. Convention dictates that you stick to these ranges:

  • \(\alpha \geq 0.95\) - anything less than this is usually considered a junk test.
  • (1-\(\beta) \geq 0.8\) - anything less than this is not worth doing. 

The why behind these values is the subject of another blog post.

The null hypothesis test summarized

This has been a very high-level summary of what happens in a null hypothesis test, for the sake of simplicity there are several steps I've left out and I've greatly summarized some ideas. Here's a simple summary of the steps I've discussed.

  1. Decide if the test is one-tail or two-tail.
  2. Create a null and alternate hypothesis.
  3. Set values for \(\alpha\) and (1-\(\beta\)) prior to the test.
  4. After the test, calculate a p-value.
  5. Compare the p-value to \(\alpha\) to figure out a false positive probability
  6. Check \(\beta\) to figure out the probability of a false negative.

I've left out topics like the z-test and the t-test and a bunch of other important ideas. 

Your takeaway should be that this process is complex and there are no shortcuts. At its heart, hypothesis testing is about deciding what's true when the data is uncertain and you need to do it without bias.


(Justice is supposed to be blind and balanced - like a null hypothesis test. Image source: Wikimedia Commons. License:  GNU Free Documentation License.)

Problems with the null hypothesis test

Mathematically, there's controversy about the fundamentals of the procedure, but frankly, the controversy is too complex to discuss here - in any case, the controversy isn't over whether the procedures work or not.

A more serious problem is baked into the approach. At its heart, null hypothesis testing is about making a binary yes/no decision based on probabilistic data. The results are never certain. Unfortunately, test results are often taken as certain. For example, if we can't detect an effect in a test, it's often assumed there is no effect, but that's not true. This assumption that no detection = no effect has had tragic consequences in medical trials; there are high profile cases where negative side effects of a drug have been just below the threshold levels. Sadly, once the drugs have been released, the negative effects become well know with disastrous consequences, a good example being Vioxx.

You must be aware that a test failure doesn't mean there isn't an effect. It could mean there's an effect hovering just below your acceptance threshold.

Using the null hypothesis in business

This is all a bit abstract, so let's bring it back to business. What are some examples of null hypothesis tests in the business world?

A/B testing

Most of the time, we choose a two-tail test because we're interested in the possibility a change might make conversion or other metrics worse. The hypothesis test we use is usually of this form:

\(H_0 : CR_B = CR_A\)

\(H_1 : CR_B \neq CR_A\)

where CR is the conversion rate, or revenue per user per branch, or add to bag etc.

Manufacturing defects

Typically, these tests are one-tailed because we're only interested in an improvement. Here, the test might be:

\(H_0 : DR_B = DR_A\)

\(H_1 : DR_B < DR_A\)

where DR is the defect rate.

Closing thoughts

If all this seems a bit complex, arbitrary, and dependent on conventions, you're not alone. As it turns out, null hypothesis techniques are based on the shotgun marriage of two separate approaches to statistics. In a future blog post, I'll delve into this some more. 

For now, here's what you should take away:
  • You should understand that you need education and training to run these kinds of tests. A good grounding in statistics is vital.
  • The results are probabilistic and not certain. A negative test doesn't mean an effect isn't there, it might just be hovering underneath the threshold of detection.

Reading more

2 comments:

  1. Hi Mike!

    Here is a simple way to grasp the meaning of the null hypothesis.

    Specifying a hypothesis as "the null hypothesis" is a way to specify that it is supported by lots of "prior data" (data from past experiments that are not directly considered in the present test).

    Example: You have a medication that has been tested on 1 million patients over many years, and it is effective with probability 0.99. You do a small test on 100 people, and the drug seems ineffective 10% of the time. Would you reject the hypothesis that the drug is 99% effective? Of course not. (You should first try to replicate your result and identify biases in your test.)

    More precisely,

    if prob(ineffective in your test) > alpha
    then retain the drug
    else reject the drug,

    where alpha = your estimate of probability of rejecting H0 when H0 is true.

    The most obvious weakness of this method is that your estimate of alpha does not quantify how much prior testing you rely on. Some methods, including Bayesian methods, count the number of prior trials and combine that number with the number of trials in your test.

    ReplyDelete
    Replies
    1. Thanks for your comment. I'm going to talk about Bayesian methods later. To be honest, I'm not a huge fan of null hypothesis testing; it feels overly complex and reliant on magic numbers. It does work, but... To me, it feels like going to a bad restaurant, yes, the meal fills you up, but you can't help but think another restaurant would have been tastier.

      Delete