# What are we trying to achieve?

*probable* answer. We use statistical best practices to map a probability to a pass/fail answer.

# A typical A/B test

*average* conversion rate over *many* visitors, probably several thousand. Because of this, some very important mathematics kicks in, specifically the Central Limit Theorem. This theorem tells us our results will be normally distributed; in other words, \(c_T - c_C\) will be normally distributed, which is important as we'll see in a minute.
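We can see the Central Limit Theorem at work with a quick simulation. This is a minimal sketch with made-up numbers: a 10% control conversion rate, a 12% treatment conversion rate, and 5,000 visitors per branch are all illustrative assumptions, not figures from a real test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: true conversion rates and visitors per branch.
p_control, p_treatment, n = 0.10, 0.12, 5_000

# Simulate 10,000 repeats of the same A/B test and record c_T - c_C each time.
c_C = rng.binomial(n, p_control, size=10_000) / n    # control conversion rates
c_T = rng.binomial(n, p_treatment, size=10_000) / n  # treatment conversion rates
diffs = c_T - c_C

# The Central Limit Theorem says these differences pile up in a bell curve
# around the true difference (0.02).
print(f"mean difference: {diffs.mean():.4f}")
print(f"std of difference: {diffs.std():.4f}")
```

Plotting a histogram of `diffs` would show the familiar bell shape, even though each individual visitor's conversion is a yes/no coin flip, not a normal variable.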

# Types of error

| Decision about null hypothesis | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Fail to reject | True negative (correct inference). Probability threshold = 1 - \( \alpha \) | False negative (Type II error). Probability threshold = \( \beta \) |
| Reject | False positive (Type I error). Probability threshold = \( \alpha \) | True positive (correct inference). Probability threshold = power = 1 - \( \beta \) |

# Assuming the null is true

*is* biased. With statistics, we quantify this kind of analysis and set ground rules for what we consider evidence.

*But we might be wrong - we can never have certainty*. The size of the red area gives us the limits on our certainty. By convention, the red zone is 5% of the probability mass.
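In practice, that 5% red zone translates into a critical value for the standardized difference. A minimal sketch, using Python's standard library (the \( \alpha = 0.05 \) two-sided convention is the only input):

```python
from statistics import NormalDist

alpha = 0.05  # the conventional 5% red zone, split between both tails

# Under the null, the standardized difference z = (c_T - c_C) / SE follows
# a standard normal distribution. The red zone is its outer 5%.
critical = NormalDist().inv_cdf(1 - alpha / 2)
print(f"reject the null when |z| > {critical:.2f}")  # ≈ 1.96
```

This is where the well-known "1.96 standard errors" rule of thumb comes from.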

# Assuming the alternate is true

*lot* more to say about power in another blog post.

# Putting it together

- Note the number of samples in each branch, in this case, the number of samples is the number of website visitors.
- Work out the conversion rate for the two branches and work out \( c_T - c_C \).
- Work out the probability of observing \( c_T - c_C \) if the null is true. (This is a simplification: in practice we work out a p-value, the probability of observing a difference at least as extreme as the one we measured, assuming the null is true.)
- Compare the p-value to \(\alpha\). If \(p < \alpha\), we reject the null hypothesis (we believe the treatment had an effect). If \(p \geq \alpha\), we fail to reject the null hypothesis (we don't have evidence the treatment had an effect).
- Work out the probability of detecting a difference as large as the observed \( c_T - c_C \) if the alternate is true. This is the observed power, which should be greater than about 80%. An observed power much below 80% suggests the test is underpowered and its result is unreliable.
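The steps above can be sketched end to end as a two-proportion z-test. This is a minimal illustration, not production code: the conversion counts and sample sizes are made up, and the observed-power calculation uses the standard post-hoc approximation (treating the observed difference as the true one).

```python
from math import sqrt
from statistics import NormalDist

def ab_test(conv_C, n_C, conv_T, n_T, alpha=0.05):
    """Two-proportion z-test sketch. conv_* are raw conversion counts,
    n_* are visitor counts per branch."""
    c_C, c_T = conv_C / n_C, conv_T / n_T  # conversion rates
    diff = c_T - c_C

    # Standard error under the null, using the pooled conversion rate.
    pooled = (conv_C + conv_T) / (n_C + n_T)
    se = sqrt(pooled * (1 - pooled) * (1 / n_C + 1 / n_T))

    # p-value: probability of a difference at least this extreme
    # if the null is true (two-sided).
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Observed power: probability of landing in the rejection region
    # if the true difference equals the observed one.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    alt = NormalDist(diff, se)
    power = (1 - alt.cdf(z_crit * se)) + alt.cdf(-z_crit * se)

    return diff, p_value, power

# Illustrative numbers: 500/5000 control conversions vs 600/5000 treatment.
diff, p, power = ab_test(conv_C=500, n_C=5000, conv_T=600, n_T=5000)
print(f"difference: {diff:.3f}, p-value: {p:.4f}, power: {power:.2f}")
```

With these made-up numbers the p-value comes out well below \( \alpha = 0.05 \) and the observed power above 80%, so we would reject the null and trust the result.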

# How to fail

# Why not just set the thresholds higher?

# Where do these thresholds come from?

*et al.* [Benjamin] argued passionately that 99.5% is a better threshold.

# Eye of newt and toe of frog...

*is* a witches' brew; it works, but it's not satisfying.