All the executives laughed

A few years ago, I was at an industry event. The speaker was an executive talking about his A/B testing program. He joked that vendors and his team were unreliable because the overall result was less than the sum of the individual tests. Everyone laughed knowingly.

But we shouldn't have laughed.

The statistics are clear and he should have known better. By the rules of the statistical game, the benefits of an A/B program will be less than the sum of the parts and I'm going to tell you why.

Thresholds and testing

An individual A/B test is a null hypothesis test with thresholds that decide the result of the test. We don't know whether there is an effect or not; we're making a decision based on probability. There are two important threshold numbers:

\(\alpha\) - also known as the significance level and usually set around 5%. If there really is no effect, \(\alpha\) is the probability we will say there is an effect. In other words, it's the false positive rate (Type I errors).
\(\beta\) - usually set around 20%. If there really is an effect, \(\beta\) is the probability we will say there is no effect. In other words, it's the false negative rate (Type II errors). In practice, power is used instead of \(\beta\). Power is \(1-\beta\), so it's usual to set the power to 80%.

Standard statistical practice focuses on just a single test, but an organization's choice of \(\alpha\) and \(\beta\) affects the entire test program.

\(\alpha\), \(\beta\) and the test program

To see how the choice of \(\alpha\) and \(\beta\) affects the entire test program, let's run a simplified thought experiment. Imagine we choose \(\alpha = 5\%\) and \(\beta = 20\%\), which are standard settings in most organizations. Now imagine we run 1,000 tests: in 100 of them there's a real effect and in 900 of them there's no effect. Of course, we don't know which tests have an effect and which don't.

Take a second to think about these questions before moving on:

How many positive test results will we measure?
How many false positives will we see?
How many true positives will we see?

At this stage, you should have numbers in mind. I'm asking you to do this so you understand the importance of what happens next.

The logic to answer these questions is straightforward. In the picture below, I've shown how it works, but I'll talk you through it so you can understand it in more detail.

Of the 1,000 tests, 100 have a real effect. These are the tests that \(\beta\) applies to and \(\beta=20\%\), so we'll end up with:

20 false negatives, 80 true positives

Of the 1,000 tests, 900 have no effect. These are the tests that \(\alpha\) applies to and \(\alpha=5\%\), so we'll end up with:

855 true negatives, 45 false positives

Overall we'll measure:

125 positives made up of
80 true positives
45 false positives

Crucially, we won't know which of the 125 positives are true and which are false.

Because this is so important, I'm going to lay it out again: in this example, 36% of all the test results we thought were positive (45 of the 125) are wrong, but we don't know which ones they are. These false positives contribute no real improvement, so they dilute the results of the program as a whole. The overall results of the test program will be less than the sum of the individual test results.
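If you want to check the arithmetic yourself, here's a minimal sketch in Python. The function name `expected_outcomes` and its return keys are my own labels for illustration, and the numbers it returns are long-run expected counts, not the result of any real test program.

```python
def expected_outcomes(n_tests, n_real, alpha, power):
    """Expected counts for a test program (long-run averages, not a simulation).

    n_tests: total number of tests run
    n_real:  number of tests where there really is an effect
    alpha:   false positive rate
    power:   1 - beta, the chance of detecting a real effect
    """
    n_null = n_tests - n_real          # tests with no real effect
    true_pos = n_real * power          # real effects we detect
    false_neg = n_real - true_pos      # real effects we miss
    false_pos = n_null * alpha         # no effect, but we say there is one
    true_neg = n_null - false_pos      # no effect, and we say so
    positives = true_pos + false_pos   # everything we will call a "win"
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": false_pos,
        "true_neg": true_neg,
        "positives": positives,
        "false_share": false_pos / positives,
    }


# The thought experiment: 1,000 tests, 100 real effects, alpha 5%, power 80%.
result = expected_outcomes(n_tests=1000, n_real=100, alpha=0.05, power=0.80)
print(f"Positives measured: {result['positives']:.0f}")
print(f"True positives:     {result['true_pos']:.0f}")
print(f"False positives:    {result['false_pos']:.0f}")
print(f"Share of positives that are false: {result['false_share']:.0%}")
```

Changing `n_real` is the quickest way to see how sensitive the 36% figure is to the assumption that 100 of the 1,000 tests have a real effect.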

What happens in reality

In reality, you don't know what proportion of your tests have a real effect. It might be 10%, or 20%, or even 5%. Of course, the reason for running a test is that you don't know the result. This makes it hard to do the calculation on real data, but the fact that you can't easily do the calculation doesn't mean the limits don't apply.
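Even though you can't know the real proportion, you can see how much it matters by trying a few values. The sketch below is an illustration only: `false_positive_share` is a hypothetical helper name, and the formula assumes every test uses the same \(\alpha\) and the same power. If a fraction `prior` of tests have a real effect, the expected share of measured positives that are false is \(\alpha(1-\text{prior})\) divided by \(\alpha(1-\text{prior}) + \text{power}\times\text{prior}\).

```python
def false_positive_share(prior, alpha=0.05, power=0.80):
    """Expected share of measured positives that are false positives.

    prior: assumed fraction of tests that have a real effect (unknown in practice)
    alpha: false positive rate
    power: 1 - beta
    """
    false_pos = alpha * (1 - prior)   # fraction of all tests that are false positives
    true_pos = power * prior          # fraction of all tests that are true positives
    return false_pos / (false_pos + true_pos)


# Try the three proportions mentioned above: 5%, 10%, and 20%.
shares = {prior: false_positive_share(prior) for prior in (0.05, 0.10, 0.20)}
for prior, share in shares.items():
    print(f"{prior:.0%} of tests have a real effect -> "
          f"{share:.0%} of positives are false")
```

The fewer of your tests that have a real effect, the larger the share of your "wins" that are false, which is why the proportion matters even though you can't measure it directly.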

Can you make things better?

To get a higher proportion of true positives, you can do at least three things.

Run fewer tests - selecting only tests where you have a good reason to believe there is a real effect. This would certainly work, but you would forgo a lot of the benefits of a testing program.
Run with a lower \(\alpha\) value. There's a huge debate in the scientific community about significance levels. Many authors are pushing for a 0.5% level instead of a 5% level. So why don't you just lower \(\alpha\)? Because the sample size will increase greatly.
Run with a higher power (lower \(\beta\)). Using a power of 80% is "industry standard", but it shouldn't be - in another blog post I'll explain why. People don't do it because of test duration: increasing the power increases the sample size.
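To get a feel for the sample size cost of the last two options, here's a rough sketch. It uses a common normal-approximation formula for a two-sided test comparing two conversion rates, which is my assumption, not necessarily what your testing tool does. The function name `sample_size_per_arm` and the 10% and 11% conversion rates are made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided two-proportion test.

    Normal approximation:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical test: 10% baseline conversion, hoping to detect a lift to 11%.
baseline = sample_size_per_arm(0.10, 0.11, alpha=0.05, power=0.80)
lower_alpha = sample_size_per_arm(0.10, 0.11, alpha=0.005, power=0.80)
higher_power = sample_size_per_arm(0.10, 0.11, alpha=0.05, power=0.95)

print(f"alpha 5%,   power 80%: {baseline:,} per arm")
print(f"alpha 0.5%, power 80%: {lower_alpha:,} per arm")
print(f"alpha 5%,   power 95%: {higher_power:,} per arm")
```

Under this approximation, moving from a 5% to a 0.5% significance level at 80% power increases the required sample size by roughly 1.7 times, whatever the conversion rates are. That's the price of fewer false positives.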

Are there other ways to get results? Maybe, but none that are simple. Everything I've spoken about so far uses a frequentist approach. Bayesian testing offers the possibility of smaller test sizes, meaning you could increase power and reduce \(\alpha\) while still maintaining workable sample sizes. Of course, A/B testing isn't the only testing method available and other methods offer higher power with lower sample sizes.

No such thing as a free lunch

Like any discipline, statistical testing comes with its own rules and logic. There are trade-offs to be made and everything comes with a price. Yes, you can get great results from A/B testing programs, and yes, companies have increased conversion, etc. using them, but all of them invested in the right people and technical resources to get there, and all of them know the trade-offs. There's no such thing as a free lunch in statistical testing.

Engora Data Blog

Sunday, May 23, 2021

Why A/B tests don't add up