Showing posts with label a/b testing. Show all posts

Friday, February 18, 2022

RCT bingo!

A vocabulary of causal inference testing

I was having a clear-out and I came across a printout of some notes I made a while back. It was a list of terms used in causal inference testing. At the time, I used it as a checklist or dictionary to ensure I knew what I was talking about - a kind of RCT bingo if you like.

(Myriam Thomas, CC BY-SA 4.0, via Wikimedia Commons)

I thought I would post it here in case anyone wants to play the same game. Do you know what all these terms mean? Are there key terms I've missed off my list?

  • ATE - Average Treatment Effect
  • CATE - Conditional Average Treatment Effect
  • Counterfactual
  • DAG - Directed Acyclic Graph
  • Dynamic Treatment Effect
  • Epsilon greedy
  • Estimands
  • External and internal validity
  • Heterogeneity (treatment effect heterogeneity) 
  • Homophily
  • Instrumental Variable (IV)
  • LATE - Local Average Treatment Effect
  • Logit model
  • RCT - Randomized Controlled Trial
  • Regret
  • Salience
  • Spillover
  • Stationary effect (and its opposite, non-stationary effect)
  • Surrogate
  • SUTVA - Stable Unit Treatment Value Assumption
  • Thompson sampling
  • Treatment effect heterogeneity
  • Wald estimator

Sunday, May 23, 2021

Why A/B tests don't add up

All the executives laughed

A few years ago, I was at an industry event. The speaker was an executive talking about his A/B testing program. He joked that vendors and his team were unreliable because the overall result was less than the sum of the individual tests. Everyone laughed knowingly.

But we shouldn't have laughed.

The statistics are clear and he should have known better. By the rules of the statistical game, the benefits of an A/B program will be less than the sum of the parts and I'm going to tell you why.

Thresholds and testing

An individual A/B test is a null hypothesis test with thresholds that decide the result of the test. We don't know whether there is an effect or not; we're making a decision based on probability. There are two important threshold numbers:

  • \(\alpha\) - also known as significance and usually set around 5%. If there really is no effect, \(\alpha\) is the probability we will say there is an effect. In other words, it's the false positive rate (Type I errors).
  • \(\beta\) - is usually set around 20%. If there really is an effect, \(\beta\) is the probability we will say there is no effect. In other words, it's the false negative rate (Type II errors). In practice, power is used instead of \(\beta\). Power is \(1-\beta\), so it's usual to set the power to 80%.

Standard statistical practice focuses on just a single test, but an organization's choice of \(\alpha\) and \(\beta\) affects the entire test program.

\(\alpha\), \(\beta\) and the test program

To see how the choice of \(\alpha\) and \(\beta\) affects the entire test program, let's run a simplified thought experiment. Imagine we choose \(\alpha = 5\%\) and \(\beta = 20\%\), which are standard settings in most organizations. Now imagine we run 1,000 tests: in 100 of them there's a real effect and in 900 of them there's no effect. Of course, we don't know which tests have an effect and which don't.

Take a second to think about these questions before moving on:

  • How many positive test results will we measure?
  • How many false positives will we see?
  • How many true positives will we see?

At this stage, you should have numbers in mind. I'm asking you to do this so you understand the importance of what happens next.

The logic to answer these questions is straightforward. In the picture below, I've shown how it works, but I'll talk you through it so you can understand it in more detail.

Of the 1,000 tests, 100 have a real effect. These are the tests that \(\beta\) applies to and \(\beta=20\%\), so we'll end up with:

  • 20 false negatives, 80 true positives

Of the 1,000 tests, 900 have no effect. These are the tests that \(\alpha\) applies to and \(\alpha=5\%\), so we'll end up with:

  • 855 true negatives, 45 false positives

Overall we'll measure:

  • 125 positives, made up of 80 true positives and 45 false positives

Crucially, we won't know which of the 125 positives are true and which are false.

Because this is so important, I'm going to lay it out again: in this example, 36% of all test results we thought were positive (45 of the 125) are wrong, but we don't know which ones they are. These false positives dilute the results of the program, so the overall result of the test program will be less than the sum of the individual test results.
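The arithmetic in this thought experiment is easy to check in code. Here's a minimal Python sketch; the function name `program_outcomes` and its parameter names are illustrative choices rather than anything from a library, and the inputs are the thought-experiment numbers above, not real data.

```python
def program_outcomes(n_tests, n_real, alpha, beta):
    """Expected outcome counts for a test program (illustrative helper)."""
    n_null = n_tests - n_real          # tests where there is no real effect
    true_pos = n_real * (1 - beta)     # real effects we detect
    false_neg = n_real * beta          # real effects we miss
    false_pos = n_null * alpha         # non-effects we wrongly call positive
    true_neg = n_null * (1 - alpha)    # non-effects we correctly dismiss
    positives = true_pos + false_pos   # everything we report as a positive
    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "false_positives": false_pos,
        "true_negatives": true_neg,
        "positives": positives,
        "false_share_of_positives": false_pos / positives,
    }


# Thought-experiment values: 1,000 tests, 100 with a real effect.
outcomes = program_outcomes(n_tests=1000, n_real=100, alpha=0.05, beta=0.20)
for name, value in outcomes.items():
    print(f"{name}: {value:.2f}")
```

The last line printed is the share of measured positives that are false, which is the 36% figure (45 out of 125).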

What happens in reality

In reality, you don't know what proportion of test results are 'true'. It might be 10%, or 20%, or even 5%. Of course, the reason for the test is that you don't know the result. What this means is, it's hard to do this calculation on real data, but the fact that you can't easily do the calculation doesn't mean the limits don't apply.

Can you make things better?

To get a higher proportion of true positives, you can do at least three things.

  • Run fewer tests - selecting only tests where you have a good reason to believe there is a real effect. This would certainly work, but you would forgo a lot of the benefits of a testing program.
  • Run with a lower \(\alpha\) value. There's a huge debate in the scientific community about significance levels. Many authors are pushing for a 0.5% level instead of a 5% level. So why don't you just lower \(\alpha\)? Because the sample size will increase greatly.
  • Run with a higher power (lower \(\beta\)). Using a power of 80% is "industry standard", but it shouldn't be - in another blog post I'll explain why. The reason people don't do it is because of test duration - increasing the power increases the sample size.
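To get a feel for how much each of these three levers helps, here's a small Python sketch that recomputes the share of positives that are false under different settings. The function name `false_share_of_positives` is illustrative, and the scenario values (for example, 30% of tests having a real effect) are assumptions I've picked for illustration, not measurements.

```python
def false_share_of_positives(prior, alpha, power):
    """Share of measured positives that are false positives.

    prior: assumed fraction of tests with a real effect
    alpha: significance level (false positive rate)
    power: 1 - beta (true positive rate)
    """
    false_pos = (1 - prior) * alpha
    true_pos = prior * power
    return false_pos / (false_pos + true_pos)


# Hypothetical scenarios: (prior, alpha, power)
scenarios = {
    "baseline (10% real, alpha 5%, power 80%)": (0.10, 0.05, 0.80),
    "better-chosen tests (30% real)": (0.30, 0.05, 0.80),
    "lower alpha (0.5%)": (0.10, 0.005, 0.80),
    "higher power (95%)": (0.10, 0.05, 0.95),
}

shares = {label: false_share_of_positives(*args) for label, args in scenarios.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%} of positives are false")
```

Under these assumed numbers, the false share goes from 36% at baseline to roughly 13% with better-chosen tests, 5% with the lower \(\alpha\), and 32% with higher power. The sketch says nothing about the cost side: the sample size needed to reach each setting.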

Are there other ways to get results? Maybe, but none that are simple. Everything I've spoken about so far uses a frequentist approach. Bayesian testing offers the possibility of smaller test sizes, meaning you could increase power and reduce \(\alpha\) while still maintaining workable sample sizes. Of course, A/B testing isn't the only testing method available and other methods offer higher power with lower sample sizes.

No such thing as a free lunch 

Like any discipline, statistical testing comes with its own rules and logic. There are trade-offs to be made and everything comes with a price. Yes, you can get great results from A/B testing programs, and yes companies have increased conversion, etc. using them, but all of them invested in the right people and technical resources to get there and all of them know the trade-offs. There's no such thing as a free lunch in statistical testing.

Sunday, April 18, 2021

A/B testing basics: ways of being right and wrong (frequentist version)

What are we trying to achieve?

In a typical A/B test, we're trying to find out if a change has a (positive) effect. For example, does changing the page layout increase the clickthrough rate? Despite what you've been told, we can't answer these types of questions with absolute certainty: the best we can do is provide a probable answer.  We use statistical best practices to map a probability to a pass/fail answer. 

In this blog post, I'm going to lay out some fundamentals to help you understand the process a statistician follows to translate a probabilistic result into a pass/fail result. 

A typical A/B test

To provide some focus for discussion, let's imagine we're testing to see if a discount on a website increases the rate of purchase. We'll have a control branch that doesn't have the discount and a treatment branch that has the discount. We'll measure conversion for both branches: \(c_T\) for the conversion for the treatment branch and \(c_C\) for the conversion for the control branch.

This kind of test is called a null hypothesis test. The null hypothesis here is that there is no difference, the alternate hypothesis is that there is a difference. We can write this as:
\[H_0: c_T  - c_C = 0\]
\[H_1: c_T - c_C \neq 0\]
There's something subtle here you need to know. The conversion rate we measure is an average conversion rate over many visitors, probably several thousand. Because of this, some very important mathematics kicks in, specifically something called the Central Limit Theorem. This theorem tells us our results will be approximately normally distributed; in other words, \(c_T - c_C\) will be approximately normally distributed, which is important as we'll see in a minute.
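If you want to see the Central Limit Theorem at work, a quick simulation helps. The Python sketch below is only an illustration under assumed numbers: it simulates 1,000 A/B tests in which both branches share the same (hypothetical) 10% conversion rate, then checks what share of the measured differences falls within one standard error of zero. For a normal distribution, that share should be about 68%.

```python
import random
from math import sqrt


def simulate_differences(n_experiments, n_visitors, rate, seed=0):
    """Simulate c_T - c_C for many A/B tests with no real difference."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(n_experiments):
        # Each visitor converts independently with probability `rate`.
        conv_t = sum(rng.random() < rate for _ in range(n_visitors))
        conv_c = sum(rng.random() < rate for _ in range(n_visitors))
        diffs.append(conv_t / n_visitors - conv_c / n_visitors)
    return diffs


# Assumed values for illustration only.
n_visitors, rate = 1000, 0.10
diffs = simulate_differences(n_experiments=1000, n_visitors=n_visitors, rate=rate)

# Standard error of the difference between two equal-sized branches.
std_error = sqrt(2 * rate * (1 - rate) / n_visitors)
share_within_one_se = sum(abs(d) <= std_error for d in diffs) / len(diffs)
mean_diff = sum(diffs) / len(diffs)

print(f"mean difference: {mean_diff:.4f}")
print(f"share within one standard error: {share_within_one_se:.2f}")
```

The share won't be exactly 0.68 because the simulation is random and conversion counts are whole numbers, but it should land close to it.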

Types of error

I've blogged about null hypothesis tests before, so I'm only going to summarize things here. We can assume there's some underlying truth: either \(H_0\) or \(H_1\) is true. We don't know which is true and we're making an educated true/false guess. This gives us two ways of being right and two ways of being wrong. I've shown this in the table below.

| Decision about null hypothesis | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Fail to reject | True negative. Correct inference. Probability threshold = \(1 - \alpha\) | False negative. Type II error. Probability threshold = \(\beta\) |
| Reject | False positive. Type I error. Probability threshold = \(\alpha\) | True positive. Correct inference. Probability threshold = Power = \(1 - \beta\) |

We can't know for certain what the truth is, but we can define limits on our uncertainty. We can also define thresholds that will enable us to make reasonable pass/fail estimates. I'll show you how this works.

Assuming the null is true

The first step is to assume the null hypothesis is true, which means \( c_T - c_C = 0\). As I explained earlier, the quantity \(c_T - c_C\) is normally distributed (this is a probability distribution, which I've blogged about before). We can compare our actual measurement of \( c_T - c_C\) to the theoretical distribution and ask how likely it is that we'd see a measurement like ours if the underlying value really is zero (in other words, what's the probability of a result at least this extreme when the null is true?).

Let me take a second to explain this some more. Imagine I'm trying to find out if a coin is biased. I throw it ten times and see six heads. Does this prove the coin is biased? No. It could be biased, but I don't have enough throws to say. Now imagine I've thrown the coin 100,000 times and I see 60,000 heads, does this prove bias? It's not absolutely sure, but it's highly likely the coin is biased. With statistics, we quantify this kind of analysis and set ground rules for what we consider evidence.

We can take our hypothetical A/B test and map the expected result to a standard normal distribution (very easy to do). Let's look at the standard normal distribution below, which plots a probability vs. a measurement value \(z\). Although it's true that all values are possible, the likelihood of some of them occurring is very low. For example, the probability of measuring a \(z\) value in the range \(-1  \leq z \leq 1\) is 0.68, but the probability of measuring a \(z\) value in the range \(1  \leq z \leq 3\) is only 0.16.
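The two probabilities quoted for the standard normal distribution can be checked with Python's standard library. This is a minimal sketch using `statistics.NormalDist` (available from Python 3.8); the helper name `probability_between` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def probability_between(low, high):
    """Probability that a standard normal z falls between low and high."""
    return standard_normal.cdf(high) - standard_normal.cdf(low)


p_within_one = probability_between(-1, 1)
p_one_to_three = probability_between(1, 3)

print(f"P(-1 <= z <= 1) = {p_within_one:.2f}")
print(f"P( 1 <= z <= 3) = {p_one_to_three:.2f}")
```

Rounded to two decimal places, these come out as 0.68 and 0.16.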



Certainty is impossible, but what we want to do is say whether a measurement means the null hypothesis is true or the alternate is. Put another way, for a given measurement, how likely is it that the null is true or not? What's our threshold for acceptance/rejection? The standard procedure is to compare our measurement to the chart above. If our measurement falls in the blue zone on the chart, we'll consider it to mean the null hypothesis is true. If it falls in the red zone, we'll consider the alternate to be true. But we might be wrong - we can never have certainty. The size of the red area gives us the limits on our certainty. By convention, the red zones are 5% of the probability.

The standard limits we use are that we have to be in the 95% probability (blue) zone around zero to accept the null, and in the red 5% area to accept the alternate. This 5% threshold is usually called significance level and is given the symbol \(\alpha\). 

Using a threshold of 5% crudely speaking means that when there really is no effect, we'll be wrong 5% of the time. Let's imagine a company running 100 tests in a year where none of the changes has a real effect: this threshold means they'll wrongly declare an effect in about 5 cases.

Surely this is enough? Surely we can now do this calculation and use \(\alpha\) to say pass/fail? No. We have assumed the null is true. But we also need to do the opposite and assume the alternate is true. 

Assuming the alternate is true

Now, we assume the alternate is true, that \( c_T - c_C \neq 0\). We can plot this out as a normal distribution too, but there's a difference. When we considered the null hypothesis to be true, we considered both sides of the normal curve, but here we only care about one side of the distribution. Remember, we're looking at the difference \( c_T - c_C \), so one side of the curve 'points' towards zero (no difference), and the other side points towards a bigger difference. We only care about the side that 'points' towards zero.

If there really is a difference, we expect a probability distribution like this below. We'll consider the alternate hypothesis to be true if our measurement lands in the blue zone; if it lands in the red zone, we'll reject the alternate. As before, the alternate could be true, and by chance, we could land in the red zone. The threshold value we'll use here is called \(\beta\). 


For reasons I won't go into, the threshold value is called the power of a test and is given by \(1-\beta\). Typical values of power range from 80% to 95%, but 80% is considered a minimum threshold. I'll have a lot more to say about power in another blog post.
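To make power concrete, here's a rough Python sketch that estimates it for a two-sided test on conversion rates using the normal approximation. Everything specific in it is an assumption for illustration: the function name `approximate_power`, the 5% control and 6% treatment conversion rates, and the visitor counts. It also ignores the tiny chance of rejecting the null in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(c_control, c_treatment, n_per_branch, alpha=0.05):
    """Approximate power of a two-sided test on c_T - c_C (normal approximation)."""
    std_normal = NormalDist()
    # Standard error of the difference when the alternate is true.
    std_error = sqrt(
        c_treatment * (1 - c_treatment) / n_per_branch
        + c_control * (1 - c_control) / n_per_branch
    )
    z_crit = std_normal.inv_cdf(1 - alpha / 2)          # edge of the rejection zone
    shift = abs(c_treatment - c_control) / std_error    # true effect in standard errors
    return std_normal.cdf(shift - z_crit)


# Hypothetical conversion rates: 5% control, 6% treatment.
power_5k = approximate_power(0.05, 0.06, n_per_branch=5_000)
power_10k = approximate_power(0.05, 0.06, n_per_branch=10_000)

print(f"power with  5,000 visitors per branch: {power_5k:.2f}")
print(f"power with 10,000 visitors per branch: {power_10k:.2f}")
```

With these assumed rates, 5,000 visitors per branch gives a power of roughly 0.59, well short of 80%, while 10,000 per branch gives roughly 0.87.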

Putting it together

Usually, the two charts I've shown you are shown looking like this. The sample sizes are chosen so that \(\alpha\) and \(\beta\) line up.



For our A/B test, here are the simplified steps in the process.

  1. Note the number of samples in each branch, in this case, the number of samples is the number of website visitors.
  2. Work out the conversion rate for the two branches and work out \( c_T - c_C \).
  3. Work out the probability of observing \( c_T - c_C \) if the null is true. (This is a simplification: we work out a p-value, which is the probability of observing a measurement at least as extreme as the one we're seeing, assuming the null is true.)
  4. Compare the p-value to \(\alpha\). If \(p < \alpha\) then we reject the null hypothesis (we believe the treatment had an effect). If \(p > \alpha\) we accept the null hypothesis (we believe the treatment had no effect).
  5. Work out the probability of observing \( c_T - c_C \) if the alternate is true. This is the observed power. The observed power should be greater than about 80%. An observed power lower than about 80% means the test is unreliable.
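Steps 1 to 4 can be sketched in a few lines of Python. This is one common way to do it (a pooled two-proportion z-test using the normal approximation), not necessarily what any particular testing tool does. The function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical. Step 5 is left out of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_t, n_t, conv_c, n_c):
    """Two-sided pooled z-test for the difference in conversion rates."""
    c_t = conv_t / n_t  # treatment conversion rate
    c_c = conv_c / n_c  # control conversion rate
    # Under the null, both branches share one conversion rate, so pool them.
    pooled = (conv_t + conv_c) / (n_t + n_c)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (c_t - c_c) / std_error
    # Probability of a result at least this extreme, in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return c_t - c_c, z, p_value


# Hypothetical data: 10,000 visitors per branch.
alpha = 0.05
difference, z, p_value = two_proportion_z_test(conv_t=580, n_t=10_000, conv_c=500, n_c=10_000)
reject_null = p_value < alpha

print(f"c_T - c_C = {difference:.4f}, z = {z:.2f}, p = {p_value:.4f}")
print("reject the null" if reject_null else "fail to reject the null")
```

With these made-up counts, the p-value comes out at a little over 0.01, so the null is rejected at \(\alpha = 5\%\).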

How to fail

When people new to statistics get involved in A/B testing, they sometimes make the mistake of focusing only on confidence (and p-values). This gives them insight into false positives, but it says nothing at all about false negatives. To put it bluntly, this incorrect process puts all the emphasis on the risk of doing something, but none at all on the risk of doing nothing. This kind of focus also leads to tests that are too short to be reliable.

Let me put this another way. Significance is about protecting you from buying something that doesn't work. Power is about protecting you from not buying something that works.

Why not just set the thresholds higher?

The widths of the normal distributions I've shown depend on the number of samples. The more samples there are, the narrower the curve. The thresholds depend on the narrowness of the curve. To put it simply, increasing confidence and power means increasing the number of samples in the test, which means a longer test. So all we need to do is increase the length of the test? Not so fast: the relationship isn't a linear one. Increasing power or significance by a few percentage points could double the length of the test depending on what the power and significance levels are.
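To show how quickly the sample size grows, here's a Python sketch based on the standard normal-approximation formula for comparing two proportions: \(n = (z_{1-\alpha/2} + z_{1-\beta})^2 \, [c_C(1-c_C) + c_T(1-c_T)] / (c_T - c_C)^2\) visitors per branch. The function name `sample_size_per_branch` and the 5% and 6% conversion rates are assumptions for illustration; real sample size calculators may use slightly different formulas.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_branch(c_control, c_treatment, alpha, power):
    """Approximate visitors needed per branch for a two-sided test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # threshold set by significance
    z_power = std_normal.inv_cdf(power)          # threshold set by power
    variance = c_control * (1 - c_control) + c_treatment * (1 - c_treatment)
    effect = c_treatment - c_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical conversion rates: 5% control, 6% treatment.
baseline = sample_size_per_branch(0.05, 0.06, alpha=0.05, power=0.80)
sizes = {}
for alpha, power in [(0.05, 0.80), (0.05, 0.95), (0.005, 0.80), (0.005, 0.95)]:
    n = sample_size_per_branch(0.05, 0.06, alpha=alpha, power=power)
    sizes[(alpha, power)] = n
    print(f"alpha={alpha}, power={power}: {n} per branch ({n / baseline:.2f}x baseline)")
```

Under these assumptions, moving from 80% to 95% power, or from a 5% to a 0.5% significance level, each adds well over half again to the sample size, and doing both more than doubles it.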

Where do these thresholds come from?

The choice of a confidence value of 95% is arbitrary and comes from statistical standard practice. There's a fierce ongoing debate in the social sciences about whether this threshold is appropriate; an emerging view is that it's too lax a standard. In a recent paper in Nature Human Behaviour, Benjamin et al. [Benjamin] argued passionately that 99.5% is a better threshold. 

Something similar applies to power. The 'industry standard' is 80%, a figure with a far murkier background [Cohen]. In my view, using this figure of 80% is wrong in almost all cases. 80% is a minimum. I'll have a lot more to say about power in another blog post.

Eye of newt and toe of frog...

I've talked glibly about accepting and rejecting hypotheses. This is a deliberate simplification on my part. The true statistical language is "fail to reject the null hypothesis" and "reject the null hypothesis". There are good fundamental reasons for using this language, but if you're not a statistical person, it's very confusing. I've chosen a simplified version to make my point.

The process for deciding an A/B test reads like a witches' brew recipe rather than a scientific process. It's reliant on arbitrary thresholds, some difficult concepts, and confusing language. The null hypothesis test itself is a shotgun marriage of techniques. Unsurprisingly, p-values are widely misinterpreted and misunderstood [Amrhein]. 

Fundamentally, the whole process is a witches' brew; it works, but it's not satisfying. 

Fortunately, there is an alternative view using a Bayesian approach which is simpler, and more enlightening. I'll talk about the Bayesian approach in another blog post. If the Bayesian approach is more satisfying, why did I show this (frequentist) approach here? Because this approach is what people are taught.

References

[Amrhein] Amrhein, V., Greenland, S., McShane, B. Scientists rise up against statistical significance. Nature 567, 305–307 (2019).

[Benjamin] Benjamin, D.J., Berger, J.O., Johannesson, M. et al. Redefine statistical significance. Nat Hum Behav 2, 6–10 (2018). https://doi.org/10.1038/s41562-017-0189-z

[Cohen] Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.