Important, but overlooked
Power is a crucial quantity in hypothesis testing, but sadly, many courses omit it, and it's often poorly understood if it's understood at all. To be clear: if you're doing any kind of A/B testing, you have to understand power.
In this blog post, I'm going to teach you all about power.
Hypothesis testing
All A/B tests, all randomized controlled trials (RCTs), and many other forms of testing are ultimately hypothesis tests; I've blogged about what this means before. To briefly summarize and simplify: we make a statement, measure the evidence for or against it, and use thresholds to make our decision.
With any hypothesis test, there are four possible outcomes (using simplified language):
- The null hypothesis is actually true (there is no effect):
  - We say there is no effect (true negative)
  - We say there is an effect (false positive)
- The null hypothesis is actually false (there is an effect):
  - We say there is no effect (false negative)
  - We say there is an effect (true positive)
I've summarized the possibilities in the table below.
| Decision about null hypothesis | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Fail to reject | True negative (correct inference); probability threshold = \( 1 - \alpha \) | False negative (Type II error); probability threshold = \( \beta \) |
| Reject | False positive (Type I error); probability threshold = \( \alpha \) | True positive (correct inference); probability threshold = power = \( 1 - \beta \) |
A lot of attention goes to \( \alpha \), called the significance level, which tells us the probability of a false positive. By contrast, power is the probability of detecting an effect if it's really there (a true positive); sadly, it doesn't get nearly the same level of focus.
By the way, there's some needless complexity here. It would seem more sensible to quote the two threshold numbers as \( \alpha \) and \( \beta \), because they're defined very similarly (the false positive and false negative probabilities). Unfortunately, statisticians tend to quote power (1 - \( \beta \)) rather than \( \beta \).
In pictures
To get a visual sense of what power is, let's look at how a null hypothesis test works in pictures. First, we assume the null is true and draw the acceptance and rejection regions on the probability distribution (first chart). To reject the null, our test result has to land in one of the red rejection regions in the first chart.
Now we assume the alternate hypothesis is true (second chart). This time we want our result to land in the blue region, and we want the probability of landing there (the power) to be at least some chosen value.
To be confident there is an effect, we want the power to be as high as possible.
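If you want to draw charts like these yourself, here's a minimal matplotlib sketch. The \( \alpha \) of 0.05 and the true effect of 2.8 standard errors are assumptions I've chosen purely for illustration, not values taken from the charts above.
```python
# A sketch of the two charts described above: the null distribution with its
# rejection regions, and the alternative distribution with the power shaded.
# The alpha level (0.05) and the effect size (2.8 standard errors) are
# illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

alpha = 0.05
effect = 2.8                               # assumed true effect, in standard errors
crit = norm.ppf(1 - alpha / 2)             # two-sided critical value (about 1.96)
x = np.linspace(-4, 7, 1000)

fig, (ax_null, ax_alt) = plt.subplots(2, 1, figsize=(7, 6), sharex=True)

# Top chart: if the null is true, we reject only when the result lands in a red tail.
ax_null.plot(x, norm.pdf(x), color="black")
ax_null.fill_between(x, norm.pdf(x), where=np.abs(x) >= crit, color="red", alpha=0.5)
ax_null.set_title("Null hypothesis true: red = rejection regions (probability = alpha)")

# Bottom chart: if the alternative is true, power is the chance of landing past
# the critical value, shown in blue.
ax_alt.plot(x, norm.pdf(x, loc=effect), color="black")
ax_alt.fill_between(x, norm.pdf(x, loc=effect), where=x >= crit, color="blue", alpha=0.5)
ax_alt.set_title("Alternative hypothesis true: blue = power (1 - beta)")

plt.tight_layout()
plt.show()
```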
Calculating power - before and after
Before we run a test, we calculate the sample size we need based on a couple of factors, including the power we want the test to have. For reasons I'll explain later, 80% or 0.8 is a common choice.
Once we've run the test and we have the test results, we then calculate the actual power based on the data we've recorded. It's very common for the actual power to be different from what we specified in our test design. If the actual power is too low, that may mean we have to continue the test or redesign it.
Unfortunately, power is hard to calculate; there's no convenient closed-form formula, and to make matters worse, some of the websites that offer power and sample size calculations give incorrect results. The G*Power package is probably the easiest tool for most people to use, though there are convenient libraries in R and Python that will calculate power for you. If you're going to understand power, you really do need to understand statistics.
To make all this understandable, let me walk you through a sample size calculation for a conversion rate A/B test for a website.
- A/B tests are typically large with thousands of samples, which means we're in z-test territory rather than t-test.
- We also need to decide what we're testing for. A one-sided test tests for a difference in one direction only (either greater than or less than); a two-sided test tests for a difference in either direction. Two-sided tests are more common because they're more informative. Some authors use the terms one-tailed and two-tailed instead of one-sided and two-sided.
- Now we need to define the thresholds for our test, which are \( \alpha \) and power. Common values are 0.05 and 0.8.
- Next, we need to look at the effect size. In the conversion test example, we might have a conversion rate of 2% on one branch and an expected conversion rate of 2.2% on the other branch.
| Test type | Tail(s) | \( \alpha \) | Power | Proportion 1 | Proportion 2 | Sample size |
|---|---|---|---|---|---|---|
| z-test | Two-tailed | 0.05 | 0.8 | 0.02 | 0.022 | 161,364 |
| z-test | Two-tailed | 0.05 | 0.95 | 0.02 | 0.022 | 267,154 |
The first row of the table shows a power of 80%, which leads to a sample size of 161,364. Increasing the power to 95% gives a sample size of 267,154, a big increase, and that's a problem: power varies non-linearly with sample size, as the screenshot below (from G*Power) shows for this data.
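If you'd rather do this calculation in code, here's a minimal Python sketch using the standard normal approximation for a two-sided, two-proportion z-test with equal group sizes. The function name and the rounding are my own choices, so treat it as an illustration; the results should be very close to, but not necessarily identical to, G*Power's.
```python
# Sample size for a two-sided, two-proportion z-test (normal approximation,
# equal group sizes). The function name and rounding are illustrative choices.
from math import ceil, sqrt
from scipy.stats import norm

def total_sample_size(p1, p2, alpha=0.05, power=0.8):
    """Total sample size (both groups combined) to detect p1 vs p2."""
    z_alpha = norm.ppf(1 - alpha / 2)        # critical value for a two-sided test
    z_beta = norm.ppf(power)                 # z-score matching the desired power
    p_bar = (p1 + p2) / 2                    # pooled proportion under the null
    n_per_group = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                    + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
                   / abs(p2 - p1)) ** 2
    return 2 * ceil(n_per_group)

print(total_sample_size(0.02, 0.022, power=0.8))   # about 161,000, matching the first row
print(total_sample_size(0.02, 0.022, power=0.95))  # about 267,000, matching the second row
```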
Conversion rates of 2% are typical for many retail sites. It's very rare that any technology will increase the conversion rate greatly. A 10% increase from 2% to 2.2% would be wonderful for a retailer and they'd be celebrating. Because of these numbers, you need a lot of traffic to make A/B tests work in retail, which means A/B tests can really only be used by large retailers.
Why not just reduce the power and reduce the sample size? Because that makes the results of the test less reliable; at some point, you might as well just flip a coin instead of running a test. A lot of A/B tests are run when a retailer is testing new ideas or new paid-for technologies. An A/B test is there to provide a data-oriented view of whether the new thing works or not. The thresholds are there to give you a known confidence in the test results.
After a test is done, or even partway through the test, we can calculate the observed power. Let's use G*Power and the numbers from the first row of the table above, but assume a sample size of 120,000. This gives a power of 0.67, way below what's useful and too close to a 50-50 split. Of course, it's possible that we observe a smaller effect than expected, and you can experiment with G*Power to vary the effect size and see the effect on power.
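Here's the same reverse calculation in Python, using the normal approximation again (the `achieved_power` name is mine, not G*Power's). It gives roughly 0.68 for 120,000 samples, in the same ballpark as G*Power's 0.67, and you can use it to see how quickly power collapses if the true effect is smaller than expected.
```python
# Power achieved by a given total sample size, for a two-sided, two-proportion
# z-test (normal approximation, equal group sizes).
from math import sqrt
from scipy.stats import norm

def achieved_power(p1, p2, n_total, alpha=0.05):
    """Approximate power to detect p1 vs p2 with n_total samples split evenly."""
    n = n_total / 2                          # per-group sample size
    p_bar = (p1 + p2) / 2                    # pooled proportion under the null
    z = ((abs(p2 - p1) * sqrt(n)
          - norm.ppf(1 - alpha / 2) * sqrt(2 * p_bar * (1 - p_bar)))
         / sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return norm.cdf(z)

print(achieved_power(0.02, 0.022, 120_000))  # roughly 0.68
print(achieved_power(0.02, 0.021, 161_364))  # roughly 0.3 if the true lift is only 2.1%
```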
A nightmare scenario
Let's imagine you're an analyst at a large retail company. There's a new technology which costs $500,000 a year to implement. You've been asked to evaluate the technology using an A/B test. Your conversion rate is 2% and the new technology promises a conversion rate of 2.2%. You set \( \alpha \) to 0.05 and power to 0.8, and calculate a sample size (which also gives you a test duration). The null hypothesis is that there is no effect (conversion rate of 2%) and the alternate hypothesis is that the conversion rate is 2.2%.
Your boss will ask you, "How sure are you of these results?" If you say there's no effect, they'll ask, "How sure are you there's no effect?" If you say there is an effect, they'll ask, "How sure are you there is an effect?" Think for a moment how you'd ideally like to answer these questions (100% sure is off the cards). The level of surety you can offer depends on your website traffic and the design of the test.
When the test is over, you calculate a p-value of 0.01, which is less than your \(\alpha\), so you reject the null hypothesis. In other words, you think there's an effect. Next you calculate power. Let's say you get a 0.75. Your threshold for accepting a conversion rate of 2.2% is 0.8. What's next?
It's quite possible that the technology works, but doesn't increase the conversion rate all the way to 2.2%. It might increase conversion to 2.05% or 2.1%, for example. These kinds of conversion rate lifts might not justify the cost of the technology.
What do you do?
You have four choices, each with positives and negatives.
- Reject the new technology because it didn't pass the test. This is a fast decision, but you run the risk of foregoing technology that would have helped the business.
- Carry on with the test until it reaches your desired power. Technically the best option, but it may take more time than you have available.
- Accept the technology despite the lower power. This is a risky bet, and it's very dangerous to do regularly (lower thresholds mean you make more mistakes).
- Try a test with a lower lift, say an alternate hypothesis that the conversion rate is 2.1%.
None of these options is great. You need a strong grasp of statistics to decide on the right way forward for your business.
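To put a rough number on the fourth option: re-running the earlier sample size sketch for 2.0% versus 2.1% (instead of 2.2%), at \( \alpha \) = 0.05 and power = 0.8, needs roughly four times the traffic.
```python
# Rough cost of testing for a smaller lift (2.0% vs 2.1%), using the same
# two-proportion z-test approximation as the earlier sketch.
from math import ceil, sqrt
from scipy.stats import norm

p1, p2, alpha, power = 0.02, 0.021, 0.05, 0.8
p_bar = (p1 + p2) / 2                        # pooled proportion under the null
n_per_group = ((norm.ppf(1 - alpha / 2) * sqrt(2 * p_bar * (1 - p_bar))
                + norm.ppf(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
               / abs(p2 - p1)) ** 2
print(2 * ceil(n_per_group))  # roughly 630,000 visitors, versus about 161,000 for the 2.2% test
```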
What's a good value?
The "industry standard" power is 80%, but where does this come from? It's actually a quote from Michael Cohen in his 1988 book "Statistical Power Analysis for the Behavioral Sciences", he said if you're stuck and can't figure out what the power should be, use 80% as a last result. Somehow the value of last resort has become an unthinking industry standard. But what value should you chose?
Let's go back to the definitions of \( \alpha \) and \( \beta \) (remember, \( \beta \) is 1 - power). \( \alpha \) is the probability of a false positive; \( \beta \) is the probability of a false negative. How do you balance these two kinds of error? Do you think a false positive is just as bad as a false negative, or is one worse than the other? The industry standard choices for \( \alpha \) and \( \beta \) are 0.05 and 0.20 (1 - 0.8), which implies we think a false positive is four times worse than a false negative. Is that what you intended? Is that ratio appropriate for your business?
In retail, including new technologies on a website comes with a cost, but there's also the risk of forgoing revenue if you get a false negative. I'm tempted to advise you to choose the same \( \alpha \) and \( \beta \) value of 0.05 (which gives a power of 95%). This does increase the sample size and may take it beyond the reach of some websites. If you're bumping up against the limits of your traffic when designing tests, it's probably better to use something other than an A/B test.
Why is power so misunderstood?
Conceptually, power is quite simple (the probability of a true positive), but it's wrapped up with the procedure for defining and using a null hypothesis test. Frankly, the whole null hypothesis setup is highly complex and unsatisfactory (Bayesian statistics may offer a better approach). My gut feeling is that \( \alpha \) is easy to understand, but once you get into the full language of null hypothesis testing, people get left behind, which means they don't understand power.
Not understanding power leaves you prone to making bad mistakes, like underpowering tests. An underpowered test might mean you reject technologies that could have increased your conversion rate. Conversely, underpowered tests can lead you to claim a bigger effect than is really there. Overall, it leaves you vulnerable to making the wrong decision.

