# What are we trying to achieve?

*probable* answer. We use statistical best practices to map a probability to a pass/fail answer.

# A typical A/B test

*average* conversion rate over *many* visitors, probably several thousand. Because of this, some very important mathematics kicks in, specifically the Central Limit Theorem. This theorem tells us our results will be normally distributed; in other words, \(c_T - c_C\) will be normally distributed, which is important as we'll see in a minute.
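We can see the Central Limit Theorem at work with a quick simulation. This is a minimal sketch with made-up numbers: a 10% control conversion rate, a 12% treatment conversion rate, and 5,000 visitors per branch are all illustrative assumptions, not figures from a real test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: true conversion rates and visitors per branch.
p_control, p_treatment, n = 0.10, 0.12, 5_000

# Simulate 10,000 repeats of the same A/B test and record c_T - c_C each time.
c_C = rng.binomial(n, p_control, size=10_000) / n    # control conversion rates
c_T = rng.binomial(n, p_treatment, size=10_000) / n  # treatment conversion rates
diffs = c_T - c_C

# The Central Limit Theorem says these differences pile up in a bell curve
# around the true difference (0.02).
print(f"mean difference: {diffs.mean():.4f}")
print(f"std of difference: {diffs.std():.4f}")
```

Plotting a histogram of `diffs` would show the familiar bell shape, even though each individual visitor's conversion is a yes/no coin flip, not a normal variable.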

# Types of error

| Decision about null hypothesis | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Fail to reject | True negative (correct inference). Probability threshold = 1 - \( \alpha \) | False negative (Type II error). Probability threshold = \( \beta \) |
| Reject | False positive (Type I error). Probability threshold = \( \alpha \) | True positive (correct inference). Probability threshold = power = 1 - \( \beta \) |

# Assuming the null is true

*is* biased. With statistics, we quantify this kind of analysis and set ground rules for what we consider evidence.

*But we might be wrong - we can never have certainty*. The size of the red area gives us the limits on our certainty. By convention, the red zone is 5% of the probability mass.
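In practice, that 5% red zone translates into a critical value for the standardized difference. A minimal sketch, using Python's standard library (the \( \alpha = 0.05 \) two-sided convention is the only input):

```python
from statistics import NormalDist

alpha = 0.05  # the conventional 5% red zone, split between both tails

# Under the null, the standardized difference z = (c_T - c_C) / SE follows
# a standard normal distribution. The red zone is its outer 5%.
critical = NormalDist().inv_cdf(1 - alpha / 2)
print(f"reject the null when |z| > {critical:.2f}")  # ≈ 1.96
```

This is where the well-known "1.96 standard errors" rule of thumb comes from.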

# Assuming the alternate is true

*lot* more to say about power in another blog post.

# Putting it together

- Note the number of samples in each branch, in this case, the number of samples is the number of website visitors.
- Work out the conversion rate for the two branches and work out \( c_T - c_C \).
- Work out the probability of observing \( c_T - c_C \) if the null is true. (This is a simplification: in practice we work out a p-value, the probability of observing a difference at least as extreme as the one we measured, assuming the null is true.)
- Compare the p-value to \(\alpha\). If \(p < \alpha\), we reject the null hypothesis (we believe the treatment had an effect). If \(p \geq \alpha\), we fail to reject the null hypothesis (we don't have evidence the treatment had an effect).
- Work out the probability of detecting a difference as large as the observed \( c_T - c_C \) if the alternate is true. This is the observed power, which should be greater than about 80%. An observed power much below 80% suggests the test is underpowered and its result is unreliable.
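The steps above can be sketched end to end as a two-proportion z-test. This is a minimal illustration, not production code: the conversion counts and sample sizes are made up, and the observed-power calculation uses the standard post-hoc approximation (treating the observed difference as the true one).

```python
from math import sqrt
from statistics import NormalDist

def ab_test(conv_C, n_C, conv_T, n_T, alpha=0.05):
    """Two-proportion z-test sketch. conv_* are raw conversion counts,
    n_* are visitor counts per branch."""
    c_C, c_T = conv_C / n_C, conv_T / n_T  # conversion rates
    diff = c_T - c_C

    # Standard error under the null, using the pooled conversion rate.
    pooled = (conv_C + conv_T) / (n_C + n_T)
    se = sqrt(pooled * (1 - pooled) * (1 / n_C + 1 / n_T))

    # p-value: probability of a difference at least this extreme
    # if the null is true (two-sided).
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Observed power: probability of landing in the rejection region
    # if the true difference equals the observed one.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    alt = NormalDist(diff, se)
    power = (1 - alt.cdf(z_crit * se)) + alt.cdf(-z_crit * se)

    return diff, p_value, power

# Illustrative numbers: 500/5000 control conversions vs 600/5000 treatment.
diff, p, power = ab_test(conv_C=500, n_C=5000, conv_T=600, n_T=5000)
print(f"difference: {diff:.3f}, p-value: {p:.4f}, power: {power:.2f}")
```

With these made-up numbers the p-value comes out well below \( \alpha = 0.05 \) and the observed power above 80%, so we would reject the null and trust the result.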

# How to fail

# Why not just set the thresholds higher?

# Where do these thresholds come from?

*et al.* [Benjamin] argued passionately that 99.5% is a better threshold.

# Eye of newt and toe of frog...

*is* a witches' brew; it works, but it's not satisfying.