Showing posts with label statistics. Show all posts
Showing posts with label statistics. Show all posts

# Tossing and turning

A few months ago, someone commented on one of my blog posts and asked how you work out if a coin is biased or not. I've been thinking about the problem since then. It's not a difficult one, but it does bring up some core notions in probability theory and statistics which are very relevant to understanding how A/B testing works, or indeed any kind of statistical test. I'm going to talk you through how you figure out if a coin is biased, including an explanation of some of the basic ideas of statistical tests.

# The trial

A single coin toss is an example of something called a Bernoulli trial, which is any kind of binary decision you can express as a success or failure (e.g. heads or tails). For some reason, most probability texts refer to heads as a success.

We can work out what the probability is of getting different numbers of heads from a number of tosses, or more formally, what's the probability $$P(k)$$ of getting $$k$$ heads from $$n$$ tosses, where $$0 < k ≤ n$$? By hand, we can do it for a few tosses:

 Number of heads (k) Combinations (n) Count Probability 0 TTT 1 1/8 1 HTT THT TTH 3 3/8 2 THH HTH HHT 3 3/8 4 HHH 1 1/8

But what about 1,000 or 1,000,000 tosses - we can't do this many by hand, so what can we do? As you might expect, there's a formula you can use:
$P(k) = \frac{n!} {k!(n-k)!} p^k (1-p)^{n-k}$
$$p$$ is the probability of success in any trial, for example, getting a head. For an unbiased coin $$p=0.5$$; for a coin that's biased 70% heads $$p=0.7$$.

If we plot this function for an unbiased coin ($$p=0.5$$), where $$n=100$$, and $$0 < k ≤ n$$, we see this probability distribution:

This is called a binomial distribution and it looks a lot like the normal distribution for large ($$> 30$$) values of $$n$$.

I'm going to re-label the x-axis as a score equal to the fraction of heads: 0 means all tails, 0.5 means $$\frac{1}{2}$$ heads, and 1 means all heads. With this slight change, we can more easily compare the shape of the distribution for different values of $$n$$.

I've created two charts below for an unbiased coin ($$p=0.5$$), one with $$n=20$$ and one with $$n=40$$. Obviously, the $$n=40$$ chart is narrower, which is easier to see using the score as the x-axis.

As an illustration of what these charts mean, I've colored all scores 0.7 and higher as red. You can see the red area is bigger for $$n=20$$ than $$n=40$$. Bear in mind, the red area represents the probability of a score of 0.7 or higher. In other words, if you toss a fair coin 20 times, you have a 0.058 chance of seeing a score of 0.7 or more, but if you toss a fair coin 40 times, the probability of seeing a 0.7 score drops to 0.008.

These charts tell us something useful: as we increase the number of tosses, the curve gets narrower, meaning the probability of getting results further away from $$0.5$$ gets smaller. If we saw a score of 0.7 for 20 tosses, we might not be able to say the coin was biased, but if we got a score of 0.7 after 40 tosses, we know this score is very unlikely so the coin is more likely to be biased.

# Thresholds

Let me re-state some facts:

• For any coin (biased or unbiased) any score from 0 to 1 is possible for any number of tosses.
• Some results are less likely than others; e.g. for an unbiased coin and 40 tosses, there's only a 0.008 chance of seeing a score of 0.7.

We can use probability thresholds to decide between biased and non-biased coins.  We're going to use a threshold (mostly called confidence) of 95% to decide if the coin is biased or not. In the chart below, the red areas represent 5% probability, and the blue areas 95% probability.

Here's the idea to work out if the coin is biased. Set a confidence value, usually at 0.05. Throw the coin $$n$$ times, record the number of heads and work out a score. Draw the theoretical probability chart for the number of throws (like the one I've drawn above) and color in 95% of the probabilities blue and 5% red. If the experimental score lands in the red zones, we'll consider the coin to be biased, if it lands in the blue zone, we'll consider it unbiased.

This is probabilistic decision-making. Using a confidence of 0.05 means we'll wrongly say a coin is biased 5% of the time. Can we make the threshold higher, could we use 0.01 for instance? Yes, we could, but the cost is increasing the number of trials.

As you might expect, there are shortcuts and we don't actually have to draw out the chart. In Python, you can use the binom_test function in the stats package.

To simplify, binom_test has three arguments:

• x - the number of successes
• n - the number of samples
• p - the hypothesized probability of success
It returns a p-value which we can use to make a decision.

Let's see how this works with a confidence of 0.05. Let's take the case where we have 200 coin tosses and 140 (70%) of them come up heads. We're hypothesizing that the coin is fair, so $$p=0.5$$.

from scipy import stats
print(stats.binom_test(x=140, n=200, p=0.5))

the p-value we get is 1.5070615573524992e-08 which is way less than our confidence threshold of 0.05 (we're in the red area of the chart above). We would then reject the idea the coin is fair.

from scipy import stats
print(stats.binom_test(x=115, n=200, p=0.5))

This time, the p-value is 0.10363903843786755, which is greater than our confidence threshold of 0.05 (we're in the blue area of the chart), so the result is consistent with a fair coin (we fail to reject the null).

# What if my results are not significant? How many tosses?

Let's imagine you have reason to believe the coin is biased. You throw it 200 times and you see 115 heads. binom_test tells you you can't conclude the coin is biased. So what do you do next?

The answer is simple, toss the coin more times.

The formulae for the sample size, $$n$$, is:

$n = \frac{p(1-p)} {\sigma^2}$

where $$\sigma$$ is the standard error.

Here's how this works in practice. Let's assume we think our coin is just a little biased, to 0.55, and we want the standard error to be $$\pm 0.04$$. Here's how many tosses we would need: 154. What if we want more certainty, say $$\pm 0.005$$, then the number of tosses goes up to 9,900. In general, the bigger the bias, the fewer tosses we need, and the more certainty we want the more tosses we need.

# If I think my coin is biased, what's my best estimate of the bias?

Let's imagine I toss the coin 1,000 times and see 550 heads. binom_test tells me the result is significant and it's likely my coin is biased, but what's my estimate of the bias? This is simple, it's actually just the mean, so 0.55. Using the statistics of proportions, I can actually put a 95% confidence interval around my estimate of the bias of the coin. Through math I won't show here, using the data we have, I can estimate the coin is biased 0.55 ± 0.03.

# Is my coin biased?

This is a nice theoretical discussion, but how might you go about deciding if a coin is biased? Here's a step-by-step process.

1. Decide on the level of certainty you want in your results. 95% is a good measure.
2. Decide the minimum level of bias you want to detect. If the coin should return heads 50% of the time, what level of bias can you live with? If it's biased to 60%, is this OK? What about biased to 55% or 50.5%?
3. Calculate the number of tosses you need.
5. Use binom_test to figure out if the coin deviates significantly from 0.5.

# What does data science boil down to?

Data science is a relatively new discipline that means different things to different people (most notably, to different employers). Some organizations focus solely on machine learning, while other lean on interpretation, and yet others get close to data engineering. In my view, all of these are part of the data science role.

I would argue data science generally is about three distinct areas:

• Prediction. The ability to accurately extrapolate from existing data sets to make forecasts about future behavior. This is the famous machine learning aspect and includes solutions like recommender systems.
• Distinction. The key question here is: "are these numbers different?". This includes the use of statistical techniques to decide if there's a difference or not, for example, specifying an A/B test and explaining its results.
• Interpretation. What are the factors that are driving the system? This is obviously related to prediction but has similarities to distinction too.

(A similar view of data science to mine: Calvin.Andrus, CC BY-SA 3.0, via Wikimedia Commons)

I'm going to talk through these areas and list the skills I think a data scientist needs. In my view, to be effective, you need all three areas. The real skill is to understand what type of problem you face and to use the correct approach.

# Distinction - are these numbers different?

This is perhaps the oldest area and the one you might disagree with me on. Distinction is firmly in the realm of statistics. It's not just about A/B tests or quasi-experimental tests, it's also about evaluating models too.

Here's what you need to know:

• Confidence intervals.
• Sample size calculations. This is crucial and often overlooked by experienced data scientists. If your data set is too small, you're going to get junk results so you need to know what too small is. In the real world. increasing the sample size is often not an option and you need to know why.
• Hypothesis testing. You should know the difference between a t-test and a z-test and when a z-test is appropriate (hint: sample size).
• α, β, and power. Many data scientists have no idea what statistical power is. If you're doing any kind of statistical testing, you need to have a firm grasp of power.
• The requirements for running a randomized control trial (RCT). Some experienced data scientists have told me they were analyzing results from an RCT, but their test just wasn't an RCT - they didn't really understand what an RCT was.
• Quasi-experimental methods. Sometimes, you just can't run an RCT, but there are other methods you can use including difference-in-difference, instrumental variables, and regression discontinuity.  You need to know which method is appropriate and when.
• Regression to the mean. This is why you almost always need a control group. I've seen experienced data scientists present results that could almost entirely be explained by regression to the mean. Don't be caught out by one of the fundamentals of statistics.

# Prediction - what will happen next?

This is the piece of data science that gets all the attention, so I won't go into too much detail.

Here's what you need to know:

• The basics of machine learning models, including:
• Generalized linear modeling
• Random forests (including knowing why they are often frowned upon)
• k-nearest neighbors/k-means clustering
• Support Vector Machines
• Cross-validation, regularization, and their limitations.
• Variable importance and principal component analysis.
• Loss functions, including RMSE.
• The confusion matrix, accuracy, sensitivity, specificity, precision-recall and ROC curves.

There's one topic that's not on any machine learning course or in any machine learning book that I've ever read, but it's crucially important: knowing when machine learning fails and when to stop a project.  Machine learning doesn't work all the time.

# Interpretation - what's going on?

The main techniques here are often data visualization. Statistical summaries are great, but they can often mislead. Charts give a fuller picture.

Here are some techniques all data scientists should know:

• Heatmaps
• Violin plots
• Scatter plots and curve fitting
• Bar charts
• Regression and curve fitting.

They should also know why pie charts in all their forms are bad.

A good knowledge of how charts work is very helpful too (the psychology of visualization).

# What about SQL and R and Python...?

You need to be able to manipulate data to do data science, which means SQL, Python, or R. But plenty of people use these languages without being data scientists. In my view, despite their importance, they're table stakes.

# Book knowledge vs. street knowledge

People new to data science tend to focus almost exclusively on machine learning (prediction in my terminology) which leaves them very weak on data analysis and data exploration; even worse, their lack of statistical knowledge sometimes leads them to make blunders on sample size and loss functions. No amount of cross-validation, regularization, or computing power will save you from poor modeling choices. Even worse, not knowing statistics can lead people to produce excellent models of regression to the mean.

Practical experience is hugely important; way more important than courses. Obviously, a combination of both is best, which is why PhDs are highly sought after; they've learned from experience and have the theoretical firepower to back up their practical knowledge.

# COVID and the base rate fallacy

## Should we be concerned that vaccinated people are getting COVID?

I’ve spoken to people who’re worried that the COVID vaccines aren’t effective because some vaccinated people catch COVID and are hospitalized. Let’s look at the claim and see if it stands up to analysis.

## Marc Rummy’s diagram

Marc Rummy created this diagram to explain what’s going on with COVID hospitalizations. He’s made it free to share, which is fantastic.

In this diagram, the majority of the population is vaccinated (91%). The hospitalization rate for the unvaccinated is 50% but for the vaccinated, it’s 10%. If the total population is 110, this leads to 5 unvaccinated people hospitalized and 10 vaccinated people hospitalized - in other words, 2/3 of those in hospital with COVID have been vaccinated.

## Explaining the result

Let’s imagine we just looked at hospitalizations: 5 unvaccinated and 10 vaccinated. This makes it look like vaccinations aren’t working – after all, the majority of people in hospital are vaccinated. You can almost hear ignorant journalists writing their headlines now (“Questions were raised about vaccine effectiveness when the health minister revealed the majority of patients hospitalized had been vaccinated.”). But you can also see anti-vaxxers seizing on these numbers to try and make a point about not getting vaccinated.

The reason why the numbers are the way they are is because the great majority of people are vaccinated

Let’s look at three different scenarios with the same population of 110 people and the same hospitalization rates for vaccinated and unvaccinated:

• 0% vaccinated – 55 people hospitalized
• 91% vaccinated – 15 people hospitalized
• 100% vaccinated – 11 people hospitalized

Clearly, vaccinations reduce the number of hospitalizations. The anti-vaccine argument seems to be, if it doesn't reduce the risk to zero, it doesn't work - which is a strikingly weak and ignorant argument.

In this example, vaccination doesn’t reduce the risk of infection to zero, it reduces it by a factor of 5. In the real world, vaccination reduces the risk of infection by 5x and the risk of death due to COVID by 13x (https://www.nytimes.com/interactive/2021/us/covid-cases.html). The majority of people hospitalized now appear to be unvaccinated even though vaccination rates are only just above 60% in most countries (https://www.nytimes.com/interactive/2021/world/covid-cases.html, https://www.masslive.com/coronavirus/2021/09/breakthrough-covid-cases-in-massachusetts-up-to-about-40-while-unvaccinated-people-dominate-hospitalizations.html).

The bottom line is very simple: if you want to reduce your risk of hospitalization and protect your family and community, get vaccinated.

## The base rate fallacy

The mistake the anti-vaxxers and some journalists are making is a very common one, it’s called the base rate fallacy (https://thedecisionlab.com/biases/base-rate-fallacy/). There are lots of definitions online, so I’ll just attempt a summary here: “the base rate fallacy is where someone draws an incorrect conclusion because they didn’t take into account the base rate in the general population. It’s especially a problem for conditional probability problems.”

Let’s use another example from a previous blog post:

“Imagine there's a town of 10,000 people. 1% of the town's population has a disease. Fortunately, there's a very good test for the disease:

• If you have the disease, the test will give a positive result 99% of the time (sensitivity).
• If you don't have the disease, the test will give a negative result 99% of the time (specificity).

You go into the clinic one day and take the test. You get a positive result. What's the probability you have the disease?”

The reason why the answer is 50% and not 99% is because 99% of the town’s population does not have the disease (the base rate), which means half of the positives will be false positives.

## What’s to be done?

Conditional probability (for example, the COVID hospitalization data) is screwy and can sometimes seem counter to common sense. The general level of statistical (and probability) knowledge in the population is poor. This leaves people trying to make sense of the data around them but without the tools to do it, so no wonder they’re confused.

It’s probably time that all schoolchildren are taught some basic statistics. This should include some counter-intuitive results (for example, the disease example above). Even if very few schoolchildren grow up to analyze data, it would be beneficial for society if more people understood that interpreting data can be hard and that sometimes surprising results occur – but that doesn’t make them suspicious or wrong.

More importantly, journalists need to do a much better job of telling the truth and explaining the data instead of chasing cheap clicks.

# $$\beta$$ is $$\alpha$$ if there's an effect

In hypothesis testing, there are two kinds of errors:

• Type I - we say there's an effect when there isn't. The threshold here is $$\alpha$$.
• Type II - we say there's no effect when there really is an effect. The threshold here is $$\beta$$.

This blog post is all about explaining and calculating $$\beta$$.

# The null hypothesis

Let's say we do an A/B test to measure the effect of a change to a website. Our control branch is the A branch and the treatment branch is the B branch. We're going to measure the conversion rate $$C$$ on both branches. Here are our null and alternative hypotheses:

• $$H_0: C_B - C_A = 0$$ there is no difference between the branches
• $$H_1: C_B - C_A \neq 0$$ there is a difference between the branches

Remember, we don't know if there really is an effect, we're using procedures to make our best guess about whether there is an effect or not, but we could be wrong. We can say there is an effect when there isn't (Type I error) or we can say there is no effect when there is (Type II error).

Mathematically, we're taking the mean of thousands of samples so the central limit theorem (CLT) applies and we expect the quantity $$C_B - C_A$$ to be normally distributed. If there is no effect, then $$C_B - C_A = 0$$, if there is an effect $$C_B - C_A \neq 0$$.

# $$\alpha$$ in a picture

Let's assume there is no effect. We can plot out our expected probability distribution and define an acceptance region (blue, 95% of the distribution) and two rejection regions (red, 5% of the distribution). If our measured $$C_B - C_A$$ result lands in the blue region, we will accept the null hypothesis and say there is no effect, If our result lands in the red region, we'll reject the null hypothesis and say there is an effect. The red region is defined by $$\alpha$$.

One way of looking at the blue area is to think of it as a confidence interval around the mean $$x_0$$:

$\bar x_0 + z_\frac{\alpha}{2} s \; and \; \bar x_0 + z_{1-\frac{\alpha}{2}} s$

In this equation, s is the standard error in our measurement. The probability of a measurement $$x$$ lying in this range is:

$0.95 = P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ]$

If we transform our measurement $$x$$ to the standard normal $$z$$, and we're using a 95% acceptance region (boundaries given by $$z$$ values of 1.96 and -1.96), then we have for the null hypothesis:

$0.95 = P[-1.96 < z < 1.96]$

# $$\beta$$ in a picture

Now let's assume there is an effect. How likely is it that we'll say there's no effect when there really is an effect? This is the threshold $$\beta$$.

To draw this in pictures, I want to take a step back. We have two hypotheses:

• $$H_0: C_B - C_A = 0$$ there is no difference between the branches
• $$H_1: C_B - C_A \neq 0$$ there is a difference between the branches

We can draw a distribution for each of these hypotheses. Only one distribution will apply, but we don't know which one.

If the null hypothesis is true, the blue region is where our true negatives lie and the red region is where the false positives lie. The boundaries of the red/blue regions are set by $$\alpha$$. The value of $$\alpha$$ gives us the probability of a false positive.

If the alternate hypothesis is true, the true positives will be in the green region and the false negatives will be in the orange region. The boundary of the green/orange regions is set by $$\beta$$. The value of $$\beta$$ gives us the probability of a false negative.

# Calculating $$\beta$$

Calculating $$\beta$$ is calculating the orange area of the alternative hypothesis chart. The boundaries are set by $$\alpha$$ from the null hypothesis. This is a bit twisty, so I'm going to say it again with more words to make it easier to understand.

$$\beta$$ is about false negatives. A false negative occurs when there is an effect, but we say there isn't. When we say there isn't an effect, we're saying the null hypothesis is true. For us to say there isn't an effect, the measured result must lie in the blue region of the null hypothesis distribution.

To calculate $$\beta$$, we need to know what fraction of the alternate hypothesis lies in the acceptance region of the null hypothesis distribution.

Let's take an example so I can show you the process step by step.

1. Assuming the null hypothesis, set up the boundaries of the acceptance and rejection region. Assuming a 95% acceptance region and an estimated mean of x, this gives the acceptance region as:
$P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ]$ which is the mean and 95% confidence interval for the null hypothesis. Our measurement $$x$$ must lie between these bounds.
2. Now assume the alternate hypothesis is true. If the alternate hypothesis is true, then our mean is $$\bar x_1$$.
3. We're still using this equation from before, but this time, our distribution is the alternate hypothesis.
$P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] ]$
4. Transforming to the standard normal distribution using the formula $$z = \frac{x - \bar x_1}{\sigma}$$, we can write the probability $$\beta$$ as:
$\beta = P \left [ \frac{\bar x_0 + z_\frac{\alpha}{2} s - \bar x_1}{s} < z < \frac{ \bar x_0 + z_{1-\frac{\alpha}{2}} s - \bar x_1}{s} \right ]$

This time, let's put some numbers in.

• $$n = 200,000$$ (100,000 per branch)
• $$C_B = 0.062$$
• $$C_A = 0.06$$
• $$\bar x_0= 0$$ - the null hypothesis
• $$\bar x_1 = 0.002$$ - the alternate hypothesis
• $$s = 0.00107$$  - this comes from combining the standard errors of both branches, so $$s^2 = s_A^2 + s_B^2$$, and I'm using the usual formula for the standard error of a proportion, for example, $$s_A = \sqrt{\frac{C_A(1-C_A)}{n} }$$

Plugging them all in, this gives:
$\beta = P[ -3.829 < z < 0.090]$
which gives $$\beta = 0.536$$

# This is too hard

This process is complex and involves lots of steps. In my view, it's too complex. It feels to me that there must be an easier way of constructing tests. Bayesian statistics holds out the hope for a simpler approach, but widespread adoption of Bayesian statistics is probably a generation or two away. We're stuck with an overly complex process using very difficult language.

# Scientific fraud and business manipulation

Sadly, there's a long history of scientific fraud and misrepresentation of data. Modern computing technology has provided better tools for those trying to mislead, but the fortunate flip side is, modern tools provide ways of exposing misrepresented data. It turns out, the right tools can indicate what's really going on.

(Author: Nick Youngson. License: Creative Commons. Source: Wikimedia)

In business, companies often say they can increase sales, or reduce costs, or do so some other desirable thing. The evidence is sometimes in the form of summary statistics like means and standard deviations. Do you think you could assess the credibility of evidence based on the mean and standard deviation summary data alone?

In this blog post, I'm going to talk about how you can use one tool to investigate the credibility of mean and standard deviation evidence.

# Discrete quantities

Discrete quantities are quantities that can only take discrete values. An example is a count, for example, a count of the number of sales. You can have 0, 1, 2, 3... sales, but you can't have -1 sales or 563.27 sales.

Some business quantities are measured on scales of 1 to 5 or 1 to 10, for example, net promoter scores or employee satisfaction scores. These scales are often called Likert scales.

For our example, let's imagine a company is selling a product on the internet and asks its customers how likely they are to recommend the product. The recommendation is on a scale of 0 to 10, where 0 is very unlikely to recommend and 10 is very likely to recommend. This is obviously based on the net promoter idea, but I'm simplifying things here.

Very unlikely to recommend                   Very likely to recommend
0 1 2 3 4 5 6 7 8 9 10

Imagine the salesperson for the company tells you the results of a 500-person study are a mean of 9 and a standard deviation of 2.5. They tell you that customers love the product, but obviously, there's some variation. The standard deviation shows you that not everyone's satisfied and that the numbers are therefore credible.

But are these numbers really credible?

Stop for a second and think about it. It's quite possible that their customers love the product. A mean of 9 on a scale of 10 isn't perfection, and the standard deviation of 2.5 suggests there is some variation, which you would expect. Would you believe these numbers?

# Investigating credibility

We have three numbers; a mean, a standard deviation, and a sample size. Lots of different distributions could have given rise to these numbers, how can we backtrack to the original data?

The answer is, we can't fully backtrack, but we can investigate possibilities.

In 2018, a group of academic researchers in The Netherlands and the US released software you can use to backtrack to possible distributions from mean and standard deviation data. Their goal was to provide a tool to help investigate academic fraud. They wrote up how their software works and published it online, you can read their writeup here. They called their software SPRITE (Sample Parameter Reconstruction via Iterative TEchniques) and made it open-source, even going so far as to make a version of it available online. The software will show you the possible distributions that could give rise to the summary statistics you have.

One of the online versions is here. Let's plug in the salesperson's numbers to see if they're credible.

If you go to the SPRITE site, you'll see a menu on the left-hand side. In my screenshot, I've plugged in the numbers we have:

• Our scale goes from 0 to 10,
• Our mean is 9,
• Our standard deviation is 2.5,
• The number of samples is 500.
• We'll choose 2 decimal places for now
• We'll just see the top 9 possible distributions.

Here are the top 9 results.

Something doesn't smell right.  I would expect the data to show some form of more even distribution about the mean. For a mean of 9, I would expect there to be a number of 10s and a number of 8s too. These estimated distributions suggest that almost everyone is deliriously happy, with just a small handful of people unhappy. Is this credible in the real world? Probably not.

I don't have outright evidence of wrongdoing, but I'm now suspicious of the data. A good next step would be to ask for the underlying data. At the very least, I should view any other data the salesperson provides with suspicion. To be fair to the salesperson, they were probably provided with the data by someone else.

What if the salesperson had given me different numbers, for example, a mean of 8.5, a standard deviation of 1.2, and 100 samples? Looking at the results from SPRITE, the possible distributions seem much more likely. Yes, misrepresentation is still possible, but on the face of it, the data is credible.

# Did you spot the other problem?

There's another, more obvious problem with the data. The scale is from 0 to 10, but the results are a mean of 9 and a standard deviation of 2.5, which implies a confidence interval of 6.5 to 11.5. To state the obvious, the maximum score is 10 but the upper range of the confidence interval is 11.5. This type of mistake is very common and doesn't of itself indicate fraud. I'll blog more about this type of mistake later.

# What does this mean?

Due diligence is about checking claims for veracity before spending money. If there's a lot of money involved, it behooves the person doing the due diligence to check the consistency of the numbers they've been given. Tools like SPRITE are very helpful for sniffing out areas to check in more detail. However, just because a tool like SPRITE flags something up it doesn't mean to say there's fraud; people make mistakes with statistics all the time. However, if something is flagged up, you need to get to the bottom of it.

# What is truth?

Statistical testing is ultimately all about probabilities and thresholds for believing an effect is there or not. These thresholds and associated ideas are crucial to the decision-making process but are widely misunderstood and misapplied. In this blog post, I'm going to talk about three testing concepts: confidence, significance, and p-values; I'll deal with the hugely important topic of statistical power in a later post.

(peas - not p-values. Author: Malyadri, Source: Wikimedia Commons, License: Creative Commons)

# Types of error

To simplify, there are two kinds of errors in statistical testing:

• Type I - false positive. You say there's an effect when there isn't. This is decided by a threshold $$\alpha$$, usually set to 5%. $$\alpha$$ is called significance.
• Type II - false negative. You say there isn't an effect but there is. This is decided by a threshold $$\beta$$ but is usually expressed as the statistical power which is $$1 - \beta$$.

In this blog post, I'm going to talk about the first kind of error, Type I.

# Distribution and significance

Let's imagine you're running an A/B test on a website and you're measuring conversion on the A branch ($$c_A$$) and on the B branch ($$c_B$$). The null hypothesis is that there's no effect, which we can write as:

$H_0: c_A - c_B = 0$

This next piece is a little technical but bear with me. Tests of conversion are usually large tests (mostly, > 10,000 samples in practice). The conversion rate is the mean conversion for all website visitors. Because there are a large number of samples, and we're measuring a mean, the Central Limit Theorem (CLT) applies, which means the mean conversion rates will be normally distributed. By extension from the CLT, the quantity $$c_A - c_B$$ will also be normally distributed. If we could take many measurements of $$c_A - c_B$$ and the null hypothesis was true, we would theoretically expect the results to look like something like this.

Look closely at the chart. Although I've cut off the x-axis, the values go off to $$\pm \infty$$. If all values of $$c_A - c_B$$ are possible, how can we reject the null hypothesis and say there's an effect?

# Significance - $$\alpha$$

To decide if there's an effect there, we use a threshold. This threshold is referred to as the level of significance and is called $$\alpha$$. It's usually set at the 0.05 or 5% level. Confusingly, sometimes people refer to confidence instead, which is 1 - significance, so a 5% significance level corresponds to a 95% confidence level.

In the chart below, I've colored the 95% region around the mean value blue and the 5% region (2.5% at each end) red. The blue region is called the acceptance region and the red region is called the rejection region.

What we do is compare the measurement we actually make with the chart. If our measurement lands in the red zone, we decide there's an effect there (reject the null), if our measurement lands in the blue zone, we'll decide there isn't an effect there (fail to reject the null or 'accept' the null).

# One-sided or two-sided tests

On the chart with the blue and the red region, there are two rejection (red) regions. This means we'll reject the null hypothesis if we get a value that's more than a threshold above or below our null value. In most tests, this is what we want; we're trying to detect if there's an effect there or not and the effect can be positive or negative. This is called a two-sided test because we can reject the null in two ways (too negative, too positive).

But sometimes, we only want to detect if the treatment effect is bigger than the control. This is called a one-sided test. Technically, the null hypothesis in this case is:

$H_0: c_A - c_B \leq 0$

Graphically, it looks like this:

So we'll reject the null hypothesis only if our measured value lands in the red region on the right. Because there's only one rejection region and it's on one side of the chart, we call this a one-sided test.

# p-values

I've very blithely talked about measured values landing in the rejection region or not. In practice, that's not what we do; in practice, we use p-values.

Let's say we measured some value x. What's the probability we would measure this value if the null hypothesis were true (in other words, if there were no effect)? Technically, zero because the distribution is continuous, but that isn't helpful. Let's try a more helpful form of words. Assuming the null hypothesis is true, what's the probability we would see a value of x or a more extreme value? Graphically, this looks something like the green area on the chart below.

Let me say again what the p-value is. If there's no effect at all, it's the probability we would see the result (or a more extreme result) that we measured, or how likely is it that our measurement could have been due to chance alone?

The chart below shows the probability the null hypothesis is true; it shows the acceptance region (blue), the rejection region (red), and the measurement p-value (green). The p-value is in the rejection region, so we'll reject the null hypothesis in this case. If the green overlapped with the blue region, we would accept (fail to reject) the null hypothesis.

# Misunderstandings

There are some common misunderstandings around testing that can have quite bad commercial effects.

• 95% confidence is too high a bar - we should drop the threshold to 90%. In effect, this means you'll accept a lot of changes that have no effect. This will reduce the overall effectiveness of your testing program (see this prior blog post for an explanation).
• One-sided tests are OK and give a smaller sample size, so we should use them. This is true, but it's often important to determine if a change is having a negative effect. In general, hypothesis testing tests a single hypothesis, but sadly, people try and read more into test results than they should and want to answer several questions with a single question.
• p-values represent the probability of an effect being present. This is just not true at all.
• A small p-value indicates a big effect. p-values do not make any indication about the size of an effect; a low p-value does not mean there's a big effect.

# Practical tensions

In practice, there can be considerable tension between business and technical people over statistical tests. A lot of statistical practices (e.g. 5% significance levels, two-sided testing) are based on experience built up over a long time. Unfortunately, this all sounds very academic to the business person who needs results now and wants to take shortcuts. Sadly, in the long run, shortcuts always catch you up. There's an old saying that's very true: "there ain't no such thing as a free lunch."

# Why use the Poisson distribution?

Because it has properties that make it great to work with, data scientists use the Poisson distribution to model different kinds of counting data. But these properties can be seductive, and sometimes people model data using the Poisson distribution when they shouldn't. In this blog post, I'll explain why the Poisson distribution is so popular and why you should think twice before using it.

(Siméon-Denis Poisson by E. Marcellot, Public domain, via Wikimedia Commons)

# Poisson processes

The Poisson distribution is a discrete event probability distribution used to model events created using a Poisson process. Drilling down a level, a Poisson process is a series of events that have these properties:

• They occur at random but at a constant mean rate,
• They are independent of one another,
• Two (or more) events can't occur at the same time

Good examples of Poisson processes are website visits, radioactive decay, and calls to a help center.

# The properties of a Poisson distribution

Mathematically, the Poisson probability mass function looks like this:

$P_r (X=k) = \frac{\lambda^k e^{- \lambda}}{k!}$

where

• k is the number of events (always an integer)
• $$\lambda$$ is the mean value (or expected rate)

It's a discrete distribution, so it's only defined for integer values of $$k$$.

Graphically, it looks like this for $$\lambda=6$$. Note that it isn't symmetrical and it stops at 0, you can't have -1 events.

(Let's imagine we were modeling calls per hour in a call center. In this case, $$k$$ is the measured calls per hour, $$P$$ is their frequency of occurrence, and $$\lambda$$ is the mean number of calls per hour).

Here are some of the Poisson distribution's properties:

• Mean: $$\lambda$$
• Variance: $$\lambda$$
• Mode: floor($$\lambda$$)

The fact that some of the key properties are given by $$\lambda$$ alone makes using it easy. If your data follows a Poisson distribution, once you know the mean value, you've got the variance (and standard deviation), and the mode too. In fact, you've pretty much got a full description of your data's distribution with just a single number.

# When to use it and when not to use it

Because you can describe the entire distribution with just a single number, it's very tempting to assume that any data that involves counting follows a Poisson distribution because it makes analysis easier.  Sadly, not all counts follow a Poisson distribution. In the list below, which counts do you think might follow a Poisson distribution and which might not?

• The number of goals in English Premier League soccer matches.
• The number of earthquakes of at least a given size per year around the world.
• Bus arrivals.
• The number of web pages a person visits before they make a purchase.

Bus arrivals are not well modeled by a Poisson distribution because in practice they're not independent of one another and don't occur at a constant rate. Bus operators change bus frequencies throughout the day, with more buses scheduled at busy times; they may also hold buses at stops to even out arrival times. Interestingly, bus arrivals are one of the textbook examples of a Poisson process, which shows that you need to think before applying a model.

The number of web pages a person visits before they make a purchase is better modeled using a negative binomial distribution

Earthquakes are well-modeled by a Poisson distribution. Earthquakes in different parts of the world are independent of one another and geological forces are relatively constant, giving a constant mean rate for quakes. It's possible that two earthquakes could happen simultaneously in different parts of the world, which shows that even if one of the criteria might not apply, data can still be well-modeled by Poisson.

What about soccer matches? We know two goals can't happen at the same time. The length of matches is fixed and soccer is a low-scoring game, so the assumption of a constant rate for goals is probably OK. But what about independence? If you've watched enough soccer, you know that the energy level in a game steps up as soon as a goal is scored. Is this enough to violate the independence requirement? Apparently not, scores in soccer matches are well-modeled by a Poisson distribution.

# What should a data scientist do?

Just because the data you're modeling is a count doesn't mean it follows a Poisson distribution. More generally, you should be wary of making choices motivated by convenience. If you have count data, look at the properties of your data before deciding on a distribution to model it with.

# Why should you care about probability distributions?

Using the wrong probability distribution can be extremely expensive for businesses:

• for businesses using machinery (factories, vehicles, aircraft, etc.), it can lead to parts being changed too frequently or too infrequently
• for businesses relying on returning customers, it can lead to substantial under or over-estimates of revenue and/or targeting the wrong customers with promotions
• for businesses forecasting future sales by territory and/or product, it can lead to poor territory allocation or poor product resource allocation.

Given that it's so important, what is a probability distribution, and what are some examples?

# What's a probability distribution?

At its simplest, a probability distribution tells you how likely an outcome is given some input. For example, how is sales probability distributed by price, or how likely is a component to fail in the next month?

If something is certain to occur, the probability is 1, if it's certain not to occur, the probability is zero.  Let's imagine a component lasts a maximum of 6 months before failure. Our probability distribution might show the probability of failure on days 1 to 180. The sum of all failure probabilities for all days must sum to 1.

In the real world, data is noisy and we don't expect real data to exactly follow theoretical distributions, but given enough data, the match should be close enough for us to reason about what's going on.

# Uniform distribution - gambling and manufacturing

If the probability is the same for all input values, the distribution is uniform.

Let's imagine we're manufacturing candy, and we want to have equal numbers of red, blue, green, black, and white sweets in a packet. In theory, here's what we should observe.

But in reality, there's random noise so we might see something like this below. We can quantify the difference between the expected distribution and the actual distribution, which tells us something about the variability in the manufacturing process.

The uniform distribution also occurs in gambling, for example, lotteries or dice games.

Uniform distribution description by NIST

# Binomial distribution - pass/fail and conversion

Each customer who comes into a store or who visits a website will either buy or not buy, which we can turn into a conversion rate. We can model these kinds of pass/fail processes using the binomial distribution. Here's the probability distribution.

The binomial distribution shows us the probability of measuring different results given an underlying 'truth'. Let's imagine the 'true' conversion rate was 0.04, we might not measure 0.04 due to sampling error, instead, we might measure 0.045 or 0.055, depending on how many samples we take. It's important to understand what this means:

• There is uncertainty in our measurement.
• The smaller the sample, the bigger the uncertainty.

Although many technical people understand this, most non-technical people do not, which can lead to tension.

Yale stats

# Poisson distribution - waiting in line

Imagine you're a bank serving customers with ATMs at a location. ATMs are expensive, but you don't want to keep people waiting in long lines to do their transactions, it's bad for business. So how do you balance the cost of an ATM against its use? By modeling how many people are using the ATM over a time period.

It turns out, the number of people who visit an ATM over a time period can be modeled using the Poisson distribution, which I've shown below. This gives us a way of assessing how much variation there might be in usage and therefore how many machines we might want to install.

The Poisson distribution is often used to model counting processes. It's very attractive because it has an unusual feature, the standard deviation for the distribution is $$\sqrt{\gamma}$$ where $$\gamma$$ is the mean. Unfortunately, this property makes it a little too attractive; it's sometimes used when it shouldn't be.

The Poisson Distribution and Poisson Process Explained

# Exponential distribution

How long does a car battery last? How long do phone calls last? When will the next earthquake occur? These durations typically follow the exponential distribution (which is strongly related to the Poisson distribution). I've shown this distribution below.

The exponential distribution

# Power law distribution - finding fraud

How are incomes distributed in a population? How might you find fraud in the pattern of digits in expenses? It turns out, the distribution of the first digits in invoices follows a power-law distribution. The chart below shows a generic power-law distribution - for fraud detection, it's 'flipped'.

Power law distribution

# Normal distribution - almost everywhere, but not quite

What's the probability distribution for male soldiers' chest measurements? How are the results of A/B tests distributed? What about the distribution of measurement errors? All these, and many, many more follow the normal distribution, which is also called the Gaussian distribution or the bell curve. If you only learn one distribution, this is the one to learn.

The properties of this distribution are extremely well-known, and every student of statistics and probability theory will know them. It's ubiquitous because of something called the Central Limit Theorem, which, simplifying a great deal, says that the sum of samples from any distribution follows a normal distribution.

Because it's everywhere, for some people, it's the only distribution they know. Like the old saying goes, if you only have a hammer, every problem is a nail. This distribution can be over-used, with bad consequences.

Here's the distribution. It ought to look familiar.

The normal distribution

# Lognormal distribution

How long do visitors spend on web pages? What about the distribution of internet traffic? Or the distribution of city sizes? These all follow a log-normal distribution that looks like the example below. The lognormal distribution is quite common in business.

Note the 'fat tail' or 'long tail' on the right-hand side. Many businesses have been caught out because they assumed sales or market risk followed a normal distribution when in fact they followed a lognormal distribution.

There's a variation of the Central Limit Theorem that yields log-normal distributions instead of normal distributions.

# Other distributions

There are lots and lots of different distributions. I saw a list of 90 the other day. Almost all of them are esoteric and apply in a very limited set of cases. You don't have to know all of them but you should be aware that choosing the right distribution is important to make the correct estimates. The distributions I've listed in this blog post are probably the most important, and you should know them and their properties.

As you asked nicely, here is a list of some distributions.

Alpha Distribution
Anglit Distribution
Arcsine Distribution
Beta Distribution
Beta Prime Distribution
Burr Distribution
Burr12 Distribution
Cauchy Distribution
Chi Distribution
Chi-squared Distribution
Cosine Distribution
Double Gamma Distribution
Double Weibull Distribution
Erlang Distribution
Exponential Distribution
Exponentiated Weibull Distribution
Exponential Power Distribution
Fatigue Life (Birnbaum-Saunders) Distribution
Fisk (Log Logistic) Distribution
Folded Cauchy Distribution
Folded Normal Distribution
Fratio (or F) Distribution
Gamma Distribution
Generalized Logistic Distribution
Generalized Pareto Distribution
Generalized Exponential Distribution
Generalized Extreme Value Distribution
Generalized Gamma Distribution
Generalized Half-Logistic Distribution
Generalized Inverse Gaussian Distribution
Generalized Normal Distribution
Gilbrat Distribution
Gompertz (Truncated Gumbel) Distribution
Gumbel (LogWeibull, Fisher-Tippetts, Type I Extreme Value) Distribution
Gumbel Left-skewed (for minimum order statistic) Distribution
HalfCauchy Distribution
HalfNormal Distribution
Half-Logistic Distribution
Hyperbolic Secant Distribution
Gauss Hypergeometric Distribution
Inverted Gamma Distribution
Inverse Normal (Inverse Gaussian) Distribution
Inverted Weibull Distribution
Johnson SB Distribution
Johnson SU Distribution
KSone Distribution
KStwo Distribution
KStwobign Distribution
Laplace (Double Exponential, Bilateral Exponential) Distribution
Left-skewed Lévy Distribution
Lévy Distribution
Logistic (Sech-squared) Distribution
Log Double Exponential (Log-Laplace) Distribution
Log Gamma Distribution
Log Normal (Cobb-Douglass) Distribution
Log-Uniform Distribution
Maxwell Distribution
Mielke’s Beta-Kappa Distribution
Nakagami Distribution
Noncentral chi-squared Distribution
Noncentral F Distribution
Noncentral t Distribution
Normal Distribution
Normal Inverse Gaussian Distribution
Pareto Distribution
Pareto Second Kind (Lomax) Distribution
Power Log Normal Distribution
Power Normal Distribution
Power-function Distribution
R-distribution Distribution
Rayleigh Distribution
Rice Distribution
Reciprocal Inverse Gaussian Distribution
Semicircular Distribution
Student t Distribution
Trapezoidal Distribution
Triangular Distribution
Truncated Exponential Distribution
Truncated Normal Distribution
Tukey-Lambda Distribution
Uniform Distribution
Von Mises Distribution
Wald Distribution
Weibull Maximum Extreme Value Distribution
Weibull Minimum Extreme Value Distribution
Wrapped Cauchy Distribution

# Continuous or discrete - shaken or stirred?

Some quantities are discrete and some are continuous. A discrete quantity is something like a sales territory (e.g. Germany, Ireland, Spain) or customer count (you can't have 0.5 of a customer). A continuous quantity can take any value, for example, speed can be 45.2 kph, 120.01 kph, and so on. Some distributions apply to both continuous and discrete, and some apply only to continuous or discrete. To muddy the waters, sometimes continuous distributions are used to approximately model discrete quantities.

## Vehicles

Imagine you're running a delivery vehicle fleet. You need to keep your vehicles on the road, but you need to keep an eye on maintenance costs. You decide to use math to guide your decisions, so you work out the average lifetime for different components. You have two components A and B with the same lifetimes in miles. If either component fails, you have to tow the vehicle, which is very expensive.

• Component A. Lifetime is 150,000 miles.
• Component B. Lifetime is 150,000 miles.

A vehicle comes in for maintenance with 149,000 miles on the odometer. Should you replace components A and B?

As you might expect, there's a gotcha. Without knowing the probability distribution for failures, we can't make these decisions. For example, a windshield might have a uniform failure rate distribution, with the probability of failure for miles 1-100 the same as the probability of failure for miles 100,000-100,100. A clutch may have a failure rate that increases with mileage, the probability of failure at miles 100,000-100,100 being much higher than the probability of failure at miles 0-100. Because we know what a clutch and a windshield are, we might decide to replace the clutch and leave the windshield. But what if A and B were a serpentine belt and a heat shield?

The only way to make rational decisions is to understand what distribution the probability of failure follows, which may well be very different for different components (e.g. car seats vs. tires).

## Marketing

A new analyst is studying the market for luxury goods in Germany. They have partial data for the fraction of the population that have a certain income. Using what they have, they assume their data is normally distributed and they make a forecast for the fraction of the population that will have an income high enough to afford luxury items. Do you think their forecast will be too low, just right, or too high?

Incomes are usually log-normally distributed, so the analyst, in this case, has chosen the wrong distribution. Because the lognormal has a very long right tail, the analyst's estimate is likely to be an underestimate and may be substantially out. A competitor might not make the same mistake.

# Takeaways

I've interviewed people who claim data science on their resumes, but only know the normal distribution. If you assume your data is normal, when in reality it's log-normal or Poisson, things are going to go badly wrong for you. Any analyst in business needs to be very comfortable with different distributions and needs to know which may be applicable and when.

# Does God play dice with the universe?

Imagine I gave you an ordinary die, not special in any way, and I asked you to throw the die and record your results (how many 1s, how many 2s, etc.). What would you expect the results to be? Do you think you could win by choosing some numbers rather than others? Are you sure?

(Image source: Wikimedia Commons. Author: Diacritica. License: Creative Commons.)

# What you might expect

Let's say you thew the die 12,000 times, you might expect a probability distribution something like this. This is a uniform distribution where all results are equally likely.

You know you'll never get an absolutely perfect distribution, so in reality, your results might look something like this for 12,000 throws.

The deviations from the expected values are random noise that we can quantify. Further, we know that by adding more dice throws, random noise gets less and less and we approach the ideal uniform distribution more closely.

I've simulated dice throws in the plots below, the top chart is 12,000 throws and the chart on the bottom is 120,000 throws. The blue bars represent the actual results, the black circle represents the expected value, and the black line is the 95% confidence interval. Note how the results for 120,000 throws are closer to the ideal than the results from 12,000 throws.

# What happened in reality - not what you expect

My results are simulations, but what happens when you throw dice thousands of times in the real world?

There's a short history of probability theorists and statisticians throwing dice and recording the results.

• Weldon threw 12 dice 26,306 times by hand and sent the results to his friend Francis Galton.
• Iversen ran an experiment where 219 dice were rolled 20,000 times.

Weldon's data set is widely used to illustrate statistical concepts, especially after Pearson used it to explain his $$\chi^2$$ technique in 1900.

Despite the excitement you see at the craps tables in Las Vegas, throwing dice thousands of times is dull and is, therefore, an ideal job for a computer. In 2009, Zachariah Labby created apparatus for throwing dice and recording the scores using a camera and image processing. You can read more about his apparatus and experimental setup here. He 'threw' 12 dice 26,306 times and his machine recorded the results.

In the chart below, the blue bars are his results, the black circle is the expected result, and the black line is the 95% confidence interval. I've taken the results from all 12 dice, so my throw count is $$12 \times 26,306$$.

This doesn't look like a uniform distribution. To state the obvious, 1 and 6 occurred more frequently than theory would suggest - the deviation from the uniform distribution is statistically significant. The dice he used were not special dice, they were off-the-shelf standard unbiased dice. What's going on?

# Unbiased dice are biased

Take a very close look at a normal die, the type pictured at the start of this post which is the kind of die you buy in shops.

By convention, opposite faces on dice sum to 7, so 1 is opposite 6, 3 is opposite to 4, and so on. Now look very closely again at the picture at the start of the post. Look at the dots on the face of the dice. Notice how they're indented. Each hole is the same size, but obviously, the number of holes on each face is different. Let's think of this in terms of weight. Imagine we could weigh each face of the dice. Let's pair up the faces, each side is paired with the face opposite it. Now let's weigh the faces and compare them.

The greatest imbalance in weights is the 1-6 combination. This imbalance is what's causing the bias.

Obviously, the bias is small, but if you roll the die enough times, even a small bias becomes obvious.

# Vegas here I come - or not...

So we know for dice bought in shops that 1 and 6 are ever so slightly more likely to occur than theory suggests. Now you know this, why aren't you booking your flight to Las Vegas? You could spend a week at the craps tables and make a little money.

Not so fast.

Let's look at the dice they use in Vegas.

(Image source: Wikimedia Commons. Author: Alper Atmaca License: Creative Commons.)

Notice that the dots are not indented. They're filled with colored material that's the same density as the rest of the dice. In other words, there's no imbalance, Vegas dice will give a uniform distribution, and 1 and 6 will occur as often as 2, 3, 4, or 5. You're going to have to keep punching the clock.

# Some theory

Things are going to get mathematical from here on in. There won't be any new stories about dice or Vegas.

How did I get the expected count and error bars for each dice score? Let's say I threw the dice $$x$$ times, it seems obvious we would get an expected count of $$\frac{x}{6}$$ for each score, but why? What about the standard error?

Let's re-think the dice as a Bernoulli trial. Let's choose a score, say 1. If we throw the dice and it shows a 1, we consider that a success. If it shows anything else, we consider it a failure. Because we have a Bernoulli trial, we can use the binomial distribution to model the results.

Using Wikipedia's notation:

• $$n$$ is the number of throws
• $$p$$ is the probability of getting a 1, which is $$\frac{1}{6}$$
• $$q = 1- p$$ is the probability of getting 2-6, which is $$\frac{5}{6}$$

So, again using Wikipedia's handy summary, for $$n$$ throws:

• The mean is $$np = 12 \times 26,306 \times \frac{1}{6} = 52,612$$
• The standard deviation is $$\sqrt{npq} = \sqrt{12 \times 26,306 \times \frac{1}{6} \times \frac{5}{6}} = 209.388$$
• The 95% confidence interval is $$52,202$$ to $$53,022$$ (standard deviation by 1.96).

# Publications

Academics live or die by publications and by citations of their publications. Labby's work has rightly been widely cited on the internet. I keep hoping that some academic will be inspired by Labby and use modern robotic technology and image recognition to do huge (million-plus) classical experiments, like tossing coins or selecting balls from an urn. It seems like an easy win to be widely cited!