
Friday, October 10, 2025

Regression to the mean

An unfortunate phrase with unfortunate consequences

"Regression to the mean" is a simple idea that has profound consequences. It's led people astray for decades, if not centuries. I'm going to explain what it is, the consequences of not understanding it, and what you can do to protect yourself and your organization.

Let's give a simple definition for now: it's the tendency, when sampling data, for more extreme values to be followed by values closer to the mean. Here's an example: if I give the same children IQ tests over time, I'll see very high scores followed by more average scores, and some very low scores followed by more average scores. It doesn't mean the children are improving or getting worse; it's just regression to the mean. The problems occur when people attach a deeper meaning, as we'll see.

(Francis Galton, popularizer of "Regression to the mean")

What it means - simple examples

I'm going to start with an easy example that everyone should be familiar with, a simple game with a pack of cards.

  • Take a standard pack of playing cards and label the cards in each suit 1 to 13 (Ace is 1, 2 is 2, Jack is 11, etc.). The mean card value is 7.5. 
  • Draw a card at random. 
  • Imagine it's a Queen (12). Now, replace the card and draw another card. Is it likely the card will have a lower value or a higher value? 
    • The probability is 11/13 that it will have a lower value. 
  • Now imagine you drew an ace (1), replace the card and draw again. 
    • The probability of drawing another ace is 1/13.
    • The probability of drawing a 2 or higher is 12/13. 
It's obvious in this example that "extreme" value cards are very likely to be followed by more "average" value cards. This is regression to the mean at work: nothing complex, just a probability distribution.
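If you'd like to convince yourself, here's a minimal simulation of the card game: draw a card, replace it, draw again, and estimate the chance that a Queen (12) is followed by a lower-value card. This is just a sketch of the game described above, nothing more.

import random

values = list(range(1, 14))   # Ace=1 ... King=13; mean value is 7.5
trials = 200_000
queens = 0
queens_followed_by_lower = 0

for _ in range(trials):
    first = random.choice(values)
    second = random.choice(values)   # the card is replaced, so the draws are independent
    if first == 12:
        queens += 1
        if second < first:
            queens_followed_by_lower += 1

print(queens_followed_by_lower / queens)   # should be close to 11/13, roughly 0.846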

The cards example seems simple and obvious. Playing cards are very familiar and we're comfortable with randomness (in fact, almost all card games rely on randomness). The problem occurs when we have real measurements: we tend to attach explanations to the data when randomness (and regression to the mean) is all that's there.

Let's say we're measuring the average speed of cars on a freeway. Here are 100 measurements of car speeds. What would you conclude about the freeway? What pattern can you see in the data and what does it tell you about driver behavior (e.g. lower speeds following higher speeds and vice versa)? What might cause it? 

[46.7, 63.3, 80.0, 71.7, 34.2, 55.0, 67.5, 34.2, 67.5, 67.5, 59.2, 63.3, 55.0, 34.2, 63.3, 63.3, 63.3, 59.2, 75.8, 71.7, 42.5, 42.5, 34.2, 34.2, 59.2, 67.5, 59.2, 71.7, 71.7, 67.5, 50.8, 63.3, 34.2, 63.3, 30.0, 38.3, 50.8, 34.2, 75.8, 75.8, 46.7, 80.0, 55.0, 46.7, 38.3, 38.3, 75.8, 59.2, 34.2, 42.5, 71.7, 71.7, 80.0, 80.0, 71.7, 34.2, 63.3, 71.7, 46.7, 42.5, 46.7, 46.7, 63.3, 80.0, 80.0, 38.3, 38.3, 46.7, 38.3, 34.2, 46.7, 75.8, 55.0, 30.0, 55.0, 75.8, 30.0, 42.5, 67.5, 30.0, 50.8, 67.5, 67.5, 71.7, 67.5, 67.5, 42.5, 75.8, 75.8, 34.2, 55.0, 50.8, 38.3, 71.7, 46.7, 71.7, 50.8, 71.7, 42.5, 42.5]

Let's imagine the authorities introduced a speed camera at the measurement I've indicated in red. What might you conclude about the effect of the speed camera?

You shouldn't conclude anything at all from this data. It's entirely random. In fact, it has the same probability distribution as the pack of cards example: I've used 13 different average speeds, each with the same probability of occurrence. What you're seeing is the result of me drawing cards from a pack and relabeling them with values like 71.7 instead of 9. The speed camera had no effect in this case. The data set shows regression to the mean and nothing more.
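Here's a sketch of how a dataset like this can be produced: 13 equally likely "speeds", one per card value. The even spacing from 30 to 80 mph is my reconstruction of the mapping, not something taken from the original data-generation code.

import random

# 13 equally likely speeds, one per card value; spacing is my assumption
speeds = [round(30 + (k - 1) * 50 / 12, 1) for k in range(1, 14)]  # 30.0, 34.2, ..., 80.0
samples = [random.choice(speeds) for _ in range(100)]
print(samples)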

The pack of cards and the vehicle speeds are exactly the same example. In the pack of cards case, we understand the randomness and we can intuitively see what regression to the mean actually means. Once we have a real-world problem, like the cars on the freeway, our tendency is to look for explanations that aren't there and to discount randomness. Looking for meaning in random data has had bad consequences, as we'll see.

Schools example

In the last few decades in the US, several states have introduced standardized testing to measure school performance. Students in the same year group take the same test and, based on the results, the state draws conclusions about the relative standing of schools; it may intervene in low performing schools. The question is, how do we measure the success of these interventions? Surely, we would expect to see an improvement in test scores taken the next year? In reality, it's not so simple.

The average test result for a group of students will obviously depend on things like teaching, prior attainment etc. But there are also random factors at work. Individual students might perform better or worse than expected due to sickness, or family issues, or a host of other random issues. Of course, different year groups in the same school might have a different mix of abilities. All of which means that regression to the mean should show up in consecutive tests. In other words, low performing schools might show an improvement and high performing schools might show a degradation entirely due to random factors.

This isn't a theoretical example: regression to the mean has been clearly shown in school scores in Massachusetts, California and in other states (see Haney, Smith & Smith). Sadly, state politicians and civil servants have intervened based on scores and drawn conclusions where they shouldn't.

Children's education evokes a lot of emotion and political interest, which is not a good mix. It's important to understand concepts like regression to the mean so we can better understand what's really going on.

Heights example

"Regression to the mean" was originally called "regression to mediocrity", and was based on the study of human heights. If regression to mediocrity sounds very disturbing, it should do. It's closely tied to eugenics through Francis Galton. I'm not going to dwell on the links between statistics and eugenics here, but you should know the origins of statistics aren't sin free.

In 1880s England, Galton studied the heights of parents and their children. I've reproduced some of his results below. He found that parents who were above average height tended to have children closer to the average height, and that parents below average height also tended to have children closer to the average height. This is the classic regression to the mean example. 

Think for a moment about the possible different outcomes of a study like this. If taller parents had taller children, and shorter parents had shorter children, then we might expect to see two population groups emerging (short people and tall people) and maybe the start of speciation. Conversely, if tall parents had short children, and short parents had tall children, this would be very noticeable and much commented on. Regression to the mean turns out to be a good explanation of what we observe in nature.

Galton's height study was very influential for both the study of genetics and the creation of statistics as a discipline.

New sports players

Let's take a cohort of baseball players in their first season. Obviously, talent makes a difference, but there are random factors at play too. We might expect some players to do extremely well, others to do well, some to do OK, some to do poorly, and some to do very poorly. Regression to the mean tells us that some standout players may well perform worse the next year. Other, lower-ranked players will perform better for the same reason. The phenomenon of outstanding new players performing worse in their second year is often called the "sophomore slump" and a lot has been written about it, but in reality, it can mostly be explained by regression to the mean.


Business books

Popular business books often fall into the regression to the mean trap. Here's what happens. A couple of authors analyze top-performing businesses, usually measured by stock price, and find some commonalities. They develop these commonalities into a framework and write a best-selling business book whose thesis is that if you follow the framework, you'll be successful. They follow this with another book that's not quite as good. Then they write a third book that only the true believers read.

Unfortunately, the companies they selected as winners don't do as well over a decade or more, and the longer the timescale, the worse the performance. Over the long run, the authors' promise that they've found the elixir of success turns out to be untrue. Their books go from the best-seller list to the remainder bucket.

A company's stock price is determined by many factors, for example, its competitors, the state of the market, and so on. Only some of them are under the control of the company. Conditions change over time in unpredictable ways. Regression to the mean suggests that great stock price performers now might not be in the future, and low performers may do better. Regression to the mean neatly explains why picking winners today does not mean the same companies will be winners in the years to come. In other words, basic statistics makes a mockery of many business books.

Reading more:

  • The Halo Effect: . . . and the Eight Other Business Delusions That Deceive Managers - Phil Rosenzweig 

My experience

I've seen regression to the mean pop up in all kinds of business data sets and I've seen people make the classic mistake of trying to derive meaning from randomness. Here are some examples.

Sales data has a lot of random fluctuation, and of course, the smaller the sample, the greater the fluctuations. I've seen salespeople have a standout year followed by a very average year and vice versa. I've seen the same pattern at the regional and country level too. Unfortunately, I've also seen analysts tie themselves in knots trying to explain these patterns. Even worse, they've made foolish predictions based on small sample sets and just a few years' worth of data.

I've seen very educated people get very excited by changes in company assessment data. They thought they'd spotted something significant because companies that performed well one year tended to perform a bit worse the next year, and so on. Regression to the mean explained all of the data.

How not to be fooled

Regression to the mean is hidden in lots of data sets and can lead you into making poor decisions. If you're analyzing a dataset, here are some questions to ask:

  • Is your data the result of some kind of sampling process? 
  • Does randomness play a part in your collection process or in the data?
  • Are there unknowns that might influence your data?

If the answer to any of these questions is yes, you should assume you'll find regression to the mean in your dataset. Be careful about your analysis and especially careful about explaining trends. Of course, the smaller your data set, the more vulnerable you are.

You can estimate the effect of regression to the mean on your data using a variety of methods. I'm not going to go into them in depth here. In the literature, you'll see references to running a randomized controlled trial (RCT), also known as an A/B test. That's great in theory, but in reality it's not appropriate for most business situations. In practice, you'll have to run simulations or do some straightforward estimation of the fractional regression to the mean.
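Here's a minimal simulation sketch of that kind of estimation, under a simple assumed model: each unit (salesperson, school, region) has a fixed true level plus independent noise each period. We pick the top 10% in period 1 and see how much of their apparent advantage survives into period 2. The model and the numbers are mine, purely for illustration.

import random

random.seed(42)
n = 10_000
true_level = [random.gauss(100, 10) for _ in range(n)]
period1 = [t + random.gauss(0, 10) for t in true_level]   # observed score = level + noise
period2 = [t + random.gauss(0, 10) for t in true_level]

cutoff = sorted(period1)[int(0.9 * n)]                    # top 10% in period 1
top = [i for i in range(n) if period1[i] >= cutoff]

mean_all_1 = sum(period1) / n
mean_top_1 = sum(period1[i] for i in top) / len(top)
mean_top_2 = sum(period2[i] for i in top) / len(top)

# fraction of the period-1 advantage that persists into period 2;
# roughly 0.5 here because half the observed variance is noise
print((mean_top_2 - mean_all_1) / (mean_top_1 - mean_all_1))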

Monday, March 10, 2025

Everything you wanted to know about the normal distribution but were afraid to ask

Normal is all around you, and so is not-normal

The normal distribution is the most important statistical distribution. In this blog post, I'm going to talk about its properties, where it occurs, and why it's so very important. I'm also going to talk about how using the normal distribution when you shouldn't can lead to disaster and what you can do about it.

(Ainali, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons)

A rose by any other name

The normal distribution has a number of different names in different disciplines:

  • Normal distribution. This is the name used by statisticians and data scientists.
  • Gaussian distribution. This is what physicists call it.
  • The bell curve. The name used by social scientists and by people who don't understand statistics.

I'm going to call it the normal distribution in this blog post, and I'd advise you to call it this too. Even if you're not a data scientist, using the most appropriate name helps with communication.

What it is and what it looks like

When we're measuring things in the real world, we see different values. For example, if we measure the heights of 10 year old boys in a town, we'd see some tall boys, some short boys, and most boys around the "average" height. We can work out what fraction of boys are a certain height and plot a chart of frequency on the y axis and height on the x axis. This gives us a probability or frequency distribution. There are many, many different types of probability distribution, but the normal distribution is the most important.

(As an aside, you may remember making histograms at school. These are "sort-of" probability distributions. For example, you might have recorded the height of all the children in a class, grouped them into height ranges, counted the number of children in each height range, and plotted the chart. The y axis would have been a count of how many children in that height range. To turn this into a probability distribution, the y axis would become the fraction of all children in that height range. )

Here's what a normal probability distribution looks like. Yes, it's the classic bell curve shape which is exactly symmetrical.


The formula describing the curve is quite complex, but all you need to know for now is that it's described by two numbers: the mean (often written \(\mu\)) and a standard deviation (often written \(\sigma\)). The mean tells you where the peak is and the standard deviation gives you a measure of the width of the curve. 

To greatly summarize: values near the mean are the most likely to occur and the further you go from the mean, the less likely they are. This lines up with our boys' heights example: there aren't many very short or very tall boys and most boys are around the mean height.

Obviously, if you change the mean or the standard deviation, you change the curve, for example, you can change the location of the mean or you can make the curve wider or narrower. It turns out that changing the mean and standard deviation just scales the curve because of its underlying mathematical properties. Most distributions don't behave like this; changing parameters can greatly change the entire shape of the distribution (for example, the beta distribution wildly changes shape if you change its parameters). The normal scaling property has some profound consequences, but for now, I'll just focus on one: we can map any normal distribution onto the single standard normal distribution. Because the properties of the standard normal are known, we can easily do math on it. To put it another way, it greatly simplifies the calculations we need to do.
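Here's a small illustration of that scaling property: any normal can be mapped onto the standard normal with \(z = (x - \mu) / \sigma\), so one set of tables or functions covers them all. The height numbers are made up for the example.

from scipy.stats import norm

mu, sigma = 140.0, 5.0                   # illustrative: boys' heights in cm
x = 150.0
z = (x - mu) / sigma                     # z = 2.0: two standard deviations above the mean

# both calculations give the same probability of seeing a value below x
print(norm.cdf(x, mu, sigma))            # ~0.977, using the original distribution
print(norm.cdf(z))                       # ~0.977, using the standard normal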

Why the normal distribution is so important

Here are some normal distribution examples from the real world.

Let's say you're producing precision bolts. You need to supply 1,000 bolts of a precise specification to a customer. Your production process has some variability. How many bolts do you need to manufacture to get 1,000 good ones? If you can describe the variability using a normal distribution (which is the case for many manufacturing processes), you can work out how many you need to produce.
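Here's a sketch of the bolt calculation, with made-up numbers: the process produces bolt diameters that are normally distributed, and only bolts inside the spec limits are good. The mean, standard deviation, and spec limits below are illustrative assumptions, not figures from the post.

from scipy.stats import norm
import math

mu, sigma = 10.00, 0.02        # process mean and standard deviation in mm (illustrative)
lower, upper = 9.97, 10.03     # customer's spec limits (illustrative)

fraction_good = norm.cdf(upper, mu, sigma) - norm.cdf(lower, mu, sigma)
needed = math.ceil(1000 / fraction_good)
print(fraction_good, needed)   # ~0.866 of bolts are good, so manufacture ~1,155 bolts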

Imagine you're outfitting an army and you're buying boots. You want to buy the minimum number of boots while still fitting everyone. You know that many body dimensions follow the normal distribution (most famously, chest circumference), so you can make a good estimate of how many boots of different sizes to buy.

Finally, let's say you've bought some random stocks. What might the daily change in value be? Under usual conditions, the change in value follows a normal distribution, so you can estimate what your portfolio might be worth tomorrow.

It's not just these three examples, many phenomena in different disciplines are well described by the normal distribution.

The normal distribution is also common because of something called the central limit theorem (CLT). Let's say I'm taking measurement samples from a population, e.g. measuring the speed of cars on a freeway. The CLT says that the distribution of the sample means will follow a normal distribution regardless of the underlying distribution. In the car speed example, I don't know how the speeds are distributed, but I can calculate a mean and know how certain I am that the sample mean is close to the true (population) mean. This sounds a bit abstract, but it has profound consequences in statistics and means that the normal distribution comes up time and time again.

Finally, it's important because it's so well-known. The math to describe and use the normal distribution has been known for centuries. It's been written about in hundreds of textbooks in different languages. More importantly, it's very widely taught; almost all numerate degrees will cover it and how to use it. 

Let's summarize why it's important:

  • It comes up in nature, in finance, in manufacturing etc.
  • It comes up because of the CLT.
  • The math to use it is standardized and well-known.

What useful things can I do with the normal distribution?

Let's take an example from the insurance world. Imagine an insurance company insures house contents and cars. Now imagine the claim distribution for cars follows a normal distribution and the claims distribution for house contents also follows a normal distribution. Let's say in a typical year the claims distributions look something like this (cars on the left, houses on the right).

(The two charts look identical except for the numbers on the x and y axis. That's expected. I said before that all normal distributions are just scaled versions of the standard normal. Another way of saying this is, all normal distribution plots look the same.)

What does the distribution look like for cars plus houses?

The long-winded answer is to use convolution (or even Monte Carlo). But because the house and car distributions are normal, we can just do:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

So we can calculate the combined distribution in a heartbeat. The combined distribution looks like this (another normal distribution, just with a different mean and standard deviation).

To be clear: this only works because the two distributions were normal.
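Here's a quick check of the shortcut: add the means, add the variances, and compare with a brute-force Monte Carlo sum of the two claim distributions. The claim figures are illustrative numbers I've made up, not the values from the charts.

import numpy as np

rng = np.random.default_rng(0)
mu_cars, sigma_cars = 2_000_000, 250_000        # annual car claims (illustrative)
mu_houses, sigma_houses = 5_000_000, 600_000    # annual house claims (illustrative)

combined = rng.normal(mu_cars, sigma_cars, 1_000_000) + \
           rng.normal(mu_houses, sigma_houses, 1_000_000)

print(combined.mean(), mu_cars + mu_houses)                        # means agree
print(combined.std(), (sigma_cars**2 + sigma_houses**2) ** 0.5)    # standard deviations agree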

It's not just adding distributions together. The normal distribution allows for shortcuts if we're multiplying or dividing etc. The normal distribution makes things that would otherwise be hard very fast and very easy.

Some properties of the normal distribution

I'm not going to dig into the math here, but I am going to point out a few things about the distribution you should be aware of.

The "standard normal" distribution goes from \(-\infty\) to \(+\infty\). The further away you get from the mean, the lower the probability, and once you go several standard deviations away, the probability is quite small, but never-the-less, it's still present. Of course, you can't show \(\infty\) on a chart, so most people cut off the x-axis at some convenient point. This might give the misleading impression that there's an upper or lower x-value; there isn't. If your data has upper or lower cut-off values, be very careful modeling it using a normal distribution. In this case, you should investigate other distributions like the truncated normal.

The normal distribution models continuous variables, e.g. variables like speed or height that can have any number of decimal places (but see my previous paragraph on \(\infty\)). However, it's often used to model discrete variables (e.g. number of sheep, number of runs scored, etc.). In practice, this is mostly OK, but again, I suggest caution.

Abuses of the normal distribution and what you can do

Because it's so widely known and so simple to use, people have used it where they really shouldn't. There's a temptation to assume the normal when you really don't know what the underlying distribution is. That can lead to disaster.

In the financial markets, people have used the normal distribution to predict day-to-day variability. The normal distribution predicts that large changes will occur with very low probability; when they do occur, they're often called "black swan events". However, if the distribution isn't normal, "black swan events" can occur far more frequently than the normal distribution would predict. The reality is, financial market distributions are often not normal. This creates opportunities and risks. The assumption of normality has led to bankruptcies.

Assuming normality can lead to models making weird or impossible predictions. Let's say I assume the number of units sold for a product is normally distributed. Using previous years' sales, I forecast unit sales next year to be 1,000 units with a standard deviation of 500 units. I then create a Monte Carlo model to forecast next year's profits. Can you see what can go wrong here? Monte Carlo modeling uses random numbers. In the sales forecast example, there's a 2.28% chance the model will select a negative sales number, which is clearly impossible. Given that Monte Carlo models often use tens of thousands of simulations, it's extremely likely the final calculation will have been affected by impossible numbers. This kind of mistake is insidious and hard to spot, and even experienced analysts make it.
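You can reproduce the problem in a couple of lines: a normal forecast with mean 1,000 and standard deviation 500 generates impossible negative sales about 2.28% of the time, and a Monte Carlo model built on it will hit those values.

import numpy as np
from scipy.stats import norm

print(norm.cdf(0, loc=1000, scale=500))        # ~0.0228: the theoretical probability of negative sales

rng = np.random.default_rng(1)
simulated_sales = rng.normal(1000, 500, 50_000)
print((simulated_sales < 0).mean())            # the Monte Carlo model hits negative sales at about the same rate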

If you're a manager, you need to understand how your team has modeled data. 

  • Ask what distributions they've used to model their data. 
  • Ask them why they've used that distribution and what evidence they have that the data really is distributed that way. 
  • Ask them how they're going to check their assumptions. 
  • Most importantly, ask them if they have any detection mechanism in place to check for deviation from their expected distribution.

History - where the normal came from

Rather unsatisfactorily, there's no clear "Eureka!" moment for the discovery of the distribution, it seems to have been the accumulation of the work of a number of mathematicians. Abraham de Moivre  kicked off the process in 1733 but didn't formalize the distribution, leaving Gauss to explicitly describe it in 1801 [https://medium.com/@will.a.sundstrom/the-origins-of-the-normal-distribution-f64e1575de29].

Gauss used the normal distribution to model measurement errors and so predict the path of the asteroid Ceres [https://en.wikipedia.org/wiki/Normal_distribution#History]. This sounds a bit esoteric, but there's a point here that's still relevant. Any measurement-taking process involves some form of error. Assuming no systematic bias, these errors are well-modeled by the normal distribution. So any unbiased measurement taken today (e.g. opinion polling, measurements of particle mass, measurement of precision bolts, etc.) uses the normal distribution to calculate uncertainty.

In 1810, Laplace placed the normal distribution at the center of statistics by formulating the Central Limit Theorem. 

The math

The probability distribution function is given by:

\[f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e ^ {-\frac{1}{2} ( \frac{x - \mu}{\sigma}) ^ 2  }\]

\(\sigma\) is the standard deviation and \(\mu\) is the mean. In the normal distribution, the mean is the same as the mode is the same as the median.

This formula is almost impossible to work with directly, but you don't need to. There are extensive libraries that will do all the calculations for you.
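As an example of leaning on a library instead of the formula, scipy.stats.norm gives you the pdf, the cdf, and random samples for any mean and standard deviation. The numbers below are purely illustrative.

from scipy.stats import norm

mu, sigma = 7.5, 3.0                                 # illustrative values

print(norm.pdf(7.5, mu, sigma))                      # height of the curve at the mean
print(norm.cdf(10.5, mu, sigma))                     # probability of a value below mu + sigma, ~0.841
print(norm.rvs(mu, sigma, size=5, random_state=0))   # five random draws from this distribution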

Adding normally distributed parameters is easy:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

Wikipedia has an article on how to combine normally distributed quantities, e.g. addition, multiplication etc. see  https://en.wikipedia.org/wiki/Propagation_of_uncertainty.

Thursday, February 13, 2025

Why assuming independence can be very, very bad for business

Independence in probability

Why should I care about independence?

Many models in the finance industry and elsewhere assume events are independent. When this assumption fails, catastrophic losses can occur, as we saw in 2008 and 1992. The problem is, developers and data scientists assume independence because it greatly simplifies problems, but the executive team often don't know this has happened, or even worse, don't understand what it means. As a result, the company ends up being badly caught out when circumstances change and independence no longer applies.

(Sergio Boscaino from Busseto, Italy, CC BY 2.0 , via Wikimedia Commons)

In this post, I'm going to explain what independence is, why people assume it, and how it can go spectacularly wrong. I'll provide some guidance for managers so they know the right questions to ask to avoid disaster. I've pushed the math to the end, so if math isn't your thing, you can leave early and still get the benefit.

What is independence?

Two events are independent if the outcome of one doesn't affect the other in any way. For example, if I throw two dice, the probability of me throwing a six on the second die isn't affected in any way by what I throw on the first die. 

Here are some examples of independent events:

  • Throwing a coin and getting a head, throwing a die and getting a two.
  • Drawing a king from a deck of cards, winning the lottery having bought a ticket.
  • Stopping at at least one red light on my way to the store, rain falling two months from now.
By contrast, some events are not independent (they're dependent):
  • Raining today and raining tomorrow. Rain today increases the chances of rain tomorrow.
  • Heavy snow today and a football match being played. Heavy snow will cause the match to be postponed.
  • Drawing a king from a deck of cards, then without replacing the card, drawing a king on the second draw.

Why people assume independence

People assume independence because the math is a lot, lot simpler. If two events are dependent, the analyst has to figure out the relationship between them, something that can be very challenging and time consuming to do. Other than knowing there's a relationship, the analyst might have no idea what it is and there may be no literature to guide them. If you have no idea what the relationship is, it's easier to assume there's none.

Sometimes, analysts assume independence because they don't know any better. If they're not savvy about probability theory, they may do a simple internet search on combining probabilities that suggests all they have to do is multiply probabilities, and they assume they can use this process all the time. This is assuming independence through ignorance. I believe people make this mistake in practice because I've interviewed candidates with MS degrees in statistics who've made this kind of blunder.

Money and fear can also drive the choice to assume independence. Imagine you're an analyst. Your manager is on your back to deliver a model as soon as possible. If you assume independence, your model will be available on time and you'll get your bonus, if you don't, you won't hit your deadline and you won't get your bonus. Now imagine the bad consequences of assuming independence won't be visible for a while. What would you do?

Harder examples

Do you think the following are independent?

  • Two unrelated people in different towns defaulting on their mortgage at the same time
  • Houses in different towns suffering catastrophic damage (e.g. fire, flood, etc.)

Most of the time, these events will be independent. For example, a house burning down because of poor wiring doesn't tell you anything about the risk of a house burning down in a different town (assuming a different electrician!). But there are circumstances when the independence assumption fails:

  • A hurricane hits multiple towns at once causing widespread catastrophic damage in different insurance categories (e.g. hurricane Andrew in 1992).
  • A recession hits, causing widespread lay-offs and mortgage defaults, especially for sub-prime mortgages (2008).

Why independence fails

Prior to 1992, the insurance industry had relatively simple risk models. They assumed independence; an assumption that worked well for some time. In an average year, they knew roughly how many claims there would be for houses, cars etc. Car insurance claims were independent of house insurance claims that in turn were independent of municipal and corporate insurance claims and so on. They were independent, at least in part, because there was no common causal mechanism.

When Hurricane Andrew hit Florida in 1992, it destroyed houses, cars, schools, hospitals, etc. across multiple towns. The assumption of independence just wasn't true in this case. The insurance claims were sky high and bankrupted several companies. 

(Hurricane Andrew, houses destroyed in Dade County, Miami. Image from FEMA. Source: https://commons.wikimedia.org/wiki/File:Hurricane_andrew_fema_2563.jpg)

To put it simply, the insurance computer models didn't adequately model the risk because they had independence baked in.  

Roll forward 15 years and something similar happened in the financial markets. Sub-prime mortgage lending was built on a set of assumptions, including default rates. The assumption was that mortgage defaults were independent of one another. Unfortunately, as the 2008 financial crisis hit, this was no longer valid. As more people were laid off, the economy went down, so more people were laid off. This was often called contagion, but perhaps a better analogy is the reverse of a well-known saying: "a rising tide floats all boats".


Financial Crisis Newspaper
(Image credit: Secret London 123, CC BY-SA 2.0, via Wikimedia Commons)

The assumption of independence simplified the analysis of sub-prime mortgages and gave the results that people wanted. The incentives weren't there to price in risk. Imagine your company was making money hand over fist and you stood up and warned people of the risks of assuming independence. Would you put your bonus and your future on the line to do so?

What to do - recommendations

Let's live in the real world and accept that assuming independence gets us to results that are usable by others quickly.

If you're a developer or a data scientist, you must understand the consequences of assuming independence and you must recognize that you're making that assumption.  You must also make it clear what you've done to your management.

If you're a manager, you must be aware that assuming independence can be dangerous but that it gets results quickly. You need to ask your development team about the assumptions they're making and when those assumptions fail. It also means accepting your role as a risk manager; that means investing in development to remove the independence assumption.

To get results quickly, it may well be necessary for an analyst to assume independence.  Once they've built the initial model (a proof of concept) and the money is coming in, then the task is to remove the independence assumption piece-by-piece. The mistake is to stop development.

The math

Let's say we have two events, A and B, with probabilities of occurring P(A) and P(B). 

If the events are independent, then the probability of them both occurring is:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]

This equation serves as both a definition of independence and test of independence as we'll see next.

Let's take two cases and see if they're independent:

  1. Rolling a die and getting a 1 and a 2
  2. Rolling a die and getting a (1 or 2) and (2, 4, or 6)

For case 1, here are the probabilities:
  • \(P(A) = 1/6\)
  • \(P(B) = 1/6\)
  • \(P(A \cap B) = 0\); it's not possible to get 1 and 2 at the same time
  • \(P(A)P(B) = (1/6) \times (1/6) = 1/36\)
So the equation \(P(A \ and \ B) = P(A \cap B) = P(A)P(B)\) isn't true, therefore the events are not independent.

For case 2, here are the probabilities:
  • \(P(A) = 1/3\)
  • \(P(B) = 1/2\)
  • \(P(A \cap B) = 1/6\)
  • \(P(A)P(B) = (1/3) \times (1/2) = 1/6\)
So the equation is true, therefore the events are independent.
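If you'd rather not do the arithmetic by hand, here's a brute-force check of the two cases: enumerate the six faces of a single roll and compare P(A and B) with P(A)P(B).

from fractions import Fraction

faces = range(1, 7)

def prob(event):
    hits = sum(1 for f in faces if event(f))
    return Fraction(hits, 6)

# Case 1: A = "roll a 1", B = "roll a 2", on the same single roll
A1 = lambda f: f == 1
B1 = lambda f: f == 2
p_both1 = prob(lambda f: A1(f) and B1(f))
print(p_both1, prob(A1) * prob(B1), p_both1 == prob(A1) * prob(B1))   # 0 vs 1/36: not independent

# Case 2: A = "roll a 1 or 2", B = "roll a 2, 4, or 6"
A2 = lambda f: f in (1, 2)
B2 = lambda f: f in (2, 4, 6)
p_both2 = prob(lambda f: A2(f) and B2(f))
print(p_both2, prob(A2) * prob(B2), p_both2 == prob(A2) * prob(B2))   # 1/6 == 1/6: independent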

Dependence uses conditional probability, so we have this kind of relationship:
\[P(A \ and \ B) = P(A  \cap B) = P(A | B) P(B)\]
The expression \(P(A | B)\) means the probability of A given that B has occurred (e.g. the probability the game is canceled given that it's snowed). There are a number of ways to approach finding \(P(A | B)\); the most popular over the last few years has been Bayes' Theorem, which states:
\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]
There's a whole methodology that goes with the Bayesian approach and I'm not going to go into it here, except to say that it's often iterative; we make an initial guess and progressively refine it in the light of new evidence. The bottom line is, this process is much, much harder and much more expensive than assuming independence. 

Monday, July 31, 2023

Essential business knowledge: the Central Limit Theorem

Knowing the Central Limit Theorem means avoiding costly mistakes

I've spoken to well-meaning analysts who've made significant mistakes because they don't understand the implications of one of the core principles of statistics: the Central Limit Theorem (CLT). These errors weren't trivial either; they affected salesperson compensation and the analysis of A/B tests. More personally, I've interviewed experienced candidates who made fundamental blunders because they didn't understand what this theorem implies.

The CLT is why the mean and standard deviation work pretty much all the time but it's also why they only work when the sample size is "big enough". It's why when you're estimating the population mean it's important to have as large a sample size as you can. It's why we use the Student's t-test for small sample sizes and why other tests are appropriate for large sample sizes. 

In this blog post, I'm going to explain what the CLT is, some of the theory behind it (at a simple level), and how it drives key business statistics. Because I'm trying to communicate some fundamental ideas, I'm going to be imprecise in my language at first and add more precision as I develop the core ideas. As a bonus, I'll throw in a different version of the CLT that has some lesser-known consequences.

How we use a few numbers to represent a lot of numbers

In all areas of life, we use one or two numbers to represent lots of numbers. For example, we talk about the average value of sales, the average number of goals scored per match, average salaries, average life expectancy, and so on. Usually, but not always, we get these numbers through some form of sampling, for example, we might run a salary survey asking thousands of people their salary, and from that data work out a mean (a sampling mean). Technically, the average is something mathematicians call a "measure of central tendency" which we'll come back to later.

We know not everyone will earn the mean salary and that in reality, salaries are spread out. We express the spread of data using the standard deviation. More technically, we use something called a confidence interval which is based on the standard deviation. The standard deviation (or confidence interval) is a measure of how close we think our sampling mean is to the true (population) mean (how confident we are).

In practice, we use standard formulas for the mean and standard deviation. These are available as standard functions in spreadsheets and programming languages. Mathematically, this is how they're expressed.

\[sample\; mean\; \bar{x}= \frac{1}{N}\sum_{i=1}^{N}x_i\]

\[sample\; standard\; deviation\; s_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} {\left ( x_i - \bar{x} \right )} ^ 2 } \]

All of this seems like standard stuff, but there's a reason why it's standard, and that's the central limit theorem (CLT).

The CLT

Let's look at three different data sets with different distributions: uniform, Poisson, and power law as shown in the charts below.

These data sets are very, very different. Surely we have to have different averaging and standard deviation processes for different distributions? Because of the CLT, the answer is no. 

In the real world, we sample from populations and take an average (for example, using a salary survey), so let's do that here. To get going, let's take 100 samples from each distribution and work out a sample mean. We'll do this 10,000 times so we get some kind of estimate for how spread out our sample means are.

The top charts show the original population distribution and the bottom charts show the result of this sampling means process. What do you notice?

The distribution of the sampling means is a normal distribution regardless of the underlying distribution.

This is a very, very simplified version of the CLT and it has some profound consequences, the most important of which is that we can use the same averaging and standard deviation functions all the time.
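Here's a sketch of the experiment described above: take 100-item samples from three very different distributions, compute each sample's mean, repeat 10,000 times, and look at how the sample means are distributed. Using a Pareto distribution as the power-law example is my choice for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_samples, sample_size = 10_000, 100

populations = {
    "uniform":   lambda size: rng.uniform(0, 1, size),
    "poisson":   lambda size: rng.poisson(3, size),
    "power law": lambda size: rng.pareto(3, size),
}

for name, draw in populations.items():
    means = [draw(sample_size).mean() for _ in range(n_samples)]
    # a histogram of `means` is close to a normal distribution for every population
    print(name, np.mean(means), np.std(means))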

Some gentle theory

Proving the CLT is very advanced and I'm not going to do that here. I am going to show you through some charts what happens as we increase the sample size.

Imagine I start with a uniform random distribution like the one below. 

I want to know the mean value, so I take some samples and work out a mean for my samples. I do this lots of times and work out a distribution for my mean. Here's what the results look like for a sample size of 2, 3,...10,...20,...30,...40. 

As the sample size gets bigger, the distribution of the means gets closer to a normal distribution. It's important to note that the width of the curve gets narrower with increasing sample size. Once the distribution is "close enough" to the normal distribution (typically, around a sample size of 30), you can use normal distribution methods like the mean and standard deviation.

The standard deviation is a measure of the width of the normal distribution. For small sample sizes, the standard deviation underestimates the width of the distribution, which has important consequences.

Of course, you can do this with almost any underlying distribution; I'm just using a uniform distribution because it's easier to show the results.

Implications for averages

The charts above show how the distribution of the means changes with sample size. At low sample sizes, there are a lot more "extreme" values as the difference between the sample sizes of 2 and 40 shows.  Bear in mind, the width of the distribution is an estimate of the uncertainty in our measurement of the mean.

For small sample sizes, the mean is a poor estimator of the "average" value; it's extremely prone to outliers as the shape of the charts above indicates. There are two choices to fix the problem: either increase the sample size to about 30 or more (which often isn't possible) or use the median instead (the median is much less prone to outliers, but it's harder to calculate).

The standard deviation (and the related confidence interval) is a measure of the uncertainty in the mean value. Once again, it's sensitive to outliers. For small sample sizes, the standard deviation is a poor estimator for the width of the distribution. There are two choices to fix the problem, either increase the sample size to 30 or more (which often isn't possible) or use quartiles instead (for example, the interquartile range, IQR).

If this sounds theoretical, let me bring things down to earth with an example. Imagine you're evaluating salesperson performance based on deals closed in a quarter. In B2B sales, it's rare for a rep to make 30 sales in a quarter; in fact, even half that number might be an outstanding achievement. With a small number of samples, the distribution is very much not normal, and as we've seen in the charts above, it's prone to outliers. So an analysis based on the mean with a standard deviation isn't a good idea; sales data is notorious for outliers. A much better analysis is the median and IQR. This very much matters if you're using the analysis to compare rep performance.
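Here's a small illustration of why the median and IQR are safer for small, outlier-prone samples: one rep with a single huge deal drags the mean and standard deviation around, but barely moves the median and IQR. The deal values are made up.

import numpy as np

deals = np.array([12_000, 15_000, 9_000, 11_000, 14_000, 250_000])   # one outlier deal

print(deals.mean(), deals.std(ddof=1))     # mean ~51,800: dominated by the outlier
print(np.median(deals))                    # median ~13,000: a better "typical" deal size
q1, q3 = np.percentile(deals, [25, 75])
print(q3 - q1)                             # interquartile range: a robust measure of spread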

Implications for statistical tests

A hundred years ago, there were very few large-scale tests, for example, medical tests typically involved small numbers of people. As I showed above, for small sample sizes the CLT doesn't apply. That's why Gosset developed the Student's t-distribution: the sample sizes were too small for the CLT to kick in, so he needed a rigorous analysis procedure to account for the wider-than-normal distributions. The point is, the Student's t-distribution applies when sample sizes are below about 30.

Roll forward 100 years and we're now doing retail A/B testing with tens of thousands of samples or more. In large-scale A/B tests, the z-test is a more appropriate test. Let me put this bluntly: why would you use a test specifically designed for small sample sizes when you have tens of thousands of samples?

It's not exactly wrong to use the Student's t-test for large sample sizes, it's just dumb. The special features of the Student's t-test that enable it to work with small sample sizes become irrelevant. It's a bit like using a spanner as a hammer; if you were paying someone to do construction work on your house and they were using the wrong tool for something simple, would you trust them with something complex?

I've asked about statistical tests at interview and I've been surprised at the response. Many candidates have immediately said Student's t as a knee-jerk response (which is forgivable). Many candidates didn't even know why Student's t was developed and its limitations (not forgivable for senior analytical roles). One or two even insisted that Student's t would still be a good choice even for sample sizes into the hundreds of thousands. It's very hard to progress candidates who insist on using the wrong approach even after it's been pointed out to them.

As a practical matter, you need to know what statistical tools you have available and their limitations.

Implications for sample sizes

I've blithely said that the CLT applies above a sample size of 30. For "most" distributions, a sample size of about 30 is a reasonable rule-of-thumb, but there's no theory behind it. There are cases where a sample size of 30 is insufficient. 

At the time of writing, there's a discussion on the internet about precisely this point. There's a popular article on LessWrong that illustrates how quickly convergence to the normal can happen: https://www.lesswrong.com/posts/YM6Qgiz9RT7EmeFpp/how-long-does-it-take-to-become-gaussian but there's also a counter article that talks about cases where convergence can take much longer: https://two-wrongs.com/it-takes-long-to-become-gaussian

The takeaway from this discussion is straightforward. Most of the time, using a sample size of 30 is good enough for the CLT to kick-in, but occasionally you need larger sample sizes. A good way to test this is to use larger sample sizes and see if there's any trend in the data. 

General implications

The CLT is a double-edged sword: it enables us to use the same averaging processes regardless of the underlying distribution, but it also lulls us into a false sense of security and analysts have made blunders as a result.

Any data that's been through an averaging process will tend to follow a normal distribution. For example, if you were analyzing average school test scores you should expect them to follow a normal distribution, similarly for transaction values by retail stores, and so on. I've seen data scientists claim brilliant data insights by announcing their data is normally distributed, but they got it through an averaging process, so of course it was normally distributed. 

The CLT is one of the reasons why the normal distribution is so prevalent, but it's not the only reason and of course, not all data is normally distributed. I've seen junior analysts make mistakes because they've assumed their data is normally distributed when it wasn't. 

A little more rigor

I've been deliberately loose in my description of the CLT so far so I can explain the general idea. Let's get more rigorous so we can dig into this a bit more. Let's deal with some terminology first.

Central tendency

In statistics, there's something called a "central tendency" which is a measurement that summarizes a set of data by giving a middle or central value. This central value is often called the average. More formally, there are three common measures of central tendency:

  • The mode. This is the value that occurs most often.
  • The median. Rank order the data and this is the middle value.
  • The mean. Sum up all the data and divide by the number of values.

These three measures of central tendency have different properties, different advantages, and different disadvantages. As an analyst, you should know what they are.

(Depending on where you were educated, there might be some language issues here. My American friends tell me that in the US, the term "average" is always a synonym for the mean, in Britain, the term "average" can be the mean, median, or mode but is most often the mean.)

For symmetrical distributions, like the normal distribution, the mean, median, and mode are the same, but that's not the case for non-symmetrical distributions. 

The term "central" in the central limit theorem is referring to the central or "average" value.

iid

If you were taught about the Central Limit Theorem, you were probably taught that it only applies to iid data, which means independent and identically distributed. Here's what iid means. 

  • Each sample in the data is independent of the other samples. This means selecting or removing a sample does not affect the value of another sample.
  • All the samples come from the same probability distribution.
Strictly speaking, the "identically distributed" part can be relaxed: versions of the CLT apply even if the samples come from different distributions (subject to some technical conditions). However, the independence requirement still holds.

When the CLT doesn't apply

Fortunately for us, the CLT applies to almost all distributions an analyst might come across, but there are exceptions. The underlying distribution must have a finite variance, which rules out using it with distributions like the Cauchy distribution. The samples must be iid as I said before.

A re-statement of the CLT

Given data that's distributed with a finite variance and is iid, if we take n samples, then:

  • as \( n \to \infty \), the sample mean converges to the population mean
  • as \( n \to \infty \), the distribution of the sample means approximates a normal distribution

Note this formulation is in terms of the mean. This version of the CLT also applies to sums because the mean is just the sum divided by a constant (the number of samples).

A different version of the CLT

There's another version of the CLT that's not well-known but does come up from time to time in more advanced analysis. The usual version of the CLT is expressed in terms of means (which is the sum divided by a constant). If instead of taking the sum of the samples, we take their product, then instead of the products tending to a normal distribution they tend to a log-normal distribution. In other words, where we have a quantity created from the product of samples then we should expect it to follow a log-normal distribution. 

What should I take away from all this?

Because of the CLT, the mean and standard deviation mostly work regardless of the underlying distribution. In other words, you don't have to know how your data is distributed to do basic analysis on it. BUT the CLT only kicks in above a certain sample size (which can vary with the underlying distribution but is usually around 30) and there are cases when it doesn't apply. 

You should know what to do when you have a small sample size and know what to watch out for when you're relying on the CLT.

You should also understand that any process that sums (or multiplies) data will lead to a normal (or log-normal) distribution.

Monday, November 28, 2022

Is this coin biased?

Tossing and turning

A few months ago, someone commented on one of my blog posts and asked how you work out if a coin is biased or not. I've been thinking about the problem since then. It's not a difficult one, but it does bring up some core notions in probability theory and statistics which are very relevant to understanding how A/B testing works, or indeed any kind of statistical test. I'm going to talk you through how you figure out if a coin is biased, including an explanation of some of the basic ideas of statistical tests.

The trial

A single coin toss is an example of something called a Bernoulli trial, which is any kind of binary decision you can express as a success or failure (e.g. heads or tails). For some reason, most probability texts refer to heads as a success. 

We can work out what the probability is of getting different numbers of heads from a number of tosses, or more formally, what's the probability \(P(k)\) of getting \(k\) heads from \(n\) tosses, where \(0 \leq k \leq n\)? By hand, we can do it for a few tosses (three in this case):

 Number of heads (k)   Combinations     Count   Probability
 0                     TTT              1       1/8
 1                     HTT, THT, TTH    3       3/8
 2                     THH, HTH, HHT    3       3/8
 3                     HHH              1       1/8

But what about 1,000 or 1,000,000 tosses - we can't do this many by hand, so what can we do? As you might expect, there's a formula you can use: 
\[P(k) = \frac{n!} {k!(n-k)!} p^k (1-p)^{n-k}\]
\(p\) is the probability of success in any trial, for example, getting a head. For an unbiased coin \(p=0.5\); for a coin that's biased 70% heads \(p=0.7\). 

If we plot this function for an unbiased coin (\(p=0.5\)), where \(n=100\) and \(0 \leq k \leq n\), we see this probability distribution:

This is called a binomial distribution and it looks a lot like the normal distribution for large (\(> 30\)) values of \(n\). 

I'm going to re-label the x-axis as a score equal to the fraction of heads: 0 means all tails, 0.5 means \(\frac{1}{2}\) heads, and 1 means all heads. With this slight change, we can more easily compare the shape of the distribution for different values of \(n\). 

I've created two charts below for an unbiased coin (\(p=0.5\)), one with \(n=20\) and one with \(n=40\). Obviously, the \(n=40\) chart is narrower, which is easier to see using the score as the x-axis. 

As an illustration of what these charts mean, I've colored all scores of 0.7 and higher in red. You can see the red area is bigger for \(n=20\) than for \(n=40\). Bear in mind, the red area represents the probability of a score of 0.7 or higher. In other words, if you toss a fair coin 20 times, you have a 0.058 chance of seeing a score of 0.7 or more, but if you toss a fair coin 40 times, the probability of a score of 0.7 or more drops to 0.008.
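You can check those numbers directly from the binomial distribution: the probability of a score of 0.7 or more (i.e. at least 70% heads) for a fair coin, for 20 and for 40 tosses.

from scipy.stats import binom

for n in (20, 40):
    k = int(0.7 * n)                      # 14 heads out of 20, 28 heads out of 40
    print(n, binom.sf(k - 1, n, 0.5))     # P(X >= k): ~0.058 for n=20, ~0.008 for n=40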


These charts tell us something useful: as we increase the number of tosses, the curve gets narrower, meaning the probability of getting results further away from \(0.5\) gets smaller. If we saw a score of 0.7 for 20 tosses, we might not be able to say the coin was biased, but if we got a score of 0.7 after 40 tosses, we know this score is very unlikely so the coin is more likely to be biased.

Thresholds

Let me re-state some facts:

  • For any coin (biased or unbiased) any score from 0 to 1 is possible for any number of tosses.
  • Some results are less likely than others; e.g. for an unbiased coin and 40 tosses, there's only a 0.008 chance of seeing a score of 0.7.

We can use probability thresholds to decide between biased and non-biased coins.  We're going to use a threshold (mostly called confidence) of 95% to decide if the coin is biased or not. In the chart below, the red areas represent 5% probability, and the blue areas 95% probability.

Here's the idea to work out if the coin is biased. Set a confidence value, usually at 0.05. Throw the coin \(n\) times, record the number of heads and work out a score. Draw the theoretical probability chart for the number of throws (like the one I've drawn above) and color in 95% of the probabilities blue and 5% red. If the experimental score lands in the red zones, we'll consider the coin to be biased, if it lands in the blue zone, we'll consider it unbiased.

This is probabilistic decision-making. Using a confidence threshold of 0.05 means we'll wrongly say a fair coin is biased 5% of the time. Can we make the threshold stricter, say 0.01? Yes, we could, but the cost is increasing the number of trials.

As you might expect, there are shortcuts and we don't actually have to draw out the chart. In Python, you can use the binom_test function in the scipy.stats package (in newer versions of SciPy it's been replaced by binomtest). 

To simplify, binom_test has three arguments:

  • x - the number of successes
  • n - the number of samples
  • p - the hypothesized probability of success
It returns a p-value which we can use to make a decision.

Let's see how this works with a confidence of 0.05. Let's take the case where we have 200 coin tosses and 140 (70%) of them come up heads. We're hypothesizing that the coin is fair, so \(p=0.5\).

from scipy import stats
print(stats.binom_test(x=140, n=200, p=0.5))

The p-value we get is 1.5070615573524992e-08, which is way less than our confidence threshold of 0.05 (we're in the red area of the chart above). We would then reject the idea that the coin is fair.

What if we got 115 heads instead?

from scipy import stats
print(stats.binom_test(x=115, n=200, p=0.5))

This time, the p-value is 0.10363903843786755, which is greater than our confidence threshold of 0.05 (we're in the blue area of the chart), so the result is consistent with a fair coin (we fail to reject the null).

What if my results are not significant? How many tosses?

Let's imagine you have reason to believe the coin is biased. You throw it 200 times and you see 115 heads. binom_test tells you you can't conclude the coin is biased. So what do you do next?

The answer is simple, toss the coin more times.

The formula for the sample size, \(n\), is:

\[n = \frac{p(1-p)} {\sigma^2}\]

where \(\sigma\) is the standard error. 

Here's how this works in practice. Let's assume we think our coin is just a little biased, to 0.55, and we want the standard error to be \(\pm 0.04\). Here's how many tosses we would need: 154. What if we want more certainty, say \(\pm 0.005\), then the number of tosses goes up to 9,900. In general, the bigger the bias, the fewer tosses we need, and the more certainty we want the more tosses we need. 
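The formula above is simple enough to wrap in a small helper: p is the bias you want to be able to detect and se is the standard error you're willing to accept.

def tosses_needed(p, se):
    # n = p(1-p) / sigma^2, where sigma is the standard error
    return p * (1 - p) / se ** 2

print(tosses_needed(0.55, 0.04))    # ~154.7: the roughly 154 tosses quoted above
print(tosses_needed(0.55, 0.005))   # 9,900 tosses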

If I think my coin is biased, what's my best estimate of the bias?

Let's imagine I toss the coin 1,000 times and see 550 heads. binom_test tells me the result is significant and it's likely my coin is biased, but what's my estimate of the bias? This is simple, it's actually just the mean, so 0.55. Using the statistics of proportions, I can actually put a 95% confidence interval around my estimate of the bias of the coin. Through math I won't show here, using the data we have, I can estimate the coin is biased 0.55 ± 0.03.
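Here's a sketch of that confidence interval calculation, using the standard 95% normal-approximation interval for a proportion.

import math

heads, n = 550, 1000
p_hat = heads / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se

print(p_hat, margin)   # 0.55 ± ~0.031, matching the 0.55 ± 0.03 quoted above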

Is my coin biased?

This is a nice theoretical discussion, but how might you go about deciding if a coin is biased? Here's a step-by-step process.

  1. Decide on the level of certainty you want in your results. 95% is a good measure.
  2. Decide the minimum level of bias you want to detect. If the coin should return heads 50% of the time, what level of bias can you live with? If it's biased to 60%, is this OK? What about biased to 55% or 50.5%?
  3. Calculate the number of tosses you need.
  4. Toss your coin.
  5. Use binom_test to figure out if the coin deviates significantly from 0.5.


Friday, January 7, 2022

Prediction, distinction, and interpretation: the three parts of data science

What does data science boil down to?

Data science is a relatively new discipline that means different things to different people (most notably, to different employers). Some organizations focus solely on machine learning, while others lean on interpretation, and yet others get close to data engineering. In my view, all of these are part of the data science role. 

I would argue data science generally is about three distinct areas:

  • Prediction. The ability to accurately extrapolate from existing data sets to make forecasts about future behavior. This is the famous machine learning aspect and includes solutions like recommender systems.
  • Distinction. The key question here is: "are these numbers different?". This includes the use of statistical techniques to decide if there's a difference or not, for example, specifying an A/B test and explaining its results. 
  • Interpretation. What are the factors that are driving the system? This is obviously related to prediction but has similarities to distinction too.

(A similar view of data science to mine: Calvin.Andrus, CC BY-SA 3.0, via Wikimedia Commons)

I'm going to talk through these areas and list the skills I think a data scientist needs. In my view, to be effective, you need all three areas. The real skill is to understand what type of problem you face and to use the correct approach.

Distinction - are these numbers different?

This is perhaps the oldest area and the one you might disagree with me on. Distinction is firmly in the realm of statistics. It's not just about A/B tests or quasi-experimental tests, it's also about evaluating models too.

Here's what you need to know:

  • Confidence intervals.
  • Sample size calculations. This is crucial and often overlooked by experienced data scientists. If your data set is too small, you're going to get junk results, so you need to know what too small is. In the real world, increasing the sample size is often not an option and you need to know why.
  • Hypothesis testing. You should know the difference between a t-test and a z-test and when a z-test is appropriate (hint: sample size).
  • α, β, and power. Many data scientists have no idea what statistical power is. If you're doing any kind of statistical testing, you need to have a firm grasp of power.
  • The requirements for running a randomized control trial (RCT). Some experienced data scientists have told me they were analyzing results from an RCT, but their test just wasn't an RCT - they didn't really understand what an RCT was.
  • Quasi-experimental methods. Sometimes, you just can't run an RCT, but there are other methods you can use including difference-in-difference, instrumental variables, and regression discontinuity.  You need to know which method is appropriate and when. 
  • Regression to the mean. This is why you almost always need a control group. I've seen experienced data scientists present results that could almost entirely be explained by regression to the mean. Don't be caught out by one of the fundamentals of statistics.

Prediction - what will happen next?

This is the piece of data science that gets all the attention, so I won't go into too much detail.

Here's what you need to know:

  • The basics of machine learning models, including:
    • Generalized linear modeling
    • Random forests (including knowing why they are often frowned upon)
    • k-nearest neighbors/k-means clustering
    • Support Vector Machines
    • Gradient boosting.
  • Cross-validation, regularization, and their limitations.
  • Variable importance and principal component analysis.
  • Loss functions, including RMSE.
  • The confusion matrix, accuracy, sensitivity, specificity, precision-recall and ROC curves.

There's one topic that's not on any machine learning course or in any machine learning book that I've ever read, but it's crucially important: knowing when machine learning fails and when to stop a project.  Machine learning doesn't work all the time.

Interpretation - what's going on?

The main techniques here are often data visualization. Statistical summaries are great, but they can often mislead. Charts give a fuller picture. 

Here are some techniques all data scientists should know:

  • Heatmaps
  • Violin plots
  • Scatter plots and curve fitting
  • Bar charts
  • Regression and curve fitting.

They should also know why pie charts in all their forms are bad. 

A good knowledge of how charts work is very helpful too (the psychology of visualization).

What about SQL and R and Python...?

You need to be able to manipulate data to do data science, which means SQL, Python, or R. But plenty of people use these languages without being data scientists. In my view, despite their importance, they're table stakes.

Book knowledge vs. street knowledge

People new to data science tend to focus almost exclusively on machine learning (prediction in my terminology) which leaves them very weak on data analysis and data exploration; even worse, their lack of statistical knowledge sometimes leads them to make blunders on sample size and loss functions. No amount of cross-validation, regularization, or computing power will save you from poor modeling choices. Even worse, not knowing statistics can lead people to produce excellent models of regression to the mean.

Practical experience is hugely important; way more important than courses. Obviously, a combination of both is best, which is why PhDs are highly sought after; they've learned from experience and have the theoretical firepower to back up their practical knowledge.