Showing posts with label analytics. Show all posts

What were the numbers?

Often in business, we're presented with charts where the y-axis is unlabeled because the presenter wants to conceal the numbers. Are there ways of reconstructing the labels and figuring out what the data is? Surprisingly, yes there are.

Given a chart like this:

you can often figure out what the chart values should be.

The great Evan Miller posted on this topic several years ago ("How To Read an Unlabeled Sales Chart"). He discussed two methods:

• Greatest common divisor (gcd)
• Poisson distribution

In this blog post, I'm going to take his gcd work a step further and present code and a process for reconstructing numbers under certain circumstances. In another blog post, I'll explain the Poisson method.

The process I'm going to describe here will only work:

• Where the underlying data is integers
• Where there's 'enough' range in the underlying data.
• Where the maximum underlying data is less than about 200.
• Where the y-axis includes zero.

The results

I generated this chart without axis labels, the goal being to recreate the underlying data. I measured the screen y-coordinates of the top and bottom plot borders (187 and 677) and the y-coordinates of the top of each of the bars. Using the process and code I describe below, I was able to correctly recreate the underlying data values, which were $$[33, 30, 32, 23, 32, 26, 18, 59, 47]$$.

How plotting packages work

To understand the method, we need to understand how a plotting package will render a set of integers on a chart.

Let's take the list of numbers $$[1, 2, 3, 5, 7, 11, 13, 17, 19, 23]$$ and call them $$y_o$$.

When a plotting package renders $$y_o$$ on the screen, it will put them into a chart with screen x-y coordinates. It's helpful to think about the chart on the screen as a viewport with x and y screen dimensions. Because we only care about the y dimensions, that's what I'll talk about. On the screen, the viewport might go from 963 pixels to 30 pixels on the y-axis, a total range of 933 y-pixels.

Here's how the numbers $$y_o$$ might appear on the screen and how they map to the viewport y-coordinates. Note the screen origin is top left, not bottom left as it would be on a conventional chart. I'll "correct" for the different origin.

The plotting package will translate the numbers $$y_o$$ to a set of screen coordinates I'll call $$y_s$$. Assuming our viewport starts from 0, we have:

$y_s = my_o$

Let's just look at the longest bar that corresponds to the number 23. My measurements of the start and end are 563 and 27, which gives a length of 536. $$m$$ in this case is 536/23, or 23.3.

There are three things to bear in mind:

• The set of numbers $$y_o$$ are integers
• The set of numbers $$y_s$$ are integers - we can't have half a pixel for example.
• The scalar $$m$$ is a real number
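Here's a minimal sketch of that mapping. It assumes the plotting package simply multiplies by $$m$$ and rounds to the nearest whole pixel; real packages may round differently, and the function name `to_pixels` is only illustrative.

```python
def to_pixels(y_o, m):
    """Scale integer data values to integer screen lengths."""
    return [round(m * y) for y in y_o]


y_o = [1, 2, 3, 5, 7, 11, 13, 17, 19, 23]
m = 536 / 23  # real-valued scale factor from the longest bar measured above

y_s = to_pixels(y_o, m)
print(y_s)
```

The data and the pixel lengths are both integers, but the scale factor that connects them is not, which is why the rounding step matters later on.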

Integer only solutions for $$m$$

In Evan Miller's original post, he only considered integer values of $$m$$. If we restrict ourselves to integers, then most of the time:

$m = gcd(y_s)$

where gcd is the greatest common divisor.

To see how this works, let's take:

$y_o = [1 , 2, 3]$

and

$m = 8$

These numbers give us:

$y_s = [8, 16, 24]$

To find the gcd in Python:

import numpy as np

np.gcd.reduce([8, 16, 24])

which gives $$m = 8$$, which is correct.

If we could guarantee $$m$$ was an integer, we'd have an answer; we'd be able to reconstruct the original data just using the gcd function. But we can't do that in practice for three reasons:
1. $$m$$ isn't always an integer.
2. There are measurement errors which mean there will be some uncertainty in our $$y_s$$ values.
3. It's possible the original data set $$y_o$$ has a gcd which is not 1.

In practice, we gather screen coordinates using a manual process which will introduce errors. At most, we're likely to be off by a few pixels for each measurement. However, even the smallest error will mean the gcd method won't work. For example, if the value on the screen should be 500 but we incorrectly measure it as 499, this small error means the method fails (there is a way around this failure that will work for small measurement errors).

If our original data set has a gcd greater than 1, the method won't work. Let's say our data was:

$y_o = [2, 4, 6]$

and:

$m=8$

we would have:

$y_s = [16, 32, 48]$

which has a gcd of 16, which is an incorrect estimate of $$m$$. In practice, the odds of the original data set $$y_o$$ having a gcd > 1 are low.
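The clean case and two of the failure modes can be reproduced in a few lines. This is a sketch using only the standard library; the helper name `gcd_estimate` and the off-by-one measurement are illustrative assumptions.

```python
from functools import reduce
from math import gcd


def gcd_estimate(y_s):
    """Estimate an integer scale factor m as the gcd of the screen lengths."""
    return reduce(gcd, y_s)


# Clean case: y_o = [1, 2, 3] and m = 8.
print(gcd_estimate([8, 16, 24]))   # 8

# One measurement is off by a single pixel (24 measured as 23).
print(gcd_estimate([8, 16, 23]))   # 1

# The original data has a common factor: y_o = [2, 4, 6] and m = 8.
print(gcd_estimate([16, 32, 48]))  # 16
```

A one-pixel error collapses the estimate to 1, and a common factor in the data doubles it.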

The real killer for this approach is the fact that $$m$$ is highly likely in practice to be a real number.

Real solutions for $$m$$

The only way I've found for solving for $$m$$ is to try different values for $$m$$ to see what succeeds. To get this to work, we have to constrain $$m$$ because otherwise there would be an infinite number of values to try. Here's how I constrain $$m$$:

• I limit the steps for different $$m$$ values to 0.01.
• I start my m values from just over 1 and I stop at a maximum $$m$$ value. My maximum $$m$$ value I get from assuming the smallest value I measure on the screen corresponds to a data value of 1, for example, if the smallest measurement is 24 pixels, the smallest possible original data is 1, so the maximum value for $$m$$ is 24.

Now we've constrained $$m$$, how do we evaluate $$y_s = my_o$$? First off, we define an error function. We want our estimates of the original data $$y_o$$ to be integers, so the further away we are from an integer, the worse the error. For the $$i$$th element of our estimate of $$y_o$$, the error estimate is:

$round \left ( \frac{y_{si}}{m_{estimate}} \right ) - \frac{y_{si}}{m_{estimate}}$

we're choosing the least square error, which means minimizing:

$\frac{1}{n} \sum \left ( round \left ( \frac{y_{si}}{m_{estimate}} \right ) - \frac{y_{si}}{m_{estimate}} \right )^2$

in code, this comes out as:

sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)

Our goal is to try different values of $$m$$ and choose the solution that yields the lowest error estimate.

The solution in practice

Before I show you how this works, there are two practicalities. The first is that $$m=1$$ is always a solution and will always give a zero error, but it's probably not the right solution, so we're going to ignore $$m=1$$. Secondly, there will be an error in our measurements due to human error. I'm going to assume the maximum error is 3 pixels for any measurement. To calculate a length, we take a measurement of the start and end of the bar (if it's a bar chart), which means our maximum uncertainty is 2*3. That's why I set my maximum $$m$$ to be min(y) + 2*MAX_ERROR.
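To make these two practicalities concrete, here is a small sketch on toy data. It reuses the error function from above on the earlier $$[8, 16, 24]$$ example; the toy values are for illustration only.

```python
MAX_ERROR = 3  # assumed maximum measurement error in pixels


def mse(y, div):
    """Mean squared distance of y/div from the nearest integers."""
    return sum((round(_y / div) - _y / div) ** 2 for _y in y) / len(y)


y = [8, 16, 24]  # toy screen lengths: y_o = [1, 2, 3] with m = 8

print(mse(y, 1))  # 0.0: dividing integers by 1 always gives integers
print(mse(y, 8))  # 0.0: the true scale factor
print(mse(y, 7))  # greater than zero: 8/7, 16/7 and 24/7 are not integers

# Upper end of the search range for m.
max_m = min(y) + 2 * MAX_ERROR
print(max_m)  # 14
```

Because $$m=1$$ ties with the true answer on clean data, the search has to start just above 1.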

To show you how this works, I'll talk you through an example.

The first step is measurement. We need to measure the screen y-coordinates of the plot borders and the top of the bars (or the position of the points on a scatter chart). If the plot doesn't have borders, just measure the position of the bottom of the bars and the coordinate of the highest bar. Here are some measurements I took.

Here are the measurements of the top of the bars (_y_measured): $$[482, 500, 489, 541, 489, 523, 571, 329, 399]$$

Here are the start and stop coordinates of the plot borders (_start, _stop):  $$677, 187$$

To convert these to lengths, the code is just: [_start - _y_m for _y_m in _y_measured]

The length of the screen from the top to the bottom is: _start - _stop = $$490$$

This gives us measured length (y_measured): $$[195, 177, 188, 136, 188, 154, 106, 348, 278]$$

Now we run this code:

import numpy as np
import pandas as pd

MAX_ERROR = 3
STEP = 0.01
ERROR_THRESHOLD = 0.01


def mse(y, div):
    """Mean square error calculation."""
    return sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)


def find_divider(y):
    """Return the non-integer that minimizes the error function."""
    error_list = []
    # Try every candidate m from just over 1 up to the largest plausible value.
    for _div in np.arange(1 + STEP,
                          min(y) + 2*MAX_ERROR,
                          STEP):
        error_list.append({"divider": _div,
                           "error": mse(y, _div)})
    df_error = pd.DataFrame(error_list)
    df_error.plot(x='divider', y='error', kind='scatter')
    # Keep the row(s) with the lowest error and take the first one.
    _slice = df_error[df_error['error'] == df_error['error'].min()]
    divider = _slice['divider'].to_list()[0]
    error = _slice['error'].to_list()[0]
    if error > ERROR_THRESHOLD:
        raise ValueError('The estimated error is {0} which is '
                         'too large for a reliable result.'.format(error))
    return divider


def find_estimate(y, y_extent):
    """Make an estimate of the underlying data."""
    if (max(y) - min(y))/y_extent < 0.1:
        raise ValueError('Too little range in the data to make an estimate.')
    m = find_divider(y)
    return [round(_e/m) for _e in y], m


# Measurements from above.
_y_measured = [482, 500, 489, 541, 489, 523, 571, 329, 399]
_start, _stop = 677, 187
y_measured = [_start - _y_m for _y_m in _y_measured]
y_extent = _start - _stop

estimate, m = find_estimate(y_measured, y_extent)

This gives us this output:

Original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Measured y values: [195, 177, 188, 136, 188, 154, 106, 348, 278]

Divider (m) estimate: 5.900000000000004

Estimated original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Which is correct.

Limitations of this approach

Here's when it won't work:

• If there's little variation in the numbers on the chart, then measurement errors tend to overwhelm the calculations and the results aren't good.
• In a similar vein, if the numbers are all close to the top or the bottom of the chart, measurement errors lead to poor results.
• If $$m < 1$$. Because the y viewport is usually 500-900 pixels tall, the method won't work for data values greater than about 500.
• If $$m < 3$$. I've found in practice that the results can be unreliable in this range. Arbitrarily, I call any error greater than 0.01 too high to protect against poor results. Maybe I should limit the results to $$m > 3$$.

I'm not entirely convinced my error function is correct; I'd like an error function that better discriminates between values. I tried a couple of alternatives, but they didn't give good results. Perhaps you can do better.

Notice that the error function is 'denser' closer to 1, suggesting I should use a variable step size or a different algorithm. It might be that the closer you get to 1, the more errors and the effects of rounding overwhelm the calculation. I've played around with smaller step sizes and not had much luck.

Future work

If the data is Poisson distributed, there's an easier approach you can take. In a future blog post, I'll talk you through it.

Where to get the code

I've put the code on my Github page here: https://github.com/MikeWoodward/CodeExamples/blob/master/UnlabeledChart/approxrealgcd.py

$$\beta$$ is $$\alpha$$ if there's an effect

In hypothesis testing, there are two kinds of errors:

• Type I - we say there's an effect when there isn't. The threshold here is $$\alpha$$.
• Type II - we say there's no effect when there really is an effect. The threshold here is $$\beta$$.
This blog post is all about explaining and calculating $$\beta$$.

The null hypothesis

Let's say we do an A/B test to measure the effect of a change to a website. Our control branch is the A branch and the treatment branch is the B branch. We're going to measure the conversion rate $$C$$ on both branches. Here are our null and alternative hypotheses:

• $$H_0: C_B - C_A = 0$$ there is no difference between the branches
• $$H_1: C_B - C_A \neq 0$$ there is a difference between the branches

Remember, we don't know if there really is an effect, we're using procedures to make our best guess about whether there is an effect or not, but we could be wrong. We can say there is an effect when there isn't (Type I error) or we can say there is no effect when there is (Type II error).

Mathematically, we're taking the mean of thousands of samples so the central limit theorem (CLT) applies and we expect the quantity $$C_B - C_A$$ to be normally distributed. If there is no effect, then $$C_B - C_A = 0$$, if there is an effect $$C_B - C_A \neq 0$$.

$$\alpha$$ in a picture

Let's assume there is no effect. We can plot out our expected probability distribution and define an acceptance region (blue, 95% of the distribution) and two rejection regions (red, 5% of the distribution). If our measured $$C_B - C_A$$ result lands in the blue region, we will accept the null hypothesis and say there is no effect. If our result lands in the red region, we'll reject the null hypothesis and say there is an effect. The red region is defined by $$\alpha$$.

One way of looking at the blue area is to think of it as a confidence interval around the mean $$\bar x_0$$:

$\bar x_0 + z_\frac{\alpha}{2} s \; and \; \bar x_0 + z_{1-\frac{\alpha}{2}} s$

In this equation, s is the standard error in our measurement. The probability of a measurement $$x$$ lying in this range is:

$0.95 = P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ]$

If we transform our measurement $$x$$ to the standard normal $$z$$, and we're using a 95% acceptance region (boundaries given by $$z$$ values of 1.96 and -1.96), then we have for the null hypothesis:

$0.95 = P[-1.96 < z < 1.96]$
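As a quick check of those boundaries, here is a sketch using Python's standard-library `statistics.NormalDist`. The variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # mean 0, standard deviation 1

z_low = std_normal.inv_cdf(alpha / 2)
z_high = std_normal.inv_cdf(1 - alpha / 2)
print(round(z_low, 2), round(z_high, 2))  # -1.96 1.96

# Probability that z lands in the acceptance region under the null hypothesis.
p_accept = std_normal.cdf(z_high) - std_normal.cdf(z_low)
print(round(p_accept, 2))  # 0.95
```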

$$\beta$$ in a picture

Now let's assume there is an effect. How likely is it that we'll say there's no effect when there really is an effect? This is the threshold $$\beta$$.

To draw this in pictures, I want to take a step back. We have two hypotheses:

• $$H_0: C_B - C_A = 0$$ there is no difference between the branches
• $$H_1: C_B - C_A \neq 0$$ there is a difference between the branches

We can draw a distribution for each of these hypotheses. Only one distribution will apply, but we don't know which one.

If the null hypothesis is true, the blue region is where our true negatives lie and the red region is where the false positives lie. The boundaries of the red/blue regions are set by $$\alpha$$. The value of $$\alpha$$ gives us the probability of a false positive.

If the alternate hypothesis is true, the true positives will be in the green region and the false negatives will be in the orange region. The boundary of the green/orange regions is the same boundary set by $$\alpha$$ for the null hypothesis. The size of the orange region, $$\beta$$, gives us the probability of a false negative.

Calculating $$\beta$$

Calculating $$\beta$$ is calculating the orange area of the alternative hypothesis chart. The boundaries are set by $$\alpha$$ from the null hypothesis. This is a bit twisty, so I'm going to say it again with more words to make it easier to understand.

$$\beta$$ is about false negatives. A false negative occurs when there is an effect, but we say there isn't. When we say there isn't an effect, we're saying the null hypothesis is true. For us to say there isn't an effect, the measured result must lie in the blue region of the null hypothesis distribution.

To calculate $$\beta$$, we need to know what fraction of the alternate hypothesis lies in the acceptance region of the null hypothesis distribution.

Let's take an example so I can show you the process step by step.

1. Assuming the null hypothesis, set up the boundaries of the acceptance and rejection regions. Assuming a 95% acceptance region and a mean of $$\bar x_0$$, this gives the acceptance region as:
$P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ]$ which is the mean and 95% confidence interval for the null hypothesis. Our measurement $$x$$ must lie between these bounds.
2. Now assume the alternate hypothesis is true. If the alternate hypothesis is true, then our mean is $$\bar x_1$$.
3. We're still using this equation from before, but this time, our distribution is the alternate hypothesis.
$P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ]$
4. Transforming to the standard normal distribution using the formula $$z = \frac{x - \bar x_1}{s}$$, we can write the probability $$\beta$$ as:
$\beta = P \left [ \frac{\bar x_0 + z_\frac{\alpha}{2} s - \bar x_1}{s} < z < \frac{ \bar x_0 + z_{1-\frac{\alpha}{2}} s - \bar x_1}{s} \right ]$

This time, let's put some numbers in.

• $$n = 200,000$$ (100,000 per branch)
• $$C_B = 0.062$$
• $$C_A = 0.06$$
• $$\bar x_0= 0$$ - the null hypothesis
• $$\bar x_1 = 0.002$$ - the alternate hypothesis
• $$s = 0.00107$$  - this comes from combining the standard errors of both branches, so $$s^2 = s_A^2 + s_B^2$$, and I'm using the usual formula for the standard error of a proportion, for example $$s_A = \sqrt{\frac{C_A(1-C_A)}{n_A} }$$, where $$n_A$$ is the per-branch sample size of 100,000
Plugging them all in, this gives:
$\beta = P[ -3.829 < z < 0.090]$
which gives $$\beta = 0.536$$
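The same calculation can be sketched in code. This uses the standard library's `statistics.NormalDist` and assumes 100,000 samples in each branch; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n_per_branch = 100_000
c_a, c_b = 0.06, 0.062
alpha = 0.05

# Standard error of the difference between the two proportions.
s_a = sqrt(c_a * (1 - c_a) / n_per_branch)
s_b = sqrt(c_b * (1 - c_b) / n_per_branch)
s = sqrt(s_a**2 + s_b**2)

x0 = 0.0        # mean under the null hypothesis
x1 = c_b - c_a  # mean under the alternate hypothesis

std_normal = NormalDist()
z_low = std_normal.inv_cdf(alpha / 2)
z_high = std_normal.inv_cdf(1 - alpha / 2)

# Acceptance-region boundaries expressed as z-scores of the alternate distribution.
lower = (x0 + z_low * s - x1) / s
upper = (x0 + z_high * s - x1) / s

beta = std_normal.cdf(upper) - std_normal.cdf(lower)
print(f"s = {s:.5f}, lower = {lower:.3f}, upper = {upper:.3f}, beta = {beta:.3f}")
```

The printed values should agree with the hand calculation above to within rounding. The upper bound depends on how many digits of $$s$$ and of 1.96 you carry through.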

This is too hard

This process is complex and involves lots of steps. In my view, it's too complex. It feels to me that there must be an easier way of constructing tests. Bayesian statistics holds out the hope for a simpler approach, but widespread adoption of Bayesian statistics is probably a generation or two away. We're stuck with an overly complex process using very difficult language.

Tuesday, July 6, 2021

Spritely fraud detection

Sadly, there's a long history of scientific fraud and misrepresentation of data. Modern computing technology has provided better tools for those trying to mislead, but the fortunate flip side is, modern tools provide ways of exposing misrepresented data. It turns out, the right tools can indicate what's really going on.

(Author: Nick Youngson. License: Creative Commons. Source: Wikimedia)

In business, companies often say they can increase sales, or reduce costs, or do some other desirable thing. The evidence is sometimes in the form of summary statistics like means and standard deviations. Do you think you could assess the credibility of evidence based on the mean and standard deviation summary data alone?

In this blog post, I'm going to talk about how you can use one tool to investigate the credibility of mean and standard deviation evidence.

Discrete quantities

Discrete quantities are quantities that can only take discrete values. An example is a count, for example, a count of the number of sales. You can have 0, 1, 2, 3... sales, but you can't have -1 sales or 563.27 sales.

Some business quantities are measured on scales of 1 to 5 or 1 to 10, for example, net promoter scores or employee satisfaction scores. These scales are often called Likert scales.

For our example, let's imagine a company is selling a product on the internet and asks its customers how likely they are to recommend the product. The recommendation is on a scale of 0 to 10, where 0 is very unlikely to recommend and 10 is very likely to recommend. This is obviously based on the net promoter idea, but I'm simplifying things here.

Very unlikely to recommend                   Very likely to recommend
0 1 2 3 4 5 6 7 8 9 10

Imagine the salesperson for the company tells you the results of a 500 person study are a mean of 9 and a standard deviation of 2.5. They tell you that customers love the product, but obviously, there's some variation. The standard deviation shows you that not everyone's satisfied and that the numbers are therefore credible.

But are these numbers really credible?

Stop for a second and think about it. It's quite possible that their customers love the product. A mean of 9 on a scale of 10 isn't perfection, and the standard deviation of 2.5 suggests there is some variation, which you would expect. Would you believe these numbers?

Investigating credibility

We have three numbers; a mean, a standard deviation, and a sample size. Lots of different distributions could have given rise to these numbers, how can we backtrack to the original data?

The answer is, we can't fully backtrack, but we can investigate possibilities.

In 2018, a group of academic researchers in The Netherlands and the US released software you can use to backtrack to possible distributions from mean and standard deviation data. Their goal was to provide a tool to help investigate academic fraud. They wrote up how their software works and published it online; you can read their writeup here. They called their software SPRITE (Sample Parameter Reconstruction via Iterative TEchniques) and made it open-source, even going so far as to make a version of it available online. The software will show you the possible distributions that could give rise to the summary statistics you have.

One of the online versions is here. Let's plug in the salesperson's numbers to see if they're credible.

If you go to the SPRITE site, you'll see a menu on the left-hand side. In my screenshot, I've plugged in the numbers we have:

• Our scale goes from 0 to 10,
• Our mean is 9,
• Our standard deviation is 2.5,
• The number of samples is 500.
• We'll choose 2 decimal places for now
• We'll just see the top 9 possible distributions.

Here are the top 9 results.

Something doesn't smell right.  I would expect the data to show some form of more even distribution about the mean. For a mean of 9, I would expect there to be a number of 10s and a number of 8s too. These estimated distributions suggest that almost everyone is deliriously happy, with just a small handful of people unhappy. Is this credible in the real world? Probably not.

I don't have outright evidence of wrong-doing, but I'm now suspicious of the data. A good next step would be to ask for the underlying data. At the very least, I should view any other data the salesperson provides with suspicion. To be fair to the salesperson, they were probably provided with the data by someone else.

What if the salesperson had given me different numbers, for example, a mean of 8.5, a standard deviation of 1.2, and 100 samples? Looking at the results from SPRITE, the possible distributions seem much more likely. Yes, misrepresentation is still possible, but on the face of it, the data is credible.
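A much cruder check than SPRITE, and not part of it, is to compare the reported standard deviation with the largest value the scale allows for that mean. The sketch below uses the Bhatia-Davis inequality, which is my own addition here. It treats the figures as population standard deviations, which differ negligibly from sample values at these sample sizes. The function name is illustrative.

```python
from math import sqrt


def max_possible_sd(scale_min, scale_max, mean):
    """Largest population standard deviation a bounded scale allows for a mean.

    Uses the Bhatia-Davis inequality: variance <= (max - mean) * (mean - min).
    """
    return sqrt((scale_max - mean) * (mean - scale_min))


for mean, sd in [(9, 2.5), (8.5, 1.2)]:
    limit = max_possible_sd(0, 10, mean)
    print(f"mean={mean}, sd={sd}: limit={limit:.2f}, ratio={sd / limit:.2f}")
```

The ceiling is only reached when every response is a 0 or a 10. A reported standard deviation close to it, as in the first set of numbers, points to a very polarized distribution. The second set sits much further from its limit.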

Did you spot the other problem?

There's another, more obvious problem with the data. The scale is from 0 to 10, but the results are a mean of 9 and a standard deviation of 2.5, which imply a range of 6.5 to 11.5 for one standard deviation either side of the mean. To state the obvious, the maximum score is 10 but the upper end of that range is 11.5. This type of mistake is very common and doesn't of itself indicate fraud. I'll blog more about this type of mistake later.

What does this mean?

Due diligence is about checking claims for veracity before spending money. If there's a lot of money involved, it behooves the person doing the due diligence to check the consistency of the numbers they've been given. Tools like SPRITE are very helpful for sniffing out areas to check in more detail. However, just because a tool like SPRITE flags something up it doesn't mean to say there's fraud; people make mistakes with statistics all the time. However, if something is flagged up, you need to get to the bottom of it.

Finding out more

Saturday, February 27, 2021

Simpson's paradox: a trap for the naive analyst

Let's imagine you're the Chief Revenue Officer at a manufacturing company that sells tubes and cylinders. You're having trouble with European sales reps discounting, so you offer a spif: the country team that sells at the highest price gets a week-long vacation somewhere warm and sunny with free food and drink. The Italian and German sales teams are raring to go.

At the end of the quarter, you have these results [Wang]:

| Sales team | Cylinder: No. of sales | Cylinder: Average price | Tube: No. of sales | Tube: Average price |
|---|---|---|---|---|
| German | 80 | €100 | 20 | €70 |
| Italian | 20 | €120 | 80 | €80 |

This looks like a clear victory for the Italians! They maintained a higher price for both cylinders and tubes! If they have a higher price for every item, then obviously, they've won. The Italians start packing their swimsuits.

Not so fast, say the Germans, let's look at the overall results.

| Sales team | Average price |
|---|---|
| German | €94 |
| Italian | €88 |

Despite having a lower selling price for both cylinders and tubes, the Germans have maintained a higher selling price overall!

How did this happen? It's an instance of Simpson's paradox.

Why the results reversed

Here's how this happened: the Germans sold more of the expensive cylinders and the Italians sold more of the cheaper tubes. The average price is the ratio of the total monetary amount/total sales quantity. To put it very simply, ratios (prices) can behave oddly.
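The arithmetic behind the two tables is easy to check. Here is a short sketch using the numbers above; the data structure and function name are illustrative.

```python
# (number of sales, average price in euros) for each team and product.
sales = {
    "German": {"Cylinder": (80, 100), "Tube": (20, 70)},
    "Italian": {"Cylinder": (20, 120), "Tube": (80, 80)},
}


def pooled_price(products):
    """Total revenue divided by total units sold."""
    revenue = sum(count * price for count, price in products.values())
    units = sum(count for count, _ in products.values())
    return revenue / units


for team, products in sales.items():
    print(team, pooled_price(products))
# German 94.0
# Italian 88.0
```

The pooled price is a weighted average, and the weights (the sales mix) differ between the two teams.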

Let's look at a plot of the selling prices for the Germans and Italians.

The blue circles are tubes and the orange circles are cylinders. The size of the circles represents the number of sales. The little red dot in the center of the circles is the price.

Let's look at cylinders. Plainly, the Italians sold them at a higher price, but they're the most expensive item and the Germans sold more of them. Now, let's look at tubes, once again, the Italians sold them at a higher price than the Germans, but they're cheaper than cylinders and the Italians sold more of them.

You can probably see where this is going. Because the Italians sold more of the cheaper items, their average (or pooled) price is dragged down, despite maintaining a higher price on a per-item basis. I've re-drawn the chart, but this time I've added a horizontal black line that represents the average.

The product type (cylinders or tubes) is known in statistics as a confounder because it confounds the results. It's also known as a conditioning variable.

A disturbing example - does this drug work?

The sales example is simple and you can see the cause of the trouble immediately. Let's look at some data from a (pretend) clinical trial.

Imagine there's some disease that impacts men and women and that some people get better on their own without any treatment at all. Now let's imagine we have a drug that might improve patient outcomes. Here's the data [Lindley].

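| | Female: Recovered | Female: Not recovered | Female: Rate | Male: Recovered | Male: Not recovered | Male: Rate |
|---|---|---|---|---|---|---|
| Took drug | 8 | 2 | 80% | 12 | 18 | 40% |
| Not take drug | 21 | 9 | 70% | 3 | 7 | 30% |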

Wow! The drug adds 10 percentage points to the recovery rate for both women and men. Surely we need to prescribe this for everyone? Let's have a look at the overall data.

| Everyone | Recovered | Not recovered | Rate |
|---|---|---|---|
| Took drug | 20 | 20 | 50% |
| Not take drug | 24 | 16 | 60% |

What this data is saying is, the drug reduces the recovery rate by 10 percentage points.

Let me say this again.

• For men, the drug improves recovery by 10 percentage points.
• For women, the drug improves recovery by 10 percentage points.
• For everyone, the drug reduces recovery by 10 percentage points.

If I'm a clinician and I know you have the disease, here's my advice. If you're a woman, I would recommend you take the drug. If you're a man, I would recommend you take the drug. But if I don't know your gender, I would advise you not to take the drug. What!!!!!

This is exactly the same math as the sales example I gave you above. The explanation is the same. The only thing different is the words I'm using and the context.
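Here is the same check for the trial data, as a sketch. The counts come from the tables above; the helper names are illustrative.

```python
# (recovered, not recovered) counts from the tables above.
trial = {
    "Female": {"Took drug": (8, 2), "Not take drug": (21, 9)},
    "Male": {"Took drug": (12, 18), "Not take drug": (3, 7)},
}


def rate(recovered, not_recovered):
    """Fraction of patients who recovered."""
    return recovered / (recovered + not_recovered)


def pooled_counts(trial, arm):
    """Add up the counts for one arm of the trial across all groups."""
    recovered = sum(group[arm][0] for group in trial.values())
    not_recovered = sum(group[arm][1] for group in trial.values())
    return recovered, not_recovered


for sex, arms in trial.items():
    print(sex, {arm: rate(*counts) for arm, counts in arms.items()})

for arm in ("Took drug", "Not take drug"):
    print("Everyone", arm, rate(*pooled_counts(trial, arm)))
```

Most of the women did not take the drug and most of the men did. The pooled rates therefore mix the two groups in very different proportions, just as the sales teams mixed cylinders and tubes.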

Simpson and COVID

In the United States, it's pretty well-established that black and Hispanic people have suffered disproportionately from COVID. Not only is their risk of getting COVID higher, but their health outcomes are worse too. This has been extensively covered in the press and on the TV news.

In the middle of 2020, the CDC published data that showed fatality rates by race/ethnicity. The fatality rate means the fraction of patients with COVID who die. The data showed a clear result: white people had the worst fatality rate of the racial groups they studied.

Doesn't this contradict the press stories?

No.

There are three factors at work:
• The fatality rate increases with age for all ethnic groups. It's much higher for older people (75+) than younger people.
• The white population is older than the black and Hispanic population.
• Whites have lower fatality rates in almost all age groups.
This is exactly the same as the German and Italian sales team example I started with. As a fraction of their population, there are more old white people than old black and Hispanic people, so the fatality rates for the white population are dominated by the older age group in a way that doesn't happen for blacks and Hispanics.

In this case, the overall numbers are highly misleading and the more meaningful comparison is at the age-group level. Mathematically, we can remove the effect of different demographics to make an apples-to-apples comparison of fatality rates, and that's what the CDC has done.
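One common way of removing a demographic effect is direct standardization: apply each group's rates to the same reference population. I don't know exactly which adjustment the CDC used, so treat this as a sketch of the general idea only. It reuses the drug-trial numbers from earlier in this post rather than any COVID data.

```python
# Recovery rates from the drug-trial tables earlier in this post.
rates = {
    "Took drug": {"Female": 0.80, "Male": 0.40},
    "Not take drug": {"Female": 0.70, "Male": 0.30},
}
# Reference population: everyone in the trial, 40 women and 40 men.
reference = {"Female": 40, "Male": 40}


def standardized_rate(group_rates, reference):
    """Weight each group's rate by its share of the reference population."""
    total = sum(reference.values())
    return sum(group_rates[g] * reference[g] / total for g in reference)


for arm, group_rates in rates.items():
    print(arm, round(standardized_rate(group_rates, reference), 2))
# Took drug 0.6
# Not take drug 0.5
```

Once both arms are weighted by the same mix of men and women, the drug comes out ahead, in line with the per-group results.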

In pictures

Wikipedia has a nice article on Simpson's paradox and I particularly like the animation that's used to accompany it, so I'm copying it here.

Each of the dots represents a measurement, for example, it could be price. The colors represent categories, for example, German or Italian sales teams, etc. If we look at the results overall, the trend is negative (shown by the black dots and black line). If we look at the individual categories, the trend is positive (colors). In other words, the aggregation reverses the individual trends.

The classic example - sex discrimination at Berkeley

The Simpson's paradox example that's nearly always quoted is the Berkeley sex discrimination case [Bickel]. I'm not going to quote it here for two reasons: it's thoroughly discussed elsewhere, and the presentation of the results can be confusing. I've stuck to simpler examples to make my point.

American politics

A version of Simpson's paradox can occur in American presidential elections, and it very nicely illustrates the cause of the problem.

In 2016, Hillary Clinton won the popular vote by 48.2% to 46.1%, but Donald Trump won the electoral college by 304 to 227. The reason for the reversal is simple: it's the population spread among the states and the relative electoral college votes allocated to the states. As in the case of the rollup with the sales and medical data I showed you earlier, exactly how the data rolls up can reverse the result.

The question, "who won the 2016 presidential election" sounds simple, but it can have several meanings:

• who was elected president
• who got the most votes
• who got the most electoral college votes
The most obvious meaning, in this case, is, "who was elected president". But when you're analyzing data, it's not always obvious what the right question really is.

The root cause of the problem

The problem occurs because we're using an imprecise language (English) to interpret mathematical results. In the sales and medical data cases, we need to define what we want.

In the sales price example, do we mean the overall price or the price for each category? The contest was ambiguous, but to be fair to our CRO, this wasn't obvious initially. Probably, the fairest result is to take the overall price.

For the medical data case, we're probably better off taking the male and female data separately. A similar argument applies for the COVID example. The clarifying question is, what are you using the statistics for? In the drug data case, we're trying to understand the efficacy of a drug, and plainly, gender is a factor, so we should use the gendered data. In the COVID data case, if we're trying to understand the comparative impact of COVID on different races/ethnicities, we need to remove demographic differences.

If this was the 1980s, we'd be stuck. We can't use statistics alone to tell us what the answer is, we'd have to use data from outside the analysis to help us [Pearl]. But this isn't the 1980s anymore, and there are techniques to show the presence of Simpson's paradox. The answer lies in using something called a directed acyclic graph, usually called a DAG. But DAGs are a complex area and too complex for this blog post that I'm aiming at business people.

What this means in practice

There's a very old sales joke that says, "we'll lose money on every sale but make it up in volume". It's something sales managers like to quote to their salespeople when they come asking for permission to discount beyond the rules. I laughed along too, but now I'm not so quick to laugh. Simpson's paradox has taught me to think before I speak. Things can get weird.

Interpreting large amounts of data is hard. You need training and practice to get it right and there's a reason why seasoned data scientists are sought after. But even experienced analysts can struggle with issues like Simpson's paradox and multi-comparison problems.

The red alert danger for businesses occurs when people who don't have the training and expertise start to interpret complex data. Let's imagine someone who didn't know about Simpson's paradox had the sales or medical data problem I've described here. Do you think they could reach the 'right' conclusion?

The bottom line is simple: you've got to know what you're doing when it comes to analysis.

References

[Bickel] Sex Bias in Graduate Admissions: Data from Berkeley, By P. J. Bickel, E. A. Hammel, J. W. O'Connell, Science, 07 Feb 1975: 398-404
[Lindley] Lindley, D. and Novick, M. (1981). The role of exchangeability in inference. The Annals of Statistics 9 45–58.
[Pearl] Judea Pearl, Comment: Understanding Simpson’s Paradox, The American Statistician, 68(1):8-13, February 2014.
[Wang] Wang B, Wu P, Kwan B, Tu XM, Feng C. Simpson's Paradox: Examples. Shanghai Arch Psychiatry. 2018;30(2):139-143. doi:10.11919/j.issn.1002-0829.218026

Monday, January 4, 2021

COVID and soccer home team advantage - winning less often

Is it easier for a sports team to win at home? The evidence from sports as diverse as soccer [Pollard], American football [Vergina], rugby [Thomas], and ice hockey [Leard] strongly suggests there is a home advantage and it might be quite large. But what causes it? Is it the crowd cheering the home team, or closeness to home, or playing on familiar turf? One of the weirder side-effects of COVID is the insight it's providing into the origins of home advantage, as we'll see.

(Premier League teams playing in happier times. Image source: Wikimedia Commons, License: Creative Commons, Author: Brian Minkoff)

The EPL - lots of data makes analysis easier

The English Premier League is the world's wealthiest sports league [Robinson]. There's worldwide interest in the league and there has been for a long time, so there's a lot of data available, which makes it ideal for investigating home advantage. One of the nice features of the league is that each team plays every other team twice, once at home and once away.

Expectation and metric

If there were no home team advantage, we would expect the number of home wins and away wins to be roughly equal for the whole league in a season. To investigate home advantage, the metric I'll use is:
$home \ win \ proportion = \frac{number\ of\ home\ wins}{total\ number\ of\ wins}$
If there were no home team advantage, we would expect this number to be close to 0.5.
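As a rough illustration of the metric, here's a minimal Python sketch. The win counts are made-up numbers, not EPL data, and the 95% interval uses a simple normal approximation, which is an assumption on my part and not necessarily how the error bars in the chart were calculated.

```python
import math

def home_win_proportion(home_wins, away_wins):
    """Return the home win proportion and an approximate 95% confidence interval."""
    total_wins = home_wins + away_wins  # draws are not part of the metric
    proportion = home_wins / total_wins
    # Normal approximation to the binomial standard error (an assumption).
    standard_error = math.sqrt(proportion * (1 - proportion) / total_wins)
    margin = 1.96 * standard_error
    return proportion, (proportion - margin, proportion + margin)

# Hypothetical season: 180 home wins and 120 away wins.
proportion, (low, high) = home_win_proportion(180, 120)
print(f"home win proportion = {proportion:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

When the whole interval sits above 0.5, the data is consistent with a real home advantage rather than chance.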

Let's look at the mean home win proportion per season for the EPL. In the chart, the error bars are the 95% confidence interval.
For most seasons, the home win proportion is about 0.6 and it's significantly above 0.5 (in the statistical sense). In other words, there's a strong home-field advantage in the EPL.

But look at the point on the right. What's going on in 2020-2021?

COVID and home wins

Like everything else in the world, the EPL has been affected by COVID. Teams are playing behind closed doors for the 2020-2021 season. There are no fans singing and chanting on the terraces, there are no fans 'oohing' over near misses, and there are no fans cheering goals. Teams are still playing matches home and away but in empty and silent stadiums.

So how has this affected home team advantage?

Take a look at the chart above. The 2020-2021 season is the season on the right. Obviously, we're still partway through the season, which is why the error bars are so big, but look at the mean value. If there were no home team advantage, we would expect a mean of 0.5. For 2020-2021, the mean is currently 0.491.

Let me put this simply. When there are fans in the stadiums, there's a home team advantage. When there are no fans in the stadiums, the home team advantage disappears.

COVID and goals

What about goals? It's possible that a team that might have lost is so encouraged by their fans that they reach a draw instead. Do teams playing at home score more goals?

I worked out the mean goal difference between the home team and the away team and I've plotted it for every season from 2000-2001 onwards.
If there were no home team advantage, you would expect the goal difference to be 0. But it isn't. It mostly hovers around 0.35. Except of course for 2020-2021. For 2020-2021, the goal difference is about zero. The home-field advantage has gone.

What this means

Despite the roll-out of the vaccine, it's almost certain the rest of the 2020-2021 season will be played behind closed doors (assuming the season isn't abandoned). My results are for a partial season, but it's a good bet the final results will be similar. If this is the case, then it will be very strong evidence that fans cheering their team really do make a difference.

If you want your team to win, you need to go to their games and cheer them on.

References

[Leard] Leard B, Doyle JM. The Effect of Home Advantage, Momentum, and Fighting on Winning in the National Hockey League. Journal of Sports Economics. 2011;12(5):538-560.

[Pollard] Richard Pollard and Gregory Pollard, Home advantage in soccer: a review of its existence and causes, International Journal of Soccer and Science Journal Vol. 3 No 1 2005, pp28-44

[Robinson] Joshua Robinson, Jonathan Clegg, The Club: How the English Premier League Became the Wildest, Richest, Most Disruptive Force in Sports, Mariner Books, 2019

[Thomas] Thomas S, Reeves C, Bell A. Home Advantage in the Six Nations Rugby Union Tournament. Perceptual and Motor Skills. 2008;106(1):113-116

[Vergina] Roger C. Vergin, John J. Sosik, No place like home: an examination of the home field advantage in gambling strategies in NFL football, Journal of Economics and Business Volume 51, Issue 1, January–February 1999, Pages 21-31

Why should you care about probability distributions?

Using the wrong probability distribution can be extremely expensive for businesses:

• for businesses using machinery (factories, vehicles, aircraft, etc.), it can lead to parts being changed too frequently or too infrequently
• for businesses relying on returning customers, it can lead to substantial under or over-estimates of revenue and/or targeting the wrong customers with promotions
• for businesses forecasting future sales by territory and/or product, it can lead to poor territory allocation or poor product resource allocation.

Given that it's so important, what is a probability distribution, and what are some examples?

What's a probability distribution?

At its simplest, a probability distribution tells you how likely an outcome is given some input. For example, how is sales probability distributed by price, or how likely is a component to fail in the next month?

If something is certain to occur, the probability is 1; if it's certain not to occur, the probability is zero. Let's imagine a component lasts a maximum of 6 months before failure. Our probability distribution might show the probability of failure on days 1 to 180. The failure probabilities for all days must sum to 1.

In the real world, data is noisy and we don't expect real data to exactly follow theoretical distributions, but given enough data, the match should be close enough for us to reason about what's going on.

Uniform distribution - gambling and manufacturing

If the probability is the same for all input values, the distribution is uniform.

Let's imagine we're manufacturing candy, and we want to have equal numbers of red, blue, green, black, and white sweets in a packet. In theory, here's what we should observe.

But in reality, there's random noise so we might see something like this below. We can quantify the difference between the expected distribution and the actual distribution, which tells us something about the variability in the manufacturing process.
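One common way to quantify that difference is a chi-squared statistic, which adds up the squared gaps between observed and expected counts. Here's a minimal sketch; the packet counts are hypothetical, and using chi-squared here is my choice of technique for illustration.

```python
def chi_squared_statistic(observed_counts, expected_counts):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum(
        (observed - expected) ** 2 / expected
        for observed, expected in zip(observed_counts, expected_counts)
    )

colors = ["red", "blue", "green", "black", "white"]
observed = [22, 18, 21, 19, 20]  # hypothetical counts from one packet of 100 sweets
# Under a uniform distribution, every color is expected equally often.
expected = [sum(observed) / len(colors)] * len(colors)

statistic = chi_squared_statistic(observed, expected)
# 5% critical value of the chi-squared distribution with 4 degrees of freedom.
CRITICAL_VALUE_5_PERCENT = 9.488
print(f"chi-squared = {statistic:.2f}")
print("looks non-uniform" if statistic > CRITICAL_VALUE_5_PERCENT else "consistent with uniform")
```

A statistic well below the critical value means the deviations are the size you'd expect from random noise alone.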

The uniform distribution also occurs in gambling, for example, lotteries or dice games.

Uniform distribution description by NIST

Binomial distribution - pass/fail and conversion

Each customer who comes into a store or who visits a website will either buy or not buy, which we can turn into a conversion rate. We can model these kinds of pass/fail processes using the binomial distribution. Here's the probability distribution.

The binomial distribution shows us the probability of measuring different results given an underlying 'truth'. Let's imagine the 'true' conversion rate was 0.04. Because of sampling error, we might not measure 0.04; instead, we might measure 0.045 or 0.055, depending on how many samples we take. It's important to understand what this means:

• There is uncertainty in our measurement.
• The smaller the sample, the bigger the uncertainty.

Although many technical people understand this, most non-technical people do not, which can lead to tension.
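To make the link between sample size and uncertainty concrete, here's a small sketch using the 0.04 conversion rate from above. The visitor counts are arbitrary values I've picked for illustration.

```python
import math

def binomial_pmf(successes, trials, rate):
    """Probability of exactly `successes` conversions in `trials` visits."""
    return math.comb(trials, successes) * rate**successes * (1 - rate) ** (trials - successes)

def standard_error(rate, trials):
    """Standard error of a measured conversion rate."""
    return math.sqrt(rate * (1 - rate) / trials)

TRUE_RATE = 0.04

# The uncertainty shrinks as the number of visitors grows.
for visitors in (100, 1_000, 10_000):
    print(f"{visitors:>6} visitors: 0.04 +/- {standard_error(TRUE_RATE, visitors):.4f}")

# With only 100 visitors, measuring exactly 4 conversions is far from guaranteed.
probability_of_exactly_four = binomial_pmf(4, 100, TRUE_RATE)
print(f"P(exactly 4 conversions in 100 visits) = {probability_of_exactly_four:.3f}")
```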

Yale stats

Poisson distribution - waiting in line

Imagine you're a bank serving customers with ATMs at a location. ATMs are expensive, but you don't want to keep people waiting in long lines to do their transactions; it's bad for business. So how do you balance the cost of an ATM against its use? By modeling how many people are using the ATM over a time period.

It turns out, the number of people who visit an ATM over a time period can be modeled using the Poisson distribution, which I've shown below. This gives us a way of assessing how much variation there might be in usage and therefore how many machines we might want to install.

The Poisson distribution is often used to model counting processes. It's very attractive because it has an unusual feature: the standard deviation for the distribution is $$\sqrt{\gamma}$$ where $$\gamma$$ is the mean. Unfortunately, this property makes it a little too attractive; it's sometimes used when it shouldn't be.
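Here's a quick numerical check of that square-root property, as a sketch. The mean of 9 ATM customers per hour is a hypothetical value.

```python
import math

def poisson_pmf(count, mean):
    """Probability of observing exactly `count` events when the mean is `mean`."""
    return math.exp(-mean) * mean**count / math.factorial(count)

MEAN_CUSTOMERS_PER_HOUR = 9  # hypothetical

# Counts above 80 are so unlikely for a mean of 9 that we can ignore them.
counts = range(80)
probabilities = [poisson_pmf(count, MEAN_CUSTOMERS_PER_HOUR) for count in counts]

mean = sum(count * p for count, p in zip(counts, probabilities))
variance = sum((count - mean) ** 2 * p for count, p in zip(counts, probabilities))
standard_deviation = math.sqrt(variance)

print(f"mean = {mean:.3f}, standard deviation = {standard_deviation:.3f}")
print(f"sqrt(mean) = {math.sqrt(mean):.3f}")
```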

The Poisson Distribution and Poisson Process Explained

Exponential distribution

How long does a car battery last? How long do phone calls last? When will the next earthquake occur? These durations typically follow the exponential distribution (which is strongly related to the Poisson distribution). I've shown this distribution below.

The exponential distribution

Power law distribution - finding fraud

How are incomes distributed in a population? How might you find fraud in the pattern of digits in expenses? It turns out, the distribution of the first digits in invoices follows a power-law distribution. The chart below shows a generic power-law distribution - for fraud detection, it's 'flipped'.
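The first-digit pattern used in fraud detection is usually known as Benford's law, which says the probability of a leading digit d is log10(1 + 1/d). I'm assuming that's the pattern meant here. The sketch below compares it with the first digits of a list of invoice amounts; the amounts are invented for illustration.

```python
import math
from collections import Counter

def benford_probability(digit):
    """Expected share of numbers whose first digit is `digit` under Benford's law."""
    return math.log10(1 + 1 / digit)

def first_digit_shares(amounts):
    """Observed share of each leading digit in a list of positive whole amounts."""
    digit_counts = Counter(int(str(amount)[0]) for amount in amounts)
    return {digit: digit_counts.get(digit, 0) / len(amounts) for digit in range(1, 10)}

# Hypothetical invoice amounts (whole currency units).
invoices = [1200, 1875, 132, 2450, 19, 310, 1040, 87, 560, 1999, 4100, 27]
observed_shares = first_digit_shares(invoices)

for digit in range(1, 10):
    print(f"{digit}: expected {benford_probability(digit):.3f}, observed {observed_shares[digit]:.3f}")
```

With real expense data you'd want thousands of invoices before reading anything into the gaps between expected and observed.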

Power law distribution

Normal distribution - almost everywhere, but not quite

What's the probability distribution for male soldiers' chest measurements? How are the results of A/B tests distributed? What about the distribution of measurement errors? All these, and many, many more follow the normal distribution, which is also called the Gaussian distribution or the bell curve. If you only learn one distribution, this is the one to learn.

The properties of this distribution are extremely well-known, and every student of statistics and probability theory will know them. It's ubiquitous because of something called the Central Limit Theorem, which, simplifying a great deal, says that the sum (or mean) of a large number of independent samples from almost any distribution is approximately normally distributed.
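You can see the Central Limit Theorem at work with a short simulation. This sketch adds up 30 draws from a uniform distribution, which looks nothing like a bell curve, and checks that the sums behave like a normal distribution. The sample sizes and the seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

SAMPLES_PER_SUM = 30
NUMBER_OF_SUMS = 10_000

# Each entry is the sum of 30 independent uniform(0, 1) draws.
sums = [
    sum(rng.random() for _ in range(SAMPLES_PER_SUM))
    for _ in range(NUMBER_OF_SUMS)
]

mean_of_sums = statistics.mean(sums)
sd_of_sums = statistics.stdev(sums)

# For a normal distribution, about 68% of values lie within one standard deviation.
share_within_one_sd = sum(
    abs(value - mean_of_sums) <= sd_of_sums for value in sums
) / NUMBER_OF_SUMS

print(f"mean = {mean_of_sums:.2f}, sd = {sd_of_sums:.2f}")
print(f"share within one sd = {share_within_one_sd:.3f}")
```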

Because it's everywhere, for some people, it's the only distribution they know. Like the old saying goes, if you only have a hammer, every problem is a nail. This distribution can be over-used, with bad consequences.

Here's the distribution. It ought to look familiar.

The normal distribution

Lognormal distribution

How long do visitors spend on web pages? What about the distribution of internet traffic? Or the distribution of city sizes? These all follow a log-normal distribution that looks like the example below. The lognormal distribution is quite common in business.

Note the 'fat tail' or 'long tail' on the right-hand side. Many businesses have been caught out because they assumed sales or market risk followed a normal distribution when in fact they followed a lognormal distribution.

There's a variation of the Central Limit Theorem that yields log-normal distributions instead of normal distributions.

Other distributions

There are lots and lots of different distributions. I saw a list of 90 the other day. Almost all of them are esoteric and apply in a very limited set of cases. You don't have to know all of them but you should be aware that choosing the right distribution is important to make the correct estimates. The distributions I've listed in this blog post are probably the most important, and you should know them and their properties.

As you asked nicely, here is a list of some distributions.

Alpha Distribution
Anglit Distribution
Arcsine Distribution
Beta Distribution
Beta Prime Distribution
Burr Distribution
Burr12 Distribution
Cauchy Distribution
Chi Distribution
Chi-squared Distribution
Cosine Distribution
Double Gamma Distribution
Double Weibull Distribution
Erlang Distribution
Exponential Distribution
Exponentiated Weibull Distribution
Exponential Power Distribution
Fatigue Life (Birnbaum-Saunders) Distribution
Fisk (Log Logistic) Distribution
Folded Cauchy Distribution
Folded Normal Distribution
Fratio (or F) Distribution
Gamma Distribution
Generalized Logistic Distribution
Generalized Pareto Distribution
Generalized Exponential Distribution
Generalized Extreme Value Distribution
Generalized Gamma Distribution
Generalized Half-Logistic Distribution
Generalized Inverse Gaussian Distribution
Generalized Normal Distribution
Gibrat Distribution
Gompertz (Truncated Gumbel) Distribution
Gumbel (LogWeibull, Fisher-Tippett, Type I Extreme Value) Distribution
Gumbel Left-skewed (for minimum order statistic) Distribution
HalfCauchy Distribution
HalfNormal Distribution
Half-Logistic Distribution
Hyperbolic Secant Distribution
Gauss Hypergeometric Distribution
Inverted Gamma Distribution
Inverse Normal (Inverse Gaussian) Distribution
Inverted Weibull Distribution
Johnson SB Distribution
Johnson SU Distribution
KSone Distribution
KStwo Distribution
KStwobign Distribution
Laplace (Double Exponential, Bilateral Exponential) Distribution
Left-skewed Lévy Distribution
Lévy Distribution
Logistic (Sech-squared) Distribution
Log Double Exponential (Log-Laplace) Distribution
Log Gamma Distribution
Log Normal (Cobb-Douglas) Distribution
Log-Uniform Distribution
Maxwell Distribution
Mielke’s Beta-Kappa Distribution
Nakagami Distribution
Noncentral chi-squared Distribution
Noncentral F Distribution
Noncentral t Distribution
Normal Distribution
Normal Inverse Gaussian Distribution
Pareto Distribution
Pareto Second Kind (Lomax) Distribution
Power Log Normal Distribution
Power Normal Distribution
Power-function Distribution
R Distribution
Rayleigh Distribution
Rice Distribution
Reciprocal Inverse Gaussian Distribution
Semicircular Distribution
Student t Distribution
Trapezoidal Distribution
Triangular Distribution
Truncated Exponential Distribution
Truncated Normal Distribution
Tukey-Lambda Distribution
Uniform Distribution
Von Mises Distribution
Wald Distribution
Weibull Maximum Extreme Value Distribution
Weibull Minimum Extreme Value Distribution
Wrapped Cauchy Distribution

Continuous or discrete - shaken or stirred?

Some quantities are discrete and some are continuous. A discrete quantity is something like a sales territory (e.g. Germany, Ireland, Spain) or customer count (you can't have 0.5 of a customer). A continuous quantity can take any value, for example, speed can be 45.2 kph, 120.01 kph, and so on. Some distributions apply to both continuous and discrete, and some apply only to continuous or discrete. To muddy the waters, sometimes continuous distributions are used to approximately model discrete quantities.

Vehicles

Imagine you're running a delivery vehicle fleet. You need to keep your vehicles on the road, but you need to keep an eye on maintenance costs. You decide to use math to guide your decisions, so you work out the average lifetime for different components. You have two components A and B with the same lifetimes in miles. If either component fails, you have to tow the vehicle, which is very expensive.

• Component A. Lifetime is 150,000 miles.
• Component B. Lifetime is 150,000 miles.

A vehicle comes in for maintenance with 149,000 miles on the odometer. Should you replace components A and B?

As you might expect, there's a gotcha. Without knowing the probability distribution for failures, we can't make these decisions. For example, a windshield might have a uniform failure rate, with the probability of failure for miles 0-100 the same as the probability of failure for miles 100,000-100,100. A clutch may have a failure rate that increases with mileage, the probability of failure at miles 100,000-100,100 being much higher than the probability of failure at miles 0-100. Because we know what a clutch and a windshield are, we might decide to replace the clutch and leave the windshield. But what if A and B were a serpentine belt and a heat shield?

The only way to make rational decisions is to understand what distribution the probability of failure follows, which may well be very different for different components (e.g. car seats vs. tires).
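Here's a sketch of how two components with the same 150,000 mile average lifetime can call for different decisions. I'm assuming a Weibull lifetime model, which is my choice for illustration: a shape of 1 gives a constant failure rate (like the windshield) and a shape above 1 gives a failure rate that rises with mileage (like the clutch). The shape value of 4 is hypothetical.

```python
import math

def failure_probability(miles_so_far, extra_miles, shape, mean_lifetime):
    """Chance of failing in the next `extra_miles`, given survival to `miles_so_far`."""
    # Pick the Weibull scale so that the average lifetime equals `mean_lifetime`.
    scale = mean_lifetime / math.gamma(1 + 1 / shape)

    def survival(miles):
        return math.exp(-((miles / scale) ** shape))

    return 1 - survival(miles_so_far + extra_miles) / survival(miles_so_far)

MEAN_LIFETIME = 150_000
NEXT_INTERVAL = 1_000  # miles until the next service (hypothetical)

for label, shape in (("constant failure rate", 1), ("wear-out failure", 4)):
    when_new = failure_probability(0, NEXT_INTERVAL, shape, MEAN_LIFETIME)
    at_149k = failure_probability(149_000, NEXT_INTERVAL, shape, MEAN_LIFETIME)
    print(f"{label}: new = {when_new:.6f}, at 149,000 miles = {at_149k:.6f}")
```

Under these assumptions, the constant-rate component is no more likely to fail at 149,000 miles than when it was new, while the wear-out component is far riskier.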

Marketing

A new analyst is studying the market for luxury goods in Germany. They have partial data for the fraction of the population that have a certain income. Using what they have, they assume their data is normally distributed and they make a forecast for the fraction of the population that will have an income high enough to afford luxury items. Do you think their forecast will be too low, just right, or too high?

Incomes are usually log-normally distributed, so the analyst, in this case, has chosen the wrong distribution. Because the lognormal has a very long right tail, the analyst's estimate is likely to be an underestimate and may be substantially out. A competitor might not make the same mistake.
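To show the size of the error, here's a sketch that compares the two assumptions for a threshold far out in the right tail. Every number in it (median income, spread, luxury threshold) is hypothetical; the normal distribution is given the same mean and standard deviation as the lognormal so the comparison is fair.

```python
import math
from statistics import NormalDist

MEDIAN_INCOME = 40_000      # hypothetical
LOG_SPREAD = 0.8            # hypothetical standard deviation of log-income
LUXURY_THRESHOLD = 250_000  # hypothetical income needed to afford luxury goods

# Lognormal: the logarithm of income is normally distributed.
log_income = NormalDist(math.log(MEDIAN_INCOME), LOG_SPREAD)
lognormal_share = 1 - log_income.cdf(math.log(LUXURY_THRESHOLD))

# A normal distribution with the same mean and standard deviation as the lognormal.
mean_income = math.exp(math.log(MEDIAN_INCOME) + LOG_SPREAD**2 / 2)
sd_income = mean_income * math.sqrt(math.exp(LOG_SPREAD**2) - 1)
normal_share = 1 - NormalDist(mean_income, sd_income).cdf(LUXURY_THRESHOLD)

print(f"lognormal: {lognormal_share:.4%} of the population above the threshold")
print(f"normal:    {normal_share:.4%} of the population above the threshold")
```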

Takeaways

I've interviewed people who claim data science on their resumes, but only know the normal distribution. If you assume your data is normal, when in reality it's log-normal or Poisson, things are going to go badly wrong for you. Any analyst in business needs to be very comfortable with different distributions and needs to know which may be applicable and when.

An offer you can't refuse?

Imagine you're in a casino playing craps, a game where you bet on the outcome of two dice thrown at the same time. The probability of a double six coming up is 1/36, but no-one has thrown a double six for over 110 throws. The table is starting to get crowded and noisy with people betting on a double six. It's due to come up, and it must come up soon.

(Still no double six. Source: Wikimedia Commons. License: Creative Commons. Author: Gaz.)

A new player rolls the dice; snake-eyes (double ones) - still no double six.

You feel a tap on your elbow. A lady in a cocktail dress whispers to you that she'll give you odds of 20 to 1 for a double six.

Another player rolls the dice; easy-four (one and three) - the expectation for a double six mounts.

Your new friend whispers that she'll reduce the odds soon; she asks if you want to take the bet.

It's now 130 throws since a double six has occurred and it should have occurred 3 or 4 times by now.

Do you take the bet?

The gambler's fallacy

The gambler's fallacy is the belief that the outcome of a random event is somehow influenced by the previous random events. In our craps case, some examples might be:

• double six hasn't come up in 130 throws, so it's much more likely to come up now (the probability is higher than 1/36)
• double one has just come up, therefore it's not likely to come up again soon (the probability is less than 1/36).

It's a fallacy because each roll of the dice is completely independent; it doesn't matter what the previous throws were. There could have been 1,000 throws without a double six, but the probability of a double six will always be 1/36. The same goes for the snake-eyes example: if a snake-eyes has been thrown, the probability of throwing another snake-eyes immediately after is still 1/36.

Let me lay this out even more starkly, in craps:

• At the very first roll of the dice, the probability of a double six is 1/36.
• After ten rolls of the dice, the probability of the next roll being a double six is 1/36.
• After 100 rolls without a double six, the probability of the next roll being a double six is 1/36.
• After 200 rolls without a double six, the probability of the next roll being a double six is 1/36.
• After 1,000 rolls without a double six, the probability of the next roll being a double six is 1/36.
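The list above can be checked with a short sketch. It works out how unlikely a 130-throw drought is (roughly 2.6%), then simulates dice throws to measure how often a double six follows a long drought. The drought length of 50 and the number of simulated throws are arbitrary choices that keep the simulation quick.

```python
import random
from fractions import Fraction

P_DOUBLE_SIX = Fraction(1, 36)

# A long drought is rare before it happens...
probability_of_130_throw_drought = float((1 - P_DOUBLE_SIX) ** 130)

def double_six_rate_after_drought(total_throws, drought_length, seed=0):
    """Share of double sixes among throws that follow a drought of at least `drought_length`."""
    rng = random.Random(seed)
    throws_since_double_six = 0
    throws_after_drought = 0
    double_sixes_after_drought = 0
    for _ in range(total_throws):
        die_one, die_two = rng.randint(1, 6), rng.randint(1, 6)
        is_double_six = die_one == 6 and die_two == 6
        if throws_since_double_six >= drought_length:
            throws_after_drought += 1
            double_sixes_after_drought += is_double_six
        throws_since_double_six = 0 if is_double_six else throws_since_double_six + 1
    return double_sixes_after_drought / throws_after_drought

# ...but once you're in one, the next throw is still a 1 in 36 shot.
rate_after_drought = double_six_rate_after_drought(200_000, 50)
print(f"P(no double six in 130 throws) = {probability_of_130_throw_drought:.4f}")
print(f"double six rate after a drought = {rate_after_drought:.4f} (1/36 = {1 / 36:.4f})")
```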

Otherwise rational people are fooled by the gambler's fallacy all the time. As the money increases and the emotion heightens, the gambler's fallacy becomes easier and easier to fall for, as we'll see.

The Italian lottery

The story starts in Venice, Italy in May 2003. The Venice lottery was a game where 6 numbered balls (plus a bonus ball) were selected from a set of 90 numbered balls. The lottery was run twice a week. Each number should come out on average once every 7-8 weeks. As with all government-sponsored lotteries, the results were well-publicized.

In May 2003, the number 53 came up. Then it didn't come up again.

By October, people realized the number 53 was overdue. They started to gamble on 53 occurring - it was overdue, so it must come up. But 53 just didn't come up.

News of the 53 drought started to spread, and more and more Italians started to bet that 53 would occur, but it didn't. It didn't come up in November or December either.

In January of 2004, a woman from Carrara committed suicide because she'd spent her family's life savings gambling that 53 would come up. It didn't.

Still, 53 didn't come up.

People went crazy betting money that 53 would come up; they became known as '53 addicts'. They were sure it must come up. Sadly, it didn't. A man from Signa shot his wife, his son, and himself after losing money gambling on 53.

Still, 53 didn't come up.

Italians gambled and lost a huge amount of money on 53, an estimated 4 billion Euros. They had fallen for the gambler's fallacy and believed that 53 must come up soon.

Eventually, 53 did come up - in February 2005, after 182 draws (remember, each draw was seven balls).
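A drought that long is extraordinarily unlikely for any one number chosen in advance, but that has no bearing on the next draw. Here's a sketch using the figures above (seven balls drawn from 90, 182 draws), assuming each draw is independent of the others.

```python
from fractions import Fraction

TOTAL_BALLS = 90
BALLS_PER_DRAW = 7  # six numbered balls plus the bonus ball
DROUGHT_DRAWS = 182

# Chance that one particular number appears in a single draw.
probability_in_one_draw = Fraction(BALLS_PER_DRAW, TOTAL_BALLS)

# Chance that a particular number misses 182 draws in a row.
probability_of_drought = float((1 - probability_in_one_draw) ** DROUGHT_DRAWS)

print(f"P(53 misses {DROUGHT_DRAWS} draws in a row) = {probability_of_drought:.2e}")
# However long the drought, the chance on the next draw is unchanged.
print(f"P(53 appears in the next draw) = {float(probability_in_one_draw):.4f}")
```

The drought probability comes out well under one in a million, yet on every single draw the chance of 53 appearing stayed at 7 in 90.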

The Venice lottery made a lot of money, but the Italian gamblers did not.

How the cocktail dress lady (and casinos) makes money

To understand if the cocktail dress lady was offering a good deal, we need to relate probability to odds.

The probability of a double six is 1/36.

The odds are the probability the event will occur divided by the probability the event will not occur:

$odds = \frac{P}{1-P}$

The odds of a double six are:

$odds_{66} = \frac{\frac{1}{36}}{\frac{35}{36}} = \frac {1}{35}$

which a bookie might quote as 35 to 1.

Generally speaking, casinos and bookies make money in one of two ways:

• The probabilities don't add up to 1.
• They rely on the gambler's fallacy and offer worse odds than a fair analysis would suggest.

Let's imagine there are ten horses in a race. Each horse has a 10% chance of winning, which is odds of 9 to 1. If you win, you get your stake money back, so a winning bet of \$1 gives you \$10. If ten punters each bet \$1 on a different horse, the bookie takes \$10, but one of the horses must win, so the bookie pays out \$10.

(Bookmakers make money. You don't. Image source: Wikimedia Commons. License: Creative Commons. Author: Grand Island Tourism)

To make money, the bookie reduces the odds. Instead of offering 9 to 1 on each of the horses, the bookie offers 8 to 1. The bookie still takes in \$10, but this time only pays out \$9. In the real world, it's more complicated, but you get the idea.
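The horse race arithmetic can be sketched in a few lines. Odds of "N to 1" imply a probability of 1/(N + 1), and the bookie's edge shows up when the implied probabilities for all the horses add up to more than 1. The function names are my own.

```python
def implied_probability(odds_to_one):
    """Probability implied by odds quoted as `odds_to_one` to 1."""
    return 1 / (odds_to_one + 1)

def bookie_profit(number_of_horses, odds_to_one, stake=1):
    """Profit when one punter backs each horse with the same stake."""
    takings = number_of_horses * stake
    payout = stake * (odds_to_one + 1)  # winnings plus the returned stake
    return takings - payout

NUMBER_OF_HORSES = 10

fair_book = NUMBER_OF_HORSES * implied_probability(9)
reduced_book = NUMBER_OF_HORSES * implied_probability(8)

print(f"9 to 1: probabilities sum to {fair_book:.3f}, profit = {bookie_profit(NUMBER_OF_HORSES, 9)}")
print(f"8 to 1: probabilities sum to {reduced_book:.3f}, profit = {bookie_profit(NUMBER_OF_HORSES, 8)}")
```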

The other way to make money is to offer odds that pay out less than the true probability justifies. A double six should be offered at 35 to 1, but you could offer it at 20 to 1. This is a horrible deal, but if gamblers have a bad case of the gambler's fallacy, they may be convinced the probability is much higher than 1/36 and they may even view a horrible deal as the deal of a lifetime. The casino, or the lady in the cocktail dress, makes money by knowing the odds and knowing when to offer a deal that seems attractive, but isn't.

Not only should you not accept the 20 to 1 offer, you should offer it to other players.
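To see why, here's a sketch of the expected profit per \$1 staked on a double six at different odds. It uses exact fractions so there's no rounding.

```python
from fractions import Fraction

P_DOUBLE_SIX = Fraction(1, 36)

def expected_profit_per_dollar(odds_to_one, win_probability):
    """Average profit for the bettor on a 1 dollar stake."""
    # Win: collect `odds_to_one` dollars. Lose: forfeit the 1 dollar stake.
    return win_probability * odds_to_one - (1 - win_probability) * 1

fair_bet = expected_profit_per_dollar(35, P_DOUBLE_SIX)
cocktail_dress_bet = expected_profit_per_dollar(20, P_DOUBLE_SIX)

print(f"35 to 1: expected profit = {float(fair_bet):+.3f} per dollar")
print(f"20 to 1: expected profit = {float(cocktail_dress_bet):+.3f} per dollar")
```

At 20 to 1, the bettor loses about 42 cents on average for every dollar staked, and whoever takes the other side of the bet gains it.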

Gambler's fallacy in Reno, Nevada and Monte Carlo

Obviously, there are naive gamblers in Las Vegas, but do people really fall for the gambler's fallacy at the roulette table? After all, you have to have some level of sophistication to understand and play the game, so surely gamblers are savvy and know how to price bets appropriately? It seems that they don't always.

Using videotape data supplied by a casino in Reno, Nevada, two researchers tracked the pattern of gambling on roulette. If gamblers have fallen for the gambler's fallacy, you might expect to see certain patterns of betting, for example, if red hasn't come up as often as expected, they might bet more on red. The researchers found small but significant examples of the gambler's fallacy. The reality, then, is that there are people who fall for the fallacy, even those playing a sophisticated game like roulette.

(Image source: Wikimedia Commons. License: Creative Commons. Author: Ken Lund.)

Another object lesson in the gambler's fallacy occurred at a roulette table, this time in a casino in Monte Carlo. In 1943, the ball landed on red 32 times in a row. The people who thought black must come up were cleaned out.

The gambler's fallacy elsewhere

The gambler's fallacy has been an active area of research for some time. Variations of it have been found in different places:

Let's imagine you're an asylum judge. You're aware of the average 'success' rate for applicants and you don't want to be too far from the average. Let's assume the cases are randomly assigned (deserving and undeserving). By random chance, you might get a long string of deserving or undeserving cases, maybe as many as twenty in a row. The gambler's fallacy may kick in after a series of similar cases. For example, the first ten cases were deserving, so the eleventh 'must' be undeserving; as a result, you judge more harshly based on expectation.

If you listen closely enough, you hear business people commit the gambler's fallacy all the time. How often have you heard these kinds of phrases:

• We've won the last 8 contracts, so we must win the next one.
• We just failed to land the last 6 deals, so the odds of us landing the next deal are high.

Despite what people say, business can be strongly driven by belief and not rationality. If everyone needs a deal to be landed, then the collective view might become that a deal will be landed, regardless of what a realistic measure of the probabilities is.

How to guard against the gambler's fallacy

There's something about humanity and our (mis)understanding of statistics that makes us vulnerable to the gambler's fallacy. The best teacher might be experience. How many Italians who bet on 53 would do so again? There's some evidence that the gambler's fallacy is particularly strong when the data evolves over time, which ties in with the Italian lottery and casino examples. Perhaps the best defense is to take a step back and view the data as a whole, then make a decision away from the influence of others.

The existence of opulent casinos should be a lesson that those who understand probability can make money from those who do not.

The summary is not the whole picture

If you just use summary statistics to describe your data, you can miss the bigger picture, sometimes literally so. In this blog post, I'm going to show you how relying on summaries alone can lead you catastrophically astray and I'm going to tell you how you can avoid making career-damaging mistakes.

The datasaurus is why you need to visualize your data. Source: Alberto Cairo. Open source.

What are summary statistics?

Summary statistics are parameters like the mean, standard deviation, and correlation coefficient; they summarize the properties of the data and the relationship between variables. For example, if the correlation coefficient, r, is about 0.8 for two data sets x and y, we might think there's a relationship between them, but if it's about 0, we might think there isn't.

The use of summary statistics is widely taught, every textbook emphasizes them, and almost everyone uses them. But if you use summary statistics in isolation from other methods you might miss important relationships - you should always visualize your data as we'll see.

Anscombe's Quartet

Take a look at the four plots below. They're obviously quite different, but they all have the same summary statistics!

Here are the summary statistics data:

| Property | Value |
| --- | --- |
| Mean of x | 9 |
| Sample variance of x, $\sigma^{2}$ | 11 |
| Mean of y | 7.50 |
| Sample variance of y, $\sigma^{2}$ | 4.125 |
| Correlation between x and y | 0.816 |
| Linear regression line | y = 3.00 + 0.500x |
| Coefficient of determination of the linear regression, $R^{2}$ | 0.67 |

These plots were developed in 1973 by the statistician Francis Anscombe to make exactly this point: you can't rely on summary statistics, you need to visualize your data. The graphical relationships between the x and y variables are different in each case and imply different things. By plotting the data out, we can see what the relationships are, but summary statistics hide what's going on.
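You can check the summary statistics yourself. The sketch below uses the data values as published for Anscombe's quartet and computes the mean, sample variance, and correlation for each of the four sets. The correlation helper is my own, written with the standard library only.

```python
from statistics import mean, variance

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    co_spread = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_spread / (spread_x * spread_y) ** 0.5

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # sets I to III share x values
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summaries = {}
for name, (xs, ys) in quartet.items():
    summaries[name] = (mean(xs), variance(xs), mean(ys), variance(ys), pearson_correlation(xs, ys))
    mean_x, var_x, mean_y, var_y, correlation = summaries[name]
    print(f"{name:>3}: mean x = {mean_x:.2f}, var x = {var_x:.2f}, "
          f"mean y = {mean_y:.2f}, var y = {var_y:.2f}, r = {correlation:.3f}")
```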

The datasaurus

Let's zoom forward to 2016. The justly famous Alberto Cairo tweeted about Anscombe's quartet and illustrated the point with this cool set of summary statistics. He later expanded on his tweet in a short blog post.

| Property | Value |
| --- | --- |
| n | 142 |
| x mean | 54.2633 |
| x standard deviation | 16.7651 |
| y mean | 47.8323 |
| y standard deviation | 26.9353 |
| Pearson correlation | -0.0645 |

What might you conclude from these summary statistics? I might say, the correlation coefficient is close to zero so there's not much of a relationship between the x and the y variables. I might conclude there's no interesting relationship between the x and y variables - but I would be wrong.

The summary might not mean anything to you, but the visualization surely will. This is the datasaurus data set, the x and the y variables draw out a dinosaur.

The datasaurus dozen

Two researchers at Autodesk Research took things a stage further. They started with Alberto Cairo's datasaurus and created a dozen other charts with exactly the same summary statistics as the datasaurus. Here they all are.

The summary statistics look like noise, but the charts reveal the underlying relationships between the x and y variables. Some of these relationships are obviously fun, like the star, but there are others that imply more meaningful relationships.

If all this sounds a bit abstract, let's think about how this might manifest itself in business. Let's imagine you're an analyst working for a large company. You have data on sales by store size for Europe and you've been asked to analyze the data to gain insights. You're under time pressure, so you fire up a Python notebook and get some quick summary statistics. You get summary statistics that look like the ones I showed you above. So you conclude there's nothing interesting in the data; but you might be very wrong.

You should plot the data out and look at the chart. You might see something that looks like the slanting charts above, maybe something like this:

The individual diagonal lines might correspond to different European countries (different regulations, different planning rules, different competition, etc.). There could be a very significant relationship that you would have missed by relying on summary data.
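Here's a sketch of how that can happen. The data is synthetic and deliberately contrived: five made-up countries, each with a perfect straight-line relationship between store size and sales, arranged so that the pooled correlation is zero. The country names and numbers are all hypothetical.

```python
def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    co_spread = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_spread / (spread_x * spread_y) ** 0.5

countries = ["Country A", "Country B", "Country C", "Country D", "Country E"]  # hypothetical
data = {}
for index, country in enumerate(countries):
    offsets = range(-5, 6)
    # Within a country, sales rise with store size (one diagonal line per country),
    # but countries with bigger stores start from a lower sales baseline.
    store_sizes = [20 + 5 * index + offset for offset in offsets]
    sales = [10 - index + offset for offset in offsets]
    data[country] = (store_sizes, sales)

all_sizes = [size for sizes, _ in data.values() for size in sizes]
all_sales = [sale for _, sales in data.values() for sale in sales]

pooled_correlation = pearson_correlation(all_sizes, all_sales)
country_correlations = {
    country: pearson_correlation(sizes, sales) for country, (sizes, sales) in data.items()
}

print(f"pooled correlation = {pooled_correlation:.3f}")
for country, correlation in country_correlations.items():
    print(f"{country}: correlation = {correlation:.3f}")
```

The pooled number says there's nothing to see, while every country on its own shows a strong relationship. A scatter plot colored by country would reveal this at a glance.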

(The Autodesk Research team has posted their work as a paper you can read here.)

Lessons learned

The lessons you should take away from all this are simple:
• summary statistics hide a lot
• there are many relationships between variables that will give summary statistics that look like noise