Showing posts with label data science. Show all posts
Showing posts with label data science. Show all posts

# Knowing the Central Limit Theorem means avoiding costly mistakes

I've spoken to well-meaning analysts who've made significant mistakes because they don't understand the implications of one of the core principles of statistics; the Central Limit Theorem (CLT). These errors weren't trivial either, they affected salesperson compensation and the analysis of A/B tests. More personally, I've interviewed experienced candidates who made fundamental blunders because they didn't understand what this theorem implies.

The CLT is why the mean and standard deviation work pretty much all the time but it's also why they only work when the sample size is "big enough". It's why when you're estimating the population mean it's important to have as large a sample size as you can. It's why we use the Student's t-test for small sample sizes and why other tests are appropriate for large sample sizes.

In this blog post, I'm going to explain what the CLT is, some of the theory behind it (at a simple level), and how it drives key business statistics. Because I'm trying to communicate some fundamental ideas, I'm going to be imprecise in my language at first and add more precision as I develop the core ideas. As a bonus, I'll throw in a different version of the CLT that has some lesser-known consequences.

# How we use a few numbers to represent a lot of numbers

In all areas of life, we use one or two numbers to represent lots of numbers. For example, we talk about the average value of sales, the average number of goals scored per match, average salaries, average life expectancy, and so on. Usually, but not always, we get these numbers through some form of sampling, for example, we might run a salary survey asking thousands of people their salary, and from that data work out a mean (a sampling mean). Technically, the average is something mathematicians call a "measure of central tendency" which we'll come back to later.

We know not everyone will earn the mean salary and that in reality, salaries are spread out. We express the spread of data using the standard deviation. More technically, we use something called a confidence interval which is based on the standard deviation. The standard deviation (or confidence interval) is a measure of how close we think our sampling mean is to the true (population) mean.

In practice, we use standard formula for the mean and standard deviation. These are available as standard functions in spreadsheets and programming languages. Mathematically, this is how they're expressed.

$sample\; mean\; \bar{x}= \frac{1}{N}\sum_{i=0}^{N}x_i$

$sample\; standard\; deviation\; s_N = \sqrt{\frac{1}{N} \sum_{i=0}^{N} {\left ( x_i - \bar{x} \right )} ^ 2 }$

All of this seems like standard stuff, but there's a reason why it's standard, and that's the central limit theorem (CLT).

# The CLT

Let's look at three different data sets with different distributions: uniform, Poisson, and power law as shown in the charts below.

These data sets are very, very different. Surely we have to have different averaging and standard deviation processes for different distributions? Because of the CLT, the answer is no.

In the real world, we sample from populations and take an average (for example, using a salary survey), so let's do that here. To get going, let's take 100 samples from each distribution and work out a sample mean. We'll do this 10,000 times so we get some kind of estimate for how spread out our sample means are.

The top charts show the original population distribution and the bottom charts show the result of this sampling means process. What do you notice?

The distribution of the sampling means is a normal distribution regardless of the underlying distribution.

This is a very, very simplified version of the CLT and it has some profound consequences, the most important of which is that we can use the same averaging and standard deviation functions all the time.

# Some gentle theory

Proving the CLT is very advanced and I'm not going to do that here. I am going to show you through some charts what happens as we increase the sample size.

Imagine I start with a uniform random distribution like the one below.

I want to know the mean value, so I take some samples and work out a mean for my samples. I do this lots of times and work out a distribution for my mean. Here's what the results look like for a sample size of 2, 3,...10,...20,...30,...40.

As the sample size gets bigger, the distribution of the means gets closer to a normal distribution. It's important to note that the width of the curve gets narrower with increasing sample size. Once the distribution is "close enough" to the normal distribution (typically, around a sample size of 30), you can use normal distribution methods like the mean and standard deviation.

The standard deviation is a measure of the width of the normal distribution. For small sample sizes, the standard deviation underestimates the width of the distribution, which has important consequences.

Of course, you can do this with almost any underlying distribution, I'm just using a uniform distribution because it's easier to show the results

# Implications for averages

The charts above show how the distribution of the means changes with sample size. At low sample sizes, there are a lot more "extreme" values as the difference between the sample sizes of 2 and 40 shows.  Bear in mind, the width of the distribution is an estimate of the uncertainty in our measurement of the mean.

For small sample sizes, the mean is a poor estimator of the "average" value; it's extremely prone to outliers as the shape of the charts above indicates. There are two choices to fix the problem: either increase the sample size to about 30 or more (which often isn't possible) or use the median instead (the median is much less prone to outliers, but it's harder to calculate).

The standard deviation (and the related confidence interval) is a measure of the uncertainty in the mean value. Once again, it's sensitive to outliers. For small sample sizes, the standard deviation is a poor estimator for the width of the distribution. There are two choices to fix the problem, either increase the sample size to 30 or more (which often isn't possible) or use quartiles instead (for example, the interquartile range, IQR).

If this sounds theoretical, let me bring things down to earth with an example. Imagine you're evaluating salesperson performance based on deals closed in a quarter. In B2B sales, it's rare for a rep to make 30 sales in a quarter, in fact, even half that number might be an outstanding achievement. With a small number of samples, the distribution is very much not normal, and as we've seen in the charts above, it's prone to outliers. So an analysis based on mean sales with a standard deviation isn't a good idea; sales data is notorious for outliers. A much better analysis is the median and IQR. This very much matters if you're using this analysis to compare rep performance.

# Implications for statistical tests

A hundred years ago, there were very few large-scale tests, for example, medical tests typically involved small numbers of people. As I showed above, for small sample sizes the CLT doesn't apply. That's why Gosset developed the Student's t-distribution: the sample sizes were too small for the CLT to kick in, so he needed a rigorous analysis procedure to account for the wider-than-normal distributions. The point is, the Student's t-distribution applies when sample sizes are below about 30.

Roll forward 100 years and we're now doing retail A/B testing with tens of thousands of samples or more. In large-scale A/B tests, the z-test is a more appropriate test. Let me put this bluntly: why would you use a test specifically designed for small sample sizes when you have tens of thousands of samples?

It's not exactly wrong to use the Student's t-test for large sample sizes, it's just dumb. The special features of the Student's t-test that enable it to work with small sample sizes become irrelevant. It's a bit like using a spanner as a hammer; if you were paying someone to do construction work on your house and they were using the wrong tool for something simple, would you trust them with something complex?

I've asked about statistical tests at interview and I've been surprised at the response. Many candidates have immediately said Student's t as a knee-jerk response (which is forgivable). Many candidates didn't even know why Student's t was developed and its limitations (not forgivable for senior analytical roles). One or two even insisted that Student's t would still be a good choice even for sample sizes into the hundreds of thousands. It's very hard to progress candidates who insist on using the wrong approach even after it's been pointed out to them.

As a practical matter, you need to know what statistical tools you have available and their limitations.

# Implications for sample sizes

I've blithely said that the CLT applies above a sample size of 30. For "most" distributions, a sample size of about 30 is a reasonable rule-of-thumb, but there's no theory behind it. There are cases where a sample size of 30 is insufficient.

At the time of writing, there's a discussion on the internet about precisely this point. There's a popular article on LessWrong that illustrates how quickly convergence to the normal can happen: https://www.lesswrong.com/posts/YM6Qgiz9RT7EmeFpp/how-long-does-it-take-to-become-gaussian but there's also a counter article that talks about cases where convergence can take much longer: https://two-wrongs.com/it-takes-long-to-become-gaussian

The takeaway from this discussion is straightforward. Most of the time, using a sample size of 30 is good enough for the CLT to kick-in, but occasionally you need larger sample sizes. A good way to test this is to use larger sample sizes and see if there's any trend in the data.

# General implications

The CLT is a double-edged sword: it enables us to use the same averaging processes regardless of the underlying distribution, but it also lulls us into a false sense of security and analysts have made blunders as a result.

Any data that's been through an averaging process will tend to follow a normal distribution. For example, if you were analyzing average school test scores you should expect them to follow a normal distribution, similarly for transaction values by retail stores, and so on. I've seen data scientists claim brilliant data insights by announcing their data is normally distributed, but they got it through an averaging process, so of course it was normally distributed.

The CLT is one of the reasons why the normal distribution is so prevalent, but it's not the only reason and of course, not all data is normally distributed. I've seen junior analysts make mistakes because they've assumed their data is normally distributed when it wasn't.

# A little more rigor

I've been deliberately loose in my description of the CLT so far so I can explain the general idea. Let's get more rigorous so we can dig into this a bit more. Let's deal with some terminology first.

## Central tendency

In statistics, there's something called a "central tendency" which is a measurement that summarizes a set of data by giving a middle or central value. This central value is often called the average. More formally, there are three common measures of central tendency:

• The mode. This is the value that occurs most often.
• The median. Rank order the data and this is the middle value.
• The mean. Sum up all the data and divide by the number of values.

These three measures of central tendency have different properties, different advantages, and different disadvantages. As an analyst, you should know what they are.

(Depending on where you were educated, there might be some language issues here. My American friends tell me that in the US, the term "average" is always a synonym for the mean, in Britain, the term "average" can be the mean, median, or mode but is most often the mean.)

For symmetrical distributions, like the normal distribution, the mean, median, and mode are the same, but that's not the case for non-symmetrical distributions.

The term "central" in the central limit theorem is referring to the central or "average" value.

## iid

The CLT only applies to iid data, which means independent and identically distributed. Here's what iid means.

• Each sample in the data is independent of the other samples. This means selecting or removing a sample does not affect the value of another sample.
• All the samples come from the same probability distribution.

## When the CLT doesn't apply

Fortunately for us, the CLT applies to almost all distributions an analyst might come across, but there are exceptions. The underlying distribution must have a finite variance, which rules out using it with distributions like the Cauchy distribution. The samples must be iid as I said before.

## A re-statement of the CLT

Given data that's distributed with a finite variance and is iid, if we take n samples, then:

• as $$n \to \infty$$, the sample mean converges to the population mean
• as $$n \to \infty$$, the distribution of the sample means approximates a normal distribution

Note this formulation is in terms of the mean. This version of the CLT also applies to sums because the mean is just the sum divided by a constant (the number of samples).

# A different version of the CLT

There's another version of the CLT that's not well-known but does come up from time to time in more advanced analysis. The usual version of the CLT is expressed in terms of means (which is the sum divided by a constant). If instead of taking the sum of the samples, we take their product, then instead of the products tending to a normal distribution they tend to a log-normal distribution. In other words, where we have a quantity created from the product of samples then we should expect it to follow a log-normal distribution.

# What should I take away from all this?

Because of the CLT, the mean and standard deviation mostly work regardless of the underlying distribution. In other words, you don't have to know how your data is distributed to do basic analysis on it. BUT the CLT only kicks in above a certain sample size (which can vary with the underlying distribution but is usually around 30) and there are cases when it doesn't apply.

You should know what to do when you have a small sample size and know what to watch out for when you're relying on the CLT.

You should also understand that any process that sums (or products) data will lead to a normal distribution (or log-normal).

# What does data science boil down to?

Data science is a relatively new discipline that means different things to different people (most notably, to different employers). Some organizations focus solely on machine learning, while other lean on interpretation, and yet others get close to data engineering. In my view, all of these are part of the data science role.

I would argue data science generally is about three distinct areas:

• Prediction. The ability to accurately extrapolate from existing data sets to make forecasts about future behavior. This is the famous machine learning aspect and includes solutions like recommender systems.
• Distinction. The key question here is: "are these numbers different?". This includes the use of statistical techniques to decide if there's a difference or not, for example, specifying an A/B test and explaining its results.
• Interpretation. What are the factors that are driving the system? This is obviously related to prediction but has similarities to distinction too.

(A similar view of data science to mine: Calvin.Andrus, CC BY-SA 3.0, via Wikimedia Commons)

I'm going to talk through these areas and list the skills I think a data scientist needs. In my view, to be effective, you need all three areas. The real skill is to understand what type of problem you face and to use the correct approach.

# Distinction - are these numbers different?

This is perhaps the oldest area and the one you might disagree with me on. Distinction is firmly in the realm of statistics. It's not just about A/B tests or quasi-experimental tests, it's also about evaluating models too.

Here's what you need to know:

• Confidence intervals.
• Sample size calculations. This is crucial and often overlooked by experienced data scientists. If your data set is too small, you're going to get junk results so you need to know what too small is. In the real world. increasing the sample size is often not an option and you need to know why.
• Hypothesis testing. You should know the difference between a t-test and a z-test and when a z-test is appropriate (hint: sample size).
• α, β, and power. Many data scientists have no idea what statistical power is. If you're doing any kind of statistical testing, you need to have a firm grasp of power.
• The requirements for running a randomized control trial (RCT). Some experienced data scientists have told me they were analyzing results from an RCT, but their test just wasn't an RCT - they didn't really understand what an RCT was.
• Quasi-experimental methods. Sometimes, you just can't run an RCT, but there are other methods you can use including difference-in-difference, instrumental variables, and regression discontinuity.  You need to know which method is appropriate and when.
• Regression to the mean. This is why you almost always need a control group. I've seen experienced data scientists present results that could almost entirely be explained by regression to the mean. Don't be caught out by one of the fundamentals of statistics.

# Prediction - what will happen next?

This is the piece of data science that gets all the attention, so I won't go into too much detail.

Here's what you need to know:

• The basics of machine learning models, including:
• Generalized linear modeling
• Random forests (including knowing why they are often frowned upon)
• k-nearest neighbors/k-means clustering
• Support Vector Machines
• Cross-validation, regularization, and their limitations.
• Variable importance and principal component analysis.
• Loss functions, including RMSE.
• The confusion matrix, accuracy, sensitivity, specificity, precision-recall and ROC curves.

There's one topic that's not on any machine learning course or in any machine learning book that I've ever read, but it's crucially important: knowing when machine learning fails and when to stop a project.  Machine learning doesn't work all the time.

# Interpretation - what's going on?

The main techniques here are often data visualization. Statistical summaries are great, but they can often mislead. Charts give a fuller picture.

Here are some techniques all data scientists should know:

• Heatmaps
• Violin plots
• Scatter plots and curve fitting
• Bar charts
• Regression and curve fitting.

They should also know why pie charts in all their forms are bad.

A good knowledge of how charts work is very helpful too (the psychology of visualization).

# What about SQL and R and Python...?

You need to be able to manipulate data to do data science, which means SQL, Python, or R. But plenty of people use these languages without being data scientists. In my view, despite their importance, they're table stakes.

# Book knowledge vs. street knowledge

People new to data science tend to focus almost exclusively on machine learning (prediction in my terminology) which leaves them very weak on data analysis and data exploration; even worse, their lack of statistical knowledge sometimes leads them to make blunders on sample size and loss functions. No amount of cross-validation, regularization, or computing power will save you from poor modeling choices. Even worse, not knowing statistics can lead people to produce excellent models of regression to the mean.

Practical experience is hugely important; way more important than courses. Obviously, a combination of both is best, which is why PhDs are highly sought after; they've learned from experience and have the theoretical firepower to back up their practical knowledge.

# $$\beta$$ is $$\alpha$$ if there's an effect

In hypothesis testing, there are two kinds of errors:

• Type I - we say there's an effect when there isn't. The threshold here is $$\alpha$$.
• Type II - we say there's no effect when there really is an effect. The threshold here is $$\beta$$.

This blog post is all about explaining and calculating $$\beta$$.

# The null hypothesis

Let's say we do an A/B test to measure the effect of a change to a website. Our control branch is the A branch and the treatment branch is the B branch. We're going to measure the conversion rate $$C$$ on both branches. Here are our null and alternative hypotheses:

• $$H_0: C_B - C_A = 0$$ there is no difference between the branches
• $$H_1: C_B - C_A \neq 0$$ there is a difference between the branches

Remember, we don't know if there really is an effect, we're using procedures to make our best guess about whether there is an effect or not, but we could be wrong. We can say there is an effect when there isn't (Type I error) or we can say there is no effect when there is (Type II error).

Mathematically, we're taking the mean of thousands of samples so the central limit theorem (CLT) applies and we expect the quantity $$C_B - C_A$$ to be normally distributed. If there is no effect, then $$C_B - C_A = 0$$, if there is an effect $$C_B - C_A \neq 0$$.

# $$\alpha$$ in a picture

Let's assume there is no effect. We can plot out our expected probability distribution and define an acceptance region (blue, 95% of the distribution) and two rejection regions (red, 5% of the distribution). If our measured $$C_B - C_A$$ result lands in the blue region, we will accept the null hypothesis and say there is no effect, If our result lands in the red region, we'll reject the null hypothesis and say there is an effect. The red region is defined by $$\alpha$$.

One way of looking at the blue area is to think of it as a confidence interval around the mean $$x_0$$:

$\bar x_0 + z_\frac{\alpha}{2} s \; and \; \bar x_0 + z_{1-\frac{\alpha}{2}} s$

In this equation, s is the standard error in our measurement. The probability of a measurement $$x$$ lying in this range is:

$0.95 = P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ]$

If we transform our measurement $$x$$ to the standard normal $$z$$, and we're using a 95% acceptance region (boundaries given by $$z$$ values of 1.96 and -1.96), then we have for the null hypothesis:

$0.95 = P[-1.96 < z < 1.96]$

# $$\beta$$ in a picture

Now let's assume there is an effect. How likely is it that we'll say there's no effect when there really is an effect? This is the threshold $$\beta$$.

To draw this in pictures, I want to take a step back. We have two hypotheses:

• $$H_0: C_B - C_A = 0$$ there is no difference between the branches
• $$H_1: C_B - C_A \neq 0$$ there is a difference between the branches

We can draw a distribution for each of these hypotheses. Only one distribution will apply, but we don't know which one.

If the null hypothesis is true, the blue region is where our true negatives lie and the red region is where the false positives lie. The boundaries of the red/blue regions are set by $$\alpha$$. The value of $$\alpha$$ gives us the probability of a false positive.

If the alternate hypothesis is true, the true positives will be in the green region and the false negatives will be in the orange region. The boundary of the green/orange regions is set by $$\beta$$. The value of $$\beta$$ gives us the probability of a false negative.

# Calculating $$\beta$$

Calculating $$\beta$$ is calculating the orange area of the alternative hypothesis chart. The boundaries are set by $$\alpha$$ from the null hypothesis. This is a bit twisty, so I'm going to say it again with more words to make it easier to understand.

$$\beta$$ is about false negatives. A false negative occurs when there is an effect, but we say there isn't. When we say there isn't an effect, we're saying the null hypothesis is true. For us to say there isn't an effect, the measured result must lie in the blue region of the null hypothesis distribution.

To calculate $$\beta$$, we need to know what fraction of the alternate hypothesis lies in the acceptance region of the null hypothesis distribution.

Let's take an example so I can show you the process step by step.

1. Assuming the null hypothesis, set up the boundaries of the acceptance and rejection region. Assuming a 95% acceptance region and an estimated mean of x, this gives the acceptance region as:
$P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ]$ which is the mean and 95% confidence interval for the null hypothesis. Our measurement $$x$$ must lie between these bounds.
2. Now assume the alternate hypothesis is true. If the alternate hypothesis is true, then our mean is $$\bar x_1$$.
3. We're still using this equation from before, but this time, our distribution is the alternate hypothesis.
$P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] ]$
4. Transforming to the standard normal distribution using the formula $$z = \frac{x - \bar x_1}{\sigma}$$, we can write the probability $$\beta$$ as:
$\beta = P \left [ \frac{\bar x_0 + z_\frac{\alpha}{2} s - \bar x_1}{s} < z < \frac{ \bar x_0 + z_{1-\frac{\alpha}{2}} s - \bar x_1}{s} \right ]$

This time, let's put some numbers in.

• $$n = 200,000$$ (100,000 per branch)
• $$C_B = 0.062$$
• $$C_A = 0.06$$
• $$\bar x_0= 0$$ - the null hypothesis
• $$\bar x_1 = 0.002$$ - the alternate hypothesis
• $$s = 0.00107$$  - this comes from combining the standard errors of both branches, so $$s^2 = s_A^2 + s_B^2$$, and I'm using the usual formula for the standard error of a proportion, for example, $$s_A = \sqrt{\frac{C_A(1-C_A)}{n} }$$

Plugging them all in, this gives:
$\beta = P[ -3.829 < z < 0.090]$
which gives $$\beta = 0.536$$

# This is too hard

This process is complex and involves lots of steps. In my view, it's too complex. It feels to me that there must be an easier way of constructing tests. Bayesian statistics holds out the hope for a simpler approach, but widespread adoption of Bayesian statistics is probably a generation or two away. We're stuck with an overly complex process using very difficult language.

# Why some projects are harder than others

Over my career, I've had the experience of working on projects that have gone wonderfully well and I've worked on projects that just ran into the sand and went nowhere. I've come to recognize the red flashing warning signs for a certain type of project that's pathologically bad: they tend to be projects involving wicked problems or have the characteristics of wicked problems. Interestingly, I've come across more wicked problems in data science than elsewhere.

(Wicked problems can be real devils to work on - they can damage your career if not handled correctly.  Elcom.stadler, CC BY-SA 4.0, via Wikimedia Commons)

# Wicked problems

The term 'wicked problem' comes from the planning and policy world [Rittel and Webber] and refers to problems that are difficult or impossible to fix inside the current social, political, and economic system. A good example is solving poverty; there are many, many stakeholders, each with fiercely different views, and no clear measure of success (how is poverty measured, is the goal reduction or eliminations, etc.). Poverty is also linked to other factors too, like level of education, health, housing, etc. If you were a politician, do you think you could solve poverty?

(Properties of wicked problems. Image source: Wikimedia Commons, License: Creative Commons, Author: Christian Sarkar)

In the five decades since Rittel and Weber first discussed wicked problems, researchers have identified some of their key characteristics:

• Wicked problems are not fully understood until after the creation of a solution.
• Wicked problems have no stopping rule, there's nothing to tell you that you've reached an optimal solution.
• Solutions to wicked problems are not right or wrong: they are better or worse, or good-enough or not-good-enough.
• Every wicked problem is new: you can't apply prior learning to it.
• Wicked problems have no alternative solutions to choose from.

Rittel and Weber's seminal paper points out a key feature of these types of problems: they're not amenable to traditional project management using a phased approach (usually something like "gather data", "synthesize data", "create plan", "execute on plan", etc.).  This is crucial to understanding why projects solving wicked problems go wrong.

# Wicked problems in software

If you think wicked problems sound a lot like some software development projects, you're not alone. In 1990, DeGrace and Stahl published "Wicked problems, righteous solutions" which laid out the comparison and compared the utility of different software development methodologies to solve wicked problems. To state the obvious, the killers for software project predictability are understanding the problem and applying prior learning.

Readers who know agile software development methods are probably jumping up right now and saying 'that's why agile was developed!' and they're partly right. Agile is a huge improvement on the waterfall approach, but it's not a complete solution. Even with agile, wicked problems can be extremely hard to solve. I've had the experience of working on a project where we found a new critical requirement right towards the end, and no amount of agile would have changed that.

# Wicked problems in data science

Data science has its own wicked problems, which I'll put into two buckets.

The first is the ethical implications of technology. Facial recognition obviously has profound implications for society, but there are well-known issues of racial bias in other data science-based systems too (see for example, Obermeyer). Resolving these issues isn't only a data science problem, in fact, I would say it can't only be a data science problem. This makes these projects wicked in the original sense of the term.

The other bucket is operational. Although some data science problems are well-defined, many are not. In several projects, I've had the experience of finding out something new and fundamental late in the project. To understand the problem, you have to solve it. For example, you may be tasked with reducing the RMSE for a model below some threshold, but as your model becomes more sophisticated, you might find irreducible randomness or as your understanding of the problem increases by solving it, you may find there are key missing features.

Here are some signposts for wicked problems in data science:

• Any algorithm involved in offering goods or services to the public. Racial, gender or other biases may become big issues and these risks are rarely outlined in the project documentation - in fact, they may only be discovered very, very late into the project. Even worse, there's often no resource allocation to manage them.
• No one in your organization has attempted to solve a problem like this before and none of the people on the project have prior experience working on similar projects.
• The underlying problem is not fully understood and/or not fully studied.
• No clear numerical targets for project quality. Good targets might be thresholds for false error rates, RMSE, F1 scores, and so on.

# What's to be done?

Outline the risks and manage them

It's always good practice to have requirements specifications and similar documents. These project documents should lay out project risks and steps to counter them. For example, facial recognition projects might include sections on bias or ethics and the steps necessary to counter them. Managing these risks takes effort, which includes effort spent on looking for risks and estimating their impact.

Expect the unexpected

If wicked problems can't be fully understood until they're solved, this is a huge project risk. If a new requirement is found late in the project, it can add substantial project time. Project plans should allow for finding something new late into the project, in fact, if we're solving a wicked problem, we should expect to find something new late in the project.

Set expectations

All of the stakeholders (technical and non-technical) should know the risks before the project begins and should know the consequences of finding something late in the project. Everyone needs to understand this is a wicked project with all the attendant risks.

Communications

Stakeholders need to know about new issues and project progress. They must understand the project risks.

Overall

If a lot of this sounds like good project management, that's because it is. Data science projects are often riskier than other projects and require more robust project management. A good understanding of the dynamics of wicked problems is a great start.

# Going underground

The London Underground tube map is a master class in information visualization. It's been described in detail in many, many places, so I'm just going to give you a summary of why it's so special and what we can learn from it. Some of the lessons are about good visual design principles, some are about the limitations of design, and some of them are about wealth and poverty and the unintended consequences of abstraction.

(London Underground map.)

# The problem

From its start in 1863, the underground train system in London grew in a haphazard fashion. With different railway companies building different lines there was no sense of creating a coherent system.

Despite the disorder, when it was first built it was viewed as a marvel and had a cultural impact beyond just transport; Conan Doyle wove it into Sherlock Holmes stories, H.G. Wells created science fiction involving it, and Virginia Woolf and others wrote about it too.

After various financial problems, the system was unified under government control. The government authority running it wanted to promote its use to reduce street-level congestion but the problem was, there were many different lines that only served part of the capital. Making it easy to use the system was hard.

Here's an early map of the system so you can see the problem.

(1908 tube map. Image source: Wikimedia Commons.)

The map's hard to read and it's hard to follow. It's visually very cluttered and there are lots of distracting details; it's not clear why some things are marked on the map at all (why is ARMY & NAVY AND AUXILLARY STORES marked so prominently?). The font is hard to read, the text orientation is inconsistent, and the contrast of station names with the background isn't high enough.

The problem gets even worse when you zoom out to look at the entire system. Bear in mind, stations in central London are close together but they get further apart as you go into the suburbs. Here's an early map of the entire system, do you think you could navigate it?

(1931 whole system tube map.)

Of course, the printing technology of the time was more limited than it is now, which made information representation harder.

# Design ideas in culture

To understand how the tube map as we know it was created, we have to understand a little of the design culture of the time (the early 1930s).

Electrical engineering was starting as a discipline and engineers were creating circuit diagrams for new electrical devices. These circuit diagrams showed the connection between electrical components, not how they were laid out on a circuit board. Circuit diagrams are examples of topological maps.

(Example circuit diagram. Show electrical connections between components, not how they're laid out on a circuit board. Image source: Wikimedia Commons, License: Public domain.)

The Bauhaus school in Germany was emphasizing art and design in mass-produced items, bringing high-quality design aesthetics into everyday goods. Ludwig Mies van der Rohe, the last director of the Bauhaus school, used a key aphorism that summarized much of their design philosophy: "less is more".

(Bauhaus kitchen design 1928 - they invented much of the modern design world. Image source: Wikimedia Commons, License: Public domain)

The modern art movement was in full swing, with the principles of abstraction coming very much to the fore. Artists were abstracting from reality in an attempt to represent an underlying truth about their subjects or about the world.

(Piet Mondrian, Composition 10. Image source: Wikimedia Commons, License: Public Domain.)

To put it simply, the early 1930s were a heyday of design that created much of our modern visual design language.

# Harry Beck's solution - form follows function

In 1931, Harry Beck, a draughtsman for London Underground, proposed a new underground map. Beck's map was clearly based on circuit diagrams: it removed unnecessary detail to focus on what was needed. In Beck's view, what was necessary for the tube map was just the stations and the lines, plus a single underlying geographical detail, the river Thames.

Here's his original map. There's a lot here that's very, very different from the early geographical maps.

# The design grammar of the tube map

The modern tube map is a much more complex beast, but it still retains the ideas Harry Beck created. For simplicity, I'm going to use the modern tube map to explain Beck's design innovations. There is one underlying and unifying idea behind everything I'm going to describe: consistency.

Topological not geographical. This is the key abstraction and it was key to the success of Beck's original map. On the ground, tube lines snake around and follow paths determined by geography and the urban landscape. This makes the relationship between tube lines confusing. Beck redrew the tube lines as straight lines without attempting to preserve the geographic relations of tube lines to one another. He made the stations more or less equidistant from each other, whereas, on the ground, the distance between stations varies widely.

The two images below show the tube map and a geographical representation of the same map. Note how the tube map substantially distorts the underlying geography.

(The tube map. Image source: BBC.)

(A geographical view of the same system. Image source: Wikimedia Commons.)

Removal of almost all underlying geographical features. The only geographical feature on tube maps is the river Thames. Some versions of the tube map removed it, but the public wanted it put back in, so it's been a consistent feature for years now.

(The river Thames, in blue, is the only geographic feature on the map.)

A single consistent font.  Station names are written with the same orientation. Using the same font and the same text orientation makes reading the map easier. The tube has its own font, New Johnston, to give a sense of corporate identity.

(Same text orientation, same font everywhere.)

High contrast. This is something that's become easier with modern printing technology and good-quality white paper. But there are problems. The tube uses a system of fare zones which are often added to the map (you can see them in the first two maps in this section, they're the gray and white bands). Although this is important information if you're paying for your tube ticket, it does add visual clutter. Because of the number of stations on the system, many modern maps add a grid so you can locate stations. Gridlines are another cluttering feature.

Consistent symbols. The map uses a small set of symbols consistently. The symbol for a station is a 'tick' (for example, Goodge Street or Russell Square). The symbol for a station that connects two or more lines is a circle (for example, Warren Street or Holborn).

Graphical rules. Angles and curves are consistent throughout the map, with few exceptions - clearly, the map was constructed using a consistent set of layout rules. For example, tube lines are shown as horizontal, vertical, or 45-degree lines in almost all cases.

# The challenge for the future

The demand for mass transit in London has been growing for very many years which means London Underground is likely to have more development over time (new lines, new stations). This poses challenges for map makers.

The latest underground maps are much more complicated than Harry Beck's original. Newer maps incorporate the south London tram system, some overground trains, and of course the new Elizabeth Line. At some point, a system becomes so complex that even an abstract simplification becomes too complex. Perhaps we'll need a map for the map.

# A trap for the unwary

The tube map is topological, not geographical. On the map, tube stations are roughly the same distance apart, something that's very much not the case on the ground.

Let's imagine you had to go from Warren Street to Great Portland Street. How would you do it? Maybe you would get the Victoria Line southbound to Oxford Circus, change to the Bakerloo Line northbound, change again at Baker Street, and get the Circle Line eastbound to Great Portland Street. That's a lot of changes and trains. Why not just walk from Warren Street to Great Portland Street? They're less than 500m apart and you can do the walk in under 5 minutes. The tube map misleads people into doing stuff like this all the time.

Let's imagine it's a lovely spring day and you're traveling to Chesham on the Metropolitan Line. If Great Portland Street and Warren Street are only 482m apart, then it must be a nice walk between Chalfont & Latimer and Chesham, especially as they're out in the leafy suburbs. Is this a good idea? Maybe not. These stations are 6.19km apart.

Abstractions are great, but you need to understand that's exactly what they are and how they can mislead you.

# Using the map to represent data

The tube map is an icon, not just of the tube system, but of London itself. Because of its iconic status, researchers have used it as a vehicle to represent different data about the city.

James Cheshire of University College London mapped life expectancy data to tube stations, the idea being, you can spot health disparities between different parts of the city. He produced a great map you can visit at tubecreature.com. Here's a screenshot of part of his map.

You go from a life expectancy of 78 at Stockwell to 89 at Green Park, but the two stations are just 4 stops apart. His map shows how disparities occur across very short distances.

Mark Green of the University of Sheffield had a similar idea, but this time using a more generic deprivation score. Here's his take on deprivation and the tube map, the bigger circles representing higher deprivation.

Once again, we see the same thing, big differences in deprivation over short distances.

# What the tube map hides

Let me show you a geographical layout of the modern tube system courtesy of Wikimedia. Do you spot what's odd about it?

(Geographical arrangement of tube lines. Image source: Wikimedia Commons, License: Creative Commons.)

Look at the tube system in southeast London. What tube system? There are no tube trains in southeast London. North London has lots of tube trains, southwest London has some, and southeast London has none at all. What part of London do you think is the poorest?

The tube map was never designed to indicate wealth and poverty, but it does that. It clearly shows which parts of London were wealthy enough to warrant underground construction and which were not. Of course, not every area in London has a tube station, even outside the southeast of London. Cricklewood (population 80,000) in northwest London doesn't have a tube station and is nowhere to be seen on the tube map.

The tube map leaves off underserved areas entirely, it's as if southeast London (and Cricklewood and other places) don't exist. An abstraction meant to aid the user makes whole communities invisible.

Now look back at the previous section and the use of the tube map to indicate poverty and inequality in London. If the tube map is an iconic representation of London, what does that say about the areas that aren't even on the map? Perhaps it's a case of 'out of sight, out of mind'.

This is a clear reminder that information design is a deeply human endeavor. A value-neutral expression of information doesn't exist, and maybe we shouldn't expect it to.

# Takeaways for the data scientist

As data scientists, we have to visualize data, not just for our fellow data scientists, but more importantly for the businesses we serve. We have to make it easy to understand and easy to interpret data. The London Underground tube map shows how ideas from outside science (circuit diagrams, Bauhaus, modernism) can help; information representation is, after all, a human endeavor. But the map shows the limits to abstraction and how we can be unintentionally led astray.

The map also shows the hidden effects of wealth inequality and the power of exclusion; what we do does not exist in a cultural vacuum, which is true for both the tube map and the charts we produce too.

## Saturday, February 27, 2021

### Simpson's paradox: a trap for the naive analyst

Let's imagine you're the Chief Revenue Officer at a manufacturing company that sells tubes and cylinders. You're having trouble with European sales reps discounting, so you offer a spif: the country team that sells at the highest price gets a week-long vacation somewhere warm and sunny with free food and drink. The Italian and German sales teams are raring to go.

At the end of the quarter, you have these results [Wang]:

 Product type Cylinder Tube Sales team No sales Average price No sales Average price German 80 €100 20 €70 Italian 20 €120 80 €80

This looks like a clear victory for the Italians! They maintained a higher price for both cylinders and tubes! If they have a higher price for every item, then obviously, they've won. The Italians start packing their swimsuits.

Not so fast, say the Germans, let's look at the overall results.

 Sales team Average price German €94 Italian €88

Despite having a lower selling price for both cylinders and tubes, the Germans have maintained a higher selling price overall!

How did this happen? It's an instance of Simpon's paradox.

# Why the results reversed

Here's how this happened: the Germans sold more of the expensive cylinders and the Italians sold more of the cheaper tubes. The average price is the ratio of the total monetary amount/total sales quantity. To put it very simply, ratios (prices) can behave oddly.

Let's look at a plot of the selling prices for the Germans and Italians.

The blue circles are tubes and the orange circles are cylinders. The size of the circles represents the number of sales. The little red dot in the center of the circles is the price.

Let's look at cylinders. Plainly, the Italians sold them at a higher price, but they're the most expensive item and the Germans sold more of them. Now, let's look at tubes, once again, the Italians sold them at a higher price than the Germans, but they're cheaper than cylinders and the Italians sold more of them.

You can probably see where this is going. Because the Italians sold more of the cheaper items, their average (or pooled) price is dragged down, despite maintaining a higher price on a per-item basis. I've re-drawn the chart, but this time I've added a horizontal black line that represents the average.

The product type (cylinders or tubes) is known in statistics as a confounder because it confounds the results. It's also known as a conditioning variable.

# A disturbing example - does this drug work?

The sales example is simple and you can see the cause of the trouble immediately. Let's look at some data from a (pretend) clinical trial.

Imagine there's some disease that impacts men and women and that some people get better on their own without any treatment at all. Now let's imagine we have a drug that might improve patient outcomes. Here's the data [Lindley].

 Female Male Recovered Not Recovered Rate Recovered Not Recovered Rate Took drug 8 2 80% 12 18 40% Not take drug 21 9 70% 3 7 30%

Wow! The drug gives everyone an added 10% on their recovery rate. Surely we need to prescribe this for everyone? Let's have a look at the overall data.

 Everyone Recovered Not Recovered Rate Took drug 20 20 50% Not take drug 24 16 60%

What this data is saying is, the drug reduces the recovery rate by 10%.

Let me say this again.

• For men, the drug improves recovery by 10%.
• For women, the drug improves recovery by 10%.
• For everyone, the drug reduces recovery by 10%.

If I'm a clinician, and I know you have the disease, if you're a woman, I would recommend you take the drug, if you're a man I would recommend you take the drug, but if I don't know your gender, I would advise you not to take the drug. What!!!!!

This is exactly the same math as the sales example I gave you above. The explanation is the same. The only thing different is the words I'm using and the context.

# Simpson and COVID

In the United States, it's pretty well-established that black and Hispanic people have suffered disproportionately from COVID. Not only is their risk of getting COVID higher, but their health outcomes are worse too. This has been extensively covered in the press and on the TV news.

In the middle of 2020, the CDC published data that showed fatality rates by race/ethnicity. The fatality rate means the fraction of patients with COVID who die. The data showed a clear result: white people had the worst fatality rate of the racial groups they studied.

Doesn't this contradict the press stories?

No.

There are three factors at work:

• The fatality rate increases with age for all ethnic groups. It's much higher for older people (75+) than younger people.
• The white population is older than the black and Hispanic populations.
• Whites have lower fatality rates in almost all age groups.

This is exactly the same as the German and Italian sales team example I started with. As a fraction of their population, there are more old white people than old black and Hispanic people, so the fatality rates for the white population are dominated by the older age group in a way that doesn't happen for blacks and Hispanics.

In this case, the overall numbers are highly misleading and the more meaningful comparison is at the age-group level. Mathematically, we can remove the effect of different demographics to make an apples-to-apples comparison of fatality rates, and that's what the CDC has done.

# In pictures

Wikipedia has a nice article on Simpson's paradox and I particularly like the animation that's used to accompany it, so I'm copying it here.

Each of the dots represents a measurement, for example, it could be price. The colors represent categories, for example, German or Italian sales teams, etc. if we look at the results overall, the trend is negative (shown by the black dots and black line). If we look at the individual categories, the trend is positive (colors). In other words, the aggregation reverses the individual trends.

# The classic example - sex discrimination at Berkeley

The Simpson's paradox example that's nearly always quoted is the Berkeley sex discrimination case [Bickel]. I'm not going to quote it here for two reasons: it's thoroughly discussed elsewhere, and the presentation of the results can be confusing. I've stuck to simpler examples to make my point.

# American politics

A version of Simpson's paradox can occur in American presidential elections, and it very nicely illustrates the cause of the problem.

In 2016, Hilary Clinton won the popular vote by 48.2% to 46.1%, but Donald Trump won the electoral college by 304 to 227. The reason for the reversal is simple, it's the population spread among the states and the relative electoral college votes allocated to the states. As in the case of the rollup with the sales and medical data I showed you earlier, exactly how the data rolls up can reverse the result.

The question, "who won the 2016 presidential election" sounds simple, but it can have several meanings:

• who was elected president
• who got the most votes
• who got the most electoral college votes

The most obvious meaning, in this case, is, "who was elected president". But when you're analyzing data, it's not always obvious what the right question really is.

# The root cause of the problem

The problem occurs because we're using an imprecise language (English) to interpret mathematical results. In the sales and medical data cases, we need to define what we want.

In the sales price example, do we mean the overall price or the price for each category? The contest was ambiguous, but to be fair to our CRO, this wasn't obvious initially. Probably, the fairest result is to take the overall price.

For the medical data case, we're probably better off taking the male and female data separately. A similar argument applies for the COVID example. The clarifying question is, what are you using the statistics for? In the drug data case, we're trying to understand the efficacy of a drug, and plainly, gender is a factor, so we should use the gendered data. In the COVID data case, if we're trying to understand the comparative impact of COVID on different races/ethnicities, we need to remove demographic differences.

If this was the 1980s, we'd be stuck. We can't use statistics alone to tell us what the answer is, we'd have to use data from outside the analysis to help us [Pearl]. But this isn't the 1980s anymore, and there are techniques to show the presence of Simpson's paradox. The answer lies in using something called a directed acyclic graph, usually called a DAG. But DAGs are a complex area and too complex for this blog post that I'm aiming at business people.

# What this means in practice

There's a very old sales joke that says, "we'll lose money on every sale but make it up in volume". It's something sales managers like to quote to their salespeople when they come asking for permission to discount beyond the rules. I laughed along too, but now I'm not so quick to laugh. Simpson's paradox has taught me to think before I speak. Things can get weird.

Interpreting large amounts of data is hard. You need training and practice to get it right and there's a reason why seasoned data scientists are sought after. But even experienced analysts can struggle with issues like Simpson's paradox and multi-comparison problems.

The red alert danger for businesses occurs when people who don't have the training and expertise start to interpret complex data. Let's imagine someone who didn't know about Simpson's paradox had the sales or medical data problem I've described here. Do you think they could reach the 'right' conclusion?

The bottom line is simple: you've got to know what you're doing when it comes to analysis.

# References

[Bickel] Sex Bias in Graduate Admissions: Data from Berkeley, By P. J. Bickel, E. A. Hammel, J. W. O'Connell, Science, 07 Feb 1975: 398-404
[Lindley] Lindley, D. and Novick, M. (1981). The role of exchangeability in inference. The Annals
of Statistics 9 45–58.
[Pearl] Judea Pearl, Comment: Understanding Simpson’s Paradox, The American Statistician, 68(1):8-13, February 2014.
[Wang] Wang B, Wu P, Kwan B, Tu XM, Feng C. Simpson's Paradox: Examples. Shanghai Arch Psychiatry. 2018;30(2):139-143. doi:10.11919/j.issn.1002-0829.218026

# Why aren't 2D plots good enough?

Most data visualization problems involve some form of two-dimensional plotting, for example plotting sales by month. Over the last two hundred years, analysts have developed several different types of 2D plots, including scatter charts, line charts, and bar charts, so we have all the chart types we need for 2D data. But what happens if we have a 3D dataset?

The dataset I'm looking at is English Premier League (EPL) results. I want to know how the full-time scores are distributed, for example, are there more 1-1 results than 2-1 results? I have three numbers, the full-time home goals (FTHG), the full-time away goals (FTAG). and the number of games that had that score. How can I present this 3D data in a meaningful way?

(You can't rely on 3D glasses to visualize 3D data. Image source: Wikimedia Commons, License: Creative Commons, Author: Oliver Olschewski)

# Just the text

The easiest way to view the data is to create a table, so here it is. The columns are the away goals, the rows are the home goals, and the cell values are the number of matches with that result, so 778 is the number of matches with a score of 0-1.

This presentation is easy to do, and relatively easy to interpret. I can see 1-1 is the most popular score, followed by 1-0. You can also see that some scores just don't occur (9-9) and results with more than a handful of goals are very uncommon.

This is OK for a smallish dataset like this, but if there are hundreds of rows and/or columns, it's not really viable. So what can we do?

# Heatmaps

A heatmap is a 2D map where the 3rd dimension is represented as color. The more intense (or lighter) the color, the higher the value. For this kind of plot to work, you do have to be careful about your color map. Usually, it's best to choose the intensity of just one color (e.g. shades of blue). In a few cases, multiple colors can work (colors for political parties), but those are the exceptions.

Here's the same data plotted as a heatmap using the Brewer color palette "RdPu" (red-purple).

The plot does clearly show the structure. It's obvious there's a diagonal line beyond which no results occur. It's also obvious which scores are the most common. On the other hand, it's hard to get a sense of how quickly the frequency falls off because the human eye just isn't that sensitive to variations in color, but we could probably play around with the color scale to make the most important color variation occur over the range we're interested in.

This is an easy plot to make because it's part of R's ggplot package. Here's my code:

plt_goal_heatmap <- goal_distribution %>%
ggplot(aes(FTHG, FTAG, fill=Matches)) +
geom_tile() +
scale_fill_distiller(palette = "RdPu") +
ggtitle("Home/Away goal heatmap")

# Perspective scatter plot

Another alternative is the perspective plot, which in R, you can create using the 'persp' function. This is a surface plot as you can see below.

You can change your perspective on the plot and view it from other angles, but even from this perspective, it's easy to see the very rapid falloff in frequency as the scores increase.

However, I found this plot harder to use than the simple heatmap, and I found changing my viewing angle was awkward and time-consuming.

Here's my code in case it's useful to you:

persp(x = seq(0, max(goal_distribution$FTHG)), y = seq(0, max(goal_distribution$FTAG)),
z = as.matrix(
unname(
goal_distribution, FTAG, Matches, fill=0)[,-1])),
xlab = "FTHG", ylab = "FTAG", zlab = "Matches",
main = "Distribution of matches by score",
theta = 60, phi = 20,
expand = 1,
col = "lightblue")

# 3D scatter plot

We can go one stage further and create a 3D scatter chart. On this chart, I've plotted the x, y, and z values and color-coded them so you get a sense of the magnitude of the z values. I've also connected the points to the axis (the zero plane if you like) to emphasize the data structure a bit more.

As with the persp function,  you can change your perspective on the plot and view it from another angle.

The downside with this approach is it requires the 'plot3D' library in R and it requires you to install a new graphics server (XQuartz). It's a chunk of work to get to a visualization. The function to draw the plot is 'scatter3D'. Here's my code:

scatter3D(x=goal_distribution$FTHG, y=goal_distribution$FTAG,
z=goal_distribution\$Matches,
xlab = "FTHG", ylab = "FTAG", zlab = "Matches",
phi = 5,
theta = 40,
bty = "g",
type = "h",
pch = 19,
main="Distribution of matches by score",
cex = 0.5)

# What's my choice?

My goal was to understand the distribution of goals in the EPL, so what presentations of the data were most useful to me?

The simple table worked well and was the most informative, followed by the heatmap. I found both persp and scatter3D to be awkward to use and both consumed way more time than they were worth. The nice thing about the heatmap is that it's available as part of the wonderful ggplot library.

Bottom line: keep it simple.