Thursday, January 16, 2020

Correlation does not imply causation

Correlation is not causation

Because they’ve misunderstood one of the main rules of statistical evidence, I’ve seen people make serious business mistakes and damage their careers. The rule is a simple, but subtle one: correlation is not causation. I’m going to explain what this means and show you cases where it’s obviously true, and some cases where it’s less obvious. Let’s start with some definitions.

Clearly, causation means one thing causes another. For example, prolonged exposure to ultraviolet light causes sunburn, the Vibrio cholerae bacteria causes cholera, and recessions cause bankruptcies.

What is correlation?

Correlation occurs when two things vary in the same way. For example, lung cancer rates vary with the level of smoking, commuting times vary with the state of the economy, and health and longevity are correlated with income and wealth. The relationship usually becomes clear when we plot the data out, but it’s very rarely perfect. To give you a sense of what I mean, I’ve taken the relationship between brain mass and body mass in mammals and plotted the data below, each dot is a different type of mammal [Rogel-Salazar].

The straight line on the chart is a fit to the data. As you can see, there’s a relationship between brain and body mass but the dots are spread.

We measure how well two things are correlated with something called the correlation coefficient, r. The closer r is to 1 (or -1), the better the correlation (this is a gross simplification). I typically look for r to be 0.8 (or < -0.8) or better. For the brain and body data above, r is 0.89, so the correlation is ‘good’.

For causation to exist, to say that A causes B, we must be able to observe the correlation between A and B. If sunscreen is effective at reducing sunburn we should observe increased sunscreen use leading to reduced sunburn. However, we need more than correlation to prove causation (I’m skipping over details to keep it simple).

Correlations does not imply causation

Here’s the important bit: correlation does not imply causation. Just because two things are correlated does not imply that one causes the other. Two things could be very well correlated and there could be no causal relationship between them at all. There could be a confounding factor that causes both variables to move in the same way. In my view, misunderstanding this is the single biggest problem in data analysis.

The excellent website Spurious Correlations shows the problem in a fun way, I’ve adapted an example from the website to illustrate my point. Here are two variables I've shown varying with time.

(Image credit: Spurious Correlations)

Imagine one of the variables was sales revenue and the other was the number of hours of sales effort. The correlation between them is very high (r=0.998). Would you say the amount of sales effort causes the sales revenue? If sales revenue was important to you, would you invest in more sales hours? If I presented this evidence to you in an executive meeting, what would you say?

Actually, I lied to you. The red line is US spending on science, space, and technology and the black line is suicides by hanging, strangulation, and suffocation. How can these things be related to each other? Because there’s some other variable or variables both of them depend on, or frankly, just by chance. Think for a minute what happens as an economy grows, all kinds of expenditure goes up; sales of expensive wine go up, and people spend more on their houses. Does that mean sales of expensive wine cause people to spend more on houses?

(On the spurious correlations website there are a whole bunch of other examples, including: divorce rates in Maine correlated with per capita consumption of margarine, total revenue generated by arcades is correlated with the age of Miss America, and letters in the winning word of the Scripps National Spelling Bee are correlated with number of people killed by venomous spiders.)

The chart below shows the relationship between stork pairs and human births for several European locations 1980-1990 [Matthews]. Note r is high at 0.85.

Is this evidence that storks deliver babies? No. Remember correlation is not causation. There could well be many confounding variables here, for example, economic growth leading to more leisure time. Just because we don’t know what the confounding factors are doesn’t mean they don’t exist.

My other (possibly apocryphal) example concerns lice. In Europe in the middle ages, lice were considered beneficial (especially for children) because sick people didn’t have as many lice [Zinsser]. Technically, this type of causation mistake is known as the post hoc ergo propter hoc fallacy if you want to look it up.

Correlation/causation offenders

The causation/correlation problem often rears its ugly head in sales and marketing. Here are two examples I’ve seen, with the details disguised to protect the guilty.

I’ve seen a business analyst present the results of detailed sales data modeling and make recommendations for change based on the correlation/causation confusion. The sales data set was huge and they’d found a large number of correlations in the data (with good r values). They concluded that these correlations were causation, for example, in area X sales scaled with the number of sales reps and they concluded that more reps = more sales. They made a series of recommendations based on their findings. Unfortunately, most of the relationships they found were spurious and most of their recommendations and forecasts were later found to be wrong. The problem was, there were other factors at play that they hadn’t accounted for. It doesn’t matter how complicated the model or how many hours someone has put in, the same rule applies; correlation does not imply causation.

The biggest career blunder I saw was a marketing person claiming that visits to the company website were driving all company revenue, I remember them talking about the correlation and making the causation claim to get more resources for their group. Unfortunately, later on, revenue went down for reasons (genuinely) unrelated to the website. The website wasn’t driving all revenue - it was just one of a number of factors, including the economy and the product. However, their claim to be driving all revenue wasn’t forgotten by the executive team and the marketing person paid the career price.

Here’s what I think you should take away from all this. Just because two things appear to be correlated doesn’t mean there’s causation. In business, we have to make decisions on the basis of limited evidence and that’s OK. What’s not OK is to believe there’s evidence when there isn’t - specifically to infer causation from correlation. Statistics and experience teach us humility. The UK Highway Code has some good advice here, a green light doesn’t mean go, it means ‘proceed with caution'.

References

[Matthews] ‘Storks Deliver Babies (p=0.008)’, Robert Matthews, Teaching Statistics. Volume 22, Number 2, Summer 2000
[Rogel-Salazar] Rogel-Salazar, Jesus (2015): Mammals Dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.1565651.v1
[Zinsser] ‘Rats, lice, and history’, Hans Zinsser, Transaction Publishers, London, 2008

Saturday, January 11, 2020

Choropleth maps - pretty but misleading

Pretty, but misleading

You see choropleth maps everywhere: on websites, on the TV news, and in applications. They’re very pretty, they appeal to our sense of geography, but they can be horribly misleading. I’m going to show you why that’s the case and show you ways designers have sought to get around their problems.

What's a choropleth map?

A choropleth map is a geographic map with regions colored according to some criteria. A great example is election maps used in US Presidential elections. Each of the states is colored according to the party that won the state. Here’s an example result from a US Presidential election. Can you see what the problem is?

(Image Credit: adapted from Wikipedia.)

Looking at the map, who do you think won the election (I’m deliberately not telling you which election)? Do you think this election was a close one?

The trouble is, the US population density varies considerably from state to state, as does the number of Electoral College votes. In 2020, Rhode Island will have 4 Electoral College votes compared to Montana’s 3, but Montana is 120 times larger on the map. If you just glance at most US Presidential election choropleth maps, it looks like the Republican candidate won, even when he didn’t. The reason is the geographically large rural states are mostly Republican but have few Electoral College votes because their populations are relatively low. So the election choropleth map looks mostly red. Our natural tendency is to assume more ink = more important, but the choropleth map breaks this relationship giving a misleading representation.

By the way, the map I showed you is from the 1976 election, in which Jimmy Carter won 267 Electoral College votes to Gerard Ford's 240 (24 states to 27). Did you get who won from the map? Did you get the size of the victory?

Let’s take another example, the 2014 Scottish Independence referendum. Here’s the result by council area (local government area), pink for remain, green for independence. In this case, the remainers won, but what do you think the margin was? Was it close?

(Image credit: adapted from Wikipedia entry: https://commons.wikimedia.org/wiki/File:Scottish_independence_referendum_results.svg)

Despite the choropleth map’s overwhelming remain coloring, the actual result was 55%-45%. It looks like an overwhelming remain victory because the Scottish population is concentrated in a few areas. As in the US, there are large rural areas with few people that take up large amounts of chart space, exaggerating their level of importance.

Cartograms

Can we somehow represent data using some kind of map in a more proportional way? Cartograms distort the underlying geography to better represent some underlying variable. There’s a great example on the Geographical.co.uk website for the UK 2019 general election.

(Image credit: geographical.co.uk - taken from the article: http://geographical.co.uk/geopolitics/geopolitics/item/3557-ukge2019)

The chart on the left is a straight-up choropleth map of the UK by parliamentary constituency - note that constituencies have roughly equal populations but can have very different geographical areas. Blue is a Conservative parliamentary seat, yellow is SNP, orange is Liberal Democrat and red is Labour. The result was a decisive Conservative victory, but the choropleth map makes it look like a wipeout, which it wasn't. The middle chart makes each parliamentary constituency the same size on the page (actually hexagons for reasons I won't go into). Rural areas are shrunk and urban areas (especially London) are greatly expanded. The cartogram shows a much fairer, and in my view, more reasonable representation of the result. The chart on the right is a 'gridded population cartogram' that sizes each constituency by the number of people living in it. There are several of these maps on the web, but frankly, I've never been able to make much sense of them.

Cartograms in the house of mirrors

Mark Newman at the University of Michigan has a site presenting the 2016 US Presidential election as cartograms which is worth a look. Here's his map scaling the states to their Electoral College votes. Because the proportion of red and blue ink follows the Electoral College votes, the cartogram gives a fairer representation of the result (I find this representation easier to understand than the 2019 UK election result in the third chart above.).

(Image credit: Mark Newman, University of Michigan.)

An alternative approach is to use hexagons to represent the result:

(Image credit: ESRI UK & Ireland: https://communityhub.esriuk.com/geoxchange/2016/11/1/us-election-2016-battle-of-the-maps)

The hexagon representation has become much more popular in recent years, leading to designers calling this kind of chart a hexagram. As you might have guessed, on the whole, I prefer this type of cartogram.

All this is great in theory, but in practice there are problems. Hexagrams are great, but they're still unfamiliar to many users and might require explanation. They can also distort geography, reducing the display's usefulness, for example, can you easily identify Utah on the 2016 US Presidential hexagram above? Most packages and modules that display data don't yet come with out-of-the-box cartograms, meaning developers have to create something from scratch, which takes more effort.

What I do

Here's my approach: use choropleth maps for planning and hexagrams or other charts to represent quantitative results. Planning usually involves some sense of geography, for example, territory allocation in sales, in this case, a choropleth map can be useful because of its close ties to the underlying geography. To represent quantitative information (like election results), I prefer bar charts or other traditional charts. If you want something that's more geographical, I recommend some form of hexagram, but with the warning that you might have to build it yourself which can be very time-consuming.

Finding out more

Danny Dorling has written extensively about cartograms and I recommend his website: http://www.dannydorling.org/
The WorldMapper website presents lots of examples of cartograms using social and political data: https://worldmapper.org/