Monday, September 14, 2020

The datasaurus: always visualize your data

The summary is not the whole picture

If you describe your data using only summary statistics, you can miss the bigger picture, sometimes literally. In this blog post, I'm going to show you how relying on summaries alone can lead you badly astray, and I'm going to tell you how you can avoid making career-damaging mistakes.

The datasaurus is why you need to visualize your data. Source: Alberto Cairo. Open source.

What are summary statistics?

Summary statistics are quantities like the mean, standard deviation, and correlation coefficient; they summarize the properties of the data and the relationships between variables. For example, if the correlation coefficient, r, is about 0.8 for two variables x and y, we might think there's a relationship between them, but if it's about 0, we might think there isn't.
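For reference, the correlation coefficient mentioned above is usually the Pearson sample correlation. The standard textbook definition is given here, where $\bar{x}$ and $\bar{y}$ are the sample means and $n$ is the number of points (this notation is added for illustration and is not taken from the sources discussed in this post):

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}$$

Note that r only measures how close the points are to a straight line, which is why it can be near zero even when x and y are strongly related in some other way.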

The use of summary statistics is widely taught, every textbook emphasizes them, and almost everyone uses them. But if you use summary statistics in isolation from other methods, you might miss important relationships. You should always visualize your data, as we'll see.

Anscombe's Quartet

Take a look at the four plots below. They're obviously quite different, but they all have the same summary statistics!

Here are the summary statistics:

Property	Value
Mean of x	9
Sample variance of x : $s_{x}^{2}$	11
Mean of y	7.50
Sample variance of y : $s_{y}^{2}$	4.125
Correlation between x and y	0.816
Linear regression line	y = 3.00 + 0.500x
Coefficient of determination of the linear regression : $R^{2}$	0.67

These plots were developed in 1973 by the statistician Francis Anscombe to make exactly this point: you can't rely on summary statistics, you need to visualize your data. The graphical relationship between the x and y variables is different in each case and implies different things. By plotting the data out, we can see what the relationships are, but summary statistics hide what's going on.
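The values in the table can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the numbers are the published quartet values from Anscombe's 1973 paper, while the names `QUARTET` and `summarize` are my own choices for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets I, II and III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics from the table for one (x, y) data set."""
    mean_x, mean_y = mean(x), mean(y)
    var_x, var_y = variance(x), variance(y)  # sample variances (n - 1 denominator)
    # Sample covariance, using the same n - 1 denominator as the variances.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    r = cov / sqrt(var_x * var_y)            # Pearson correlation
    slope = cov / var_x                      # least-squares regression slope
    intercept = mean_y - slope * mean_x      # least-squares regression intercept
    return {
        "mean_x": mean_x, "var_x": var_x,
        "mean_y": mean_y, "var_y": var_y,
        "r": r, "slope": slope, "intercept": intercept, "r_squared": r ** 2,
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows agree with each other, and with the table, to about two decimal places, even though the four scatter plots look nothing alike.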

The datasaurus

Let's zoom forward to 2016. The justly famous Alberto Cairo tweeted about Anscombe's quartet and illustrated the point with this cool set of summary statistics. He later expanded on his tweet in a short blog post.

Property	Value
n	142
x mean	54.2633
x standard deviation	16.7651
y mean	47.8323
y standard deviation	26.9353
Pearson correlation	-0.0645

What might you conclude from these summary statistics? The correlation coefficient is close to zero, so I might conclude there's no interesting relationship between the x and y variables, but I would be wrong.

The summary might not mean anything to you, but the visualization surely will. This is the datasaurus data set: the x and y variables draw out a dinosaur.
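You can reproduce the same effect with a much simpler shape than a dinosaur. The sketch below is not the datasaurus data; it's a synthetic example I've made up, placing 142 points evenly around a circle (the center and radius are arbitrary). The x and y values are tied together by an exact equation, yet the Pearson correlation comes out at essentially zero.

```python
from math import cos, pi, sin, sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Synthetic data (an assumption for illustration, not the datasaurus itself):
# 142 points evenly spaced on a circle of radius 30 centered at (50, 50).
n = 142
angles = [2 * pi * i / n for i in range(n)]
circle_x = [50 + 30 * cos(t) for t in angles]
circle_y = [50 + 30 * sin(t) for t in angles]

r = pearson_r(circle_x, circle_y)
print(f"Pearson r for points on a circle: {r:.4f}")
```

A correlation near zero tells you there is no straight-line relationship. It doesn't tell you there is no relationship.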

The datasaurus dozen

Two researchers at Autodesk Research took things a stage further. They started with Alberto Cairo's datasaurus and created a dozen other charts with the same summary statistics as the datasaurus. Here they all are.

The summary statistics look like noise, but the charts reveal the underlying relationships between the x and y variables. Some of these relationships are obviously fun, like the star, but there are others that imply more meaningful relationships.

If all this sounds a bit abstract, let's think about how this might manifest itself in business. Let's imagine you're an analyst working for a large company. You have data on sales by store size for Europe and you've been asked to analyze the data to gain insights. You're under time pressure, so you fire up a Python notebook and get some quick summary statistics. You get summary statistics that look like the ones I showed you above. So you conclude there's nothing interesting in the data, but you might be very wrong.

You should plot the data out and look at the chart. You might see something that looks like the slanting charts above, maybe something like this:

The individual diagonal lines might correspond to different European countries (different regulations, different planning rules, different competition, etc.). There could be a very significant relationship that you would have missed by relying on summary statistics.
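Here's a rough sketch of how that could happen. All the data is made up: five hypothetical countries labeled A to E, ten stores each, with sales rising perfectly with store size inside every country. The per-country offset of -0.825 is deliberately chosen so that the between-country trend cancels the within-country trend when the data is pooled.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


COUNTRIES = ["A", "B", "C", "D", "E"]  # hypothetical country labels

# Build synthetic (country, store_size, sales) rows in arbitrary units.
rows = []
for k, country in enumerate(COUNTRIES):
    for i in range(10):
        store_size = 10 * k + i          # each country covers its own size range
        sales = 10 + 2 * i - 0.825 * k   # rises with size inside a country
        rows.append((country, store_size, sales))

# Correlation over all stores, ignoring country.
pooled_r = pearson_r([size for _, size, _ in rows], [sales for _, _, sales in rows])

# Correlation within each country separately.
per_country_r = {}
for country in COUNTRIES:
    sizes = [size for c, size, _ in rows if c == country]
    sales = [s for c, _, s in rows if c == country]
    per_country_r[country] = pearson_r(sizes, sales)

print(f"pooled r: {pooled_r:.3f}")
for country, r in per_country_r.items():
    print(f"country {country}: r = {r:.3f}")
```

The pooled correlation looks like noise, while every country on its own shows a perfect straight-line relationship. A scatter plot colored by country would show this immediately; the pooled summary statistic never will.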

(The Autodesk Research team have posted their work as a paper you can read here.)

Lessons learned

The lessons you should take away from all this are simple:

summary statistics hide a lot
there are many relationships between variables that will give summary statistics that look like noise
always visualize your data!

Engora Data Blog