The summary is not the whole picture
If you just use summary statistics to describe your data, you can miss the bigger picture, sometimes literally so. In this blog post, I'm going to show you how relying on summaries alone can lead you catastrophically astray and I'm going to tell you how you can avoid making career-damaging mistakes.
The datasaurus is why you need to visualize your data. Source: Alberto Cairo. Open source.
What are summary statistics?
Summary statistics are parameters like the mean, standard deviation, and correlation coefficient; they summarize the properties of the data and the relationship between variables. For example, if the correlation coefficient, r, is about 0.8 for two data sets x and y, we might think there's a relationship between them, but if it's about 0, we might think there isn't.
The use of summary statistics is widely taught, every textbook emphasizes them, and almost everyone uses them. But if you use summary statistics in isolation from other methods you might miss important relationships - you should always visualize your data as we'll see.
Take a look at the four plots below. They're obviously quite different, but they all have the same summary statistics!
Here are the summary statistics data:
|Mean of x||9|
|Sample variance of x :||11|
|Mean of y||7.50|
|Sample variance of y :||4.125|
|Correlation between x and y||0.816|
|Linear regression line||y = 3.00 + 0.500x|
|Coefficient of determination of the linear regression :||0.67|
These plots were developed in 1973 by the statistician Francis Anscombe to make exactly this point: you can't rely on summary statistics, you need to visualize your data. The graphical relationship between the x and y variables is different in each case and implies different things. By plotting the data out, we can see what the relationships are, but summary statistics hide what's going on.
Let's zoom forward to 2016. The justly famous Alberto Cairo tweeted about Anscombe's quartet and illustrated the point with this cool set of summary statistics. He later expanded on his tweet in a short blog post.
|x standard deviation||16.7651|
|y standard deviation||26.9353|
What might you conclude from these summary statistics? I might say, the correlation coefficient is close to zero so there's not much of a relationship between the x and the y variables. I might conclude there's no interesting relationship between the x and y variables - but I would be wrong.
The summary might not mean anything to you, but the visualization surely will. This is the datasaurus data set, the x and the y variables draw out a dinosaur.
The datasaurus dozen
Two researchers at Autodesk Research took things a stage further. They started with Alberto Cairo's datasaurus and created a dozen other charts with exactly the same summary statistics as the datasaurus. Here they all are.
The summary statistics look like noise, but the charts reveal the underlying relationships between the x and y variables. Some of these relationships are obviously fun, like the star, but there are others that imply more meaningful relationships.
If all this sounds a bit abstract, let's think about how this might manifest itself in business. Let's imagine you're an analyst working for a large company. You have data on sales by store size for Europe and you've been asked to analyze the data to gain insights. You're under time pressure, so you fire up a Python notebook and get some quick summary statistics. You get summary statistics that look like the ones I showed you above. So you conclude there's nothing interesting in the data, but you might be very wrong.
You should plot the data out and look at the chart. You might see something that looks like the slanting charts above, maybe something like this:
the individual diagonal lines might correspond to different European countries (different regulations, different planning rules, different competition, etc.). There could be a very significant relationship that you would have missed by relying on summary data.
(The Autodesk Research team have posted their work as a paper you can read here.)
The lessons you should take away from all this are simple:
- summary statistics hide a lot
- there are many relationships between variables that will give summary statistics that look like noise
- always visualize your data!