The summary is not the whole picture
If you describe your data with summary statistics alone, you can miss the bigger picture, sometimes literally. In this blog post, I'm going to show you how relying on summaries alone can lead you catastrophically astray, and I'm going to tell you how you can avoid making career-damaging mistakes.
The datasaurus is why you need to visualize your data. Source: Alberto Cairo. Open source.
What are summary statistics?
Summary statistics are quantities like the mean, standard deviation, and correlation coefficient; they summarize the properties of the data and the relationships between variables. For example, if the correlation coefficient, r, is about 0.8 for two variables x and y, we might think there's a relationship between them, but if it's about 0, we might think there isn't.
The use of summary statistics is widely taught, textbooks emphasize them, and almost everyone uses them. But if you use summary statistics in isolation from other methods, you might miss important relationships. As we'll see, you should always visualize your data.
Take a look at the four plots below. They're obviously quite different, but they all have the same summary statistics!
Here are the summary statistics:
| Property | Value |
|---|---|
| Mean of x | 9 |
| Sample variance of x | 11 |
| Mean of y | 7.50 |
| Sample variance of y | 4.125 |
| Correlation between x and y | 0.816 |
| Linear regression line | y = 3.00 + 0.500x |
| Coefficient of determination of the linear regression | 0.67 |
These plots were developed in 1973 by the statistician Francis Anscombe to make exactly this point: you can't rely on summary statistics, you need to visualize your data. The graphical relationships between the x and y variables are different in each case and imply different things. By plotting the data, we can see what the relationships are, but the summary statistics hide what's going on.
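To make this concrete, here is a minimal sketch that recomputes the table's statistics for each of the four data sets using only the Python standard library. The data values are the commonly published ones for Anscombe's quartet, transcribed here rather than taken from this post, and the names `quartet` and `summarize` are my own for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet (commonly published values; sets I-III share x).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Compute the summary statistics listed in the table for one data set."""
    mx, my = mean(x), mean(y)
    var_x, var_y = variance(x), variance(y)  # sample variances (n - 1)
    # Sample covariance, using the same n - 1 denominator.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    r = cov / sqrt(var_x * var_y)            # Pearson correlation
    slope = cov / var_x                      # least-squares slope
    intercept = my - slope * mx              # least-squares intercept
    return {
        "mean_x": mx, "var_x": var_x,
        "mean_y": my, "var_y": var_y,
        "r": r, "slope": slope, "intercept": intercept,
        "r_squared": r * r,  # equals R^2 for simple linear regression
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The four printed rows agree with each other, and with the table, to about two decimal places, even though the plots look nothing alike.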
Let's zoom forward to 2016. The justly famous Alberto Cairo tweeted about Anscombe's quartet and illustrated the point with this cool set of summary statistics. He later expanded on his tweet in a short blog post.
| Property | Value |
|---|---|
| x standard deviation | 16.7651 |
| y standard deviation | 26.9353 |
What might you conclude from these summary statistics? I might say that the correlation coefficient is close to zero, so there's no interesting relationship between the x and y variables. But I would be wrong.
The summary might not mean anything to you, but the visualization surely will. This is the datasaurus data set: the x and y variables draw out a dinosaur.
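You can see the same effect with a much simpler shape. The sketch below is not the datasaurus data; it uses synthetic points placed evenly around a circle (my own choice for illustration, as are the helper names `pearson_r` and `text_scatter`). The correlation coefficient comes out at essentially zero, yet even a crude text plot makes the structure obvious.

```python
from math import cos, pi, sin, sqrt
from statistics import mean

# Synthetic data (an assumption for illustration): n points on a unit circle.
n = 60
angles = [2 * pi * k / n for k in range(n)]
xs = [cos(t) for t in angles]
ys = [sin(t) for t in angles]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def text_scatter(x, y, width=41, height=21):
    """Render a crude scatter plot as text, one '*' per occupied cell."""
    lo_x, hi_x = min(x), max(x)
    lo_y, hi_y = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for a, b in zip(x, y):
        col = round((a - lo_x) / (hi_x - lo_x) * (width - 1))
        row = round((hi_y - b) / (hi_y - lo_y) * (height - 1))  # top = max y
        grid[row][col] = "*"
    return "\n".join("".join(line).rstrip() for line in grid)


r = pearson_r(xs, ys)
print(f"correlation: {r:.3f}")
print(text_scatter(xs, ys))
```

A correlation near zero only rules out a linear relationship. Here y is almost completely determined by x (up to a sign), and the summary statistic says nothing about it.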
The datasaurus dozen
Two researchers at Autodesk Research took things a stage further. They started with Alberto Cairo's datasaurus and created a dozen other charts with exactly the same summary statistics as the datasaurus. Here they all are.
- summary statistics hide a lot
- there are many relationships between variables that will give summary statistics that look like noise
- always visualize your data!