Tuesday, July 22, 2025

Visualizing multi-dimensional data: score distributions in English football

Sometimes, visualization is very, very difficult

I want to visualize the frequency of scores in English league football from 1888 to today so I can see if there are any discernable patterns, e.g., is a score of 2-1 becoming more common than a score of 1-2? Trouble is, I’ve run into problems figuring out how to visualize my data in a way that provides insight. This blog post is all about the things I’ve tried and what I've discovered about score distributions. If you’re trying to visualize complex data, this might be helpful.

The dataset is scores and their frequency for English league football games by season and by league, so it looks like this:

Season	League tier	Home goals	Away goals	Frequency
1888-1889	1	0	0	0.015152
1888-1889	1	0	1	0.007576
1888-1889	1	0	2	0.037879
1888-1889	1	0	3	0.015152
1888-1889	1	0	4	0.022727
...
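
As a rough sketch of how a table in this shape could be built, the snippet below assumes a match-level table with one row per match and takes "frequency" to mean a score's share of all matches in that season and tier. The sample values above are all close to multiples of 1/132, which fits that reading, but it is an assumption. The column names and the `score_frequencies` helper are hypothetical, and the match data is made up.

```python
import pandas as pd

# Hypothetical match-level input: one row per match (values are made up).
matches = pd.DataFrame(
    {
        "season": ["1888-1889"] * 4 + ["1889-1890"] * 2,
        "tier": [1] * 6,
        "home_goals": [2, 2, 0, 1, 1, 3],
        "away_goals": [1, 1, 0, 1, 1, 0],
    }
)


def score_frequencies(matches: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (season, tier, score) with its share of matches."""
    keys = ["season", "tier"]
    counts = (
        matches.groupby(keys + ["home_goals", "away_goals"])
        .size()
        .rename("count")
        .reset_index()
    )
    # Divide each score count by the number of matches in that season and tier.
    totals = counts.groupby(keys)["count"].transform("sum")
    counts["frequency"] = counts["count"] / totals
    return counts.drop(columns="count")


freq = score_frequencies(matches)
print(freq)
```

Scores that never occurred in a season simply have no row, which is worth remembering when the table is reshaped for plotting.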

I want to investigate whether there are differences by league and changes over time. The obvious problem is that this data set has five dimensions.

3D bar charts

For one league and one season, I can plot a 3D bar chart. Here are two examples for the top tier in the 1888-1889 (left) and 2024-2025 seasons (right) (in 2025 it was called the Premier League, in 1888 it was the only tier). There’s plainly a big change in score distributions, so how can we investigate it over time?

Although these 3D bar charts have some use, they suffer from a number of problems:

The chart perspective means some scores can be hidden by others; you can change the viewpoint to avoid blockages, but it’s not a good solution.
I can’t show multiple charts on the same diagram because the 3D nature of the plot makes things confusing. Comparing two charts side by side means you can only pick out gross differences.
Finally, there are over 100 seasons for the top tier and there are currently five tiers, which is an awful lot of charts to compare.

Bottom line: 3D bar charts don’t fit the bill.

3D animated bar charts

I can animate the 3D bar chart so I can see changes over time. This doesn’t let me see different leagues on the same chart, but I can show leagues in separate charts. The two charts below show how scoring evolves for the top tier (currently called the Premier League) from 1888, and the lowest tier (currently called the National League) from 1979.

The animations show two different things. For the Premier League, large numbers of goals have become less frequent, and it’s much more common now to see low scoring games, e.g., 0-0 or 2-1, compared to high scoring games, e.g., 5-4, or 6-0. For both leagues, there’s notable variation season to season.

Other than that, it’s slim pickings. Animated 3D charts still suffer from many of the same problems as non-animated 3D charts. Time to try something else.

Heatmaps

Here are two heatmaps for two seasons for the top tier.

I find this easier to analyze than the 3D bar charts because bars aren't blocking the view of other bars. On the downside, it’s hard to resolve fine differences by color and, of course, it’s hard to see multiple leagues on the same chart. There are over 100 seasons’ worth of data, and I don’t want to look at over 100 charts.
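
One way to get from the table to a heatmap is to pivot a single season and tier into a home-goals by away-goals grid. The snippet below is a minimal sketch under assumptions: it uses pandas and matplotlib (the post doesn't say which plotting packages were used), the column names and the `score_grid` helper are hypothetical, and the rows are the five sample rows from the table above.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# The five sample rows from the table above, with hypothetical column names.
freq = pd.DataFrame(
    {
        "season": ["1888-1889"] * 5,
        "tier": [1] * 5,
        "home_goals": [0, 0, 0, 0, 0],
        "away_goals": [0, 1, 2, 3, 4],
        "frequency": [0.015152, 0.007576, 0.037879, 0.015152, 0.022727],
    }
)


def score_grid(freq, season, tier, max_goals):
    """Pivot one season and tier into a (home goals x away goals) grid."""
    one = freq[(freq["season"] == season) & (freq["tier"] == tier)]
    grid = one.pivot(index="home_goals", columns="away_goals", values="frequency")
    goals = range(max_goals + 1)
    # Scores that never occurred have no row in the table, so show them as 0.
    # Scores above max_goals are dropped from the grid.
    return grid.reindex(index=goals, columns=goals).fillna(0.0)


grid = score_grid(freq, "1888-1889", 1, max_goals=4)

fig, ax = plt.subplots()
# origin="lower" puts the 0-0 cell in the bottom left corner.
image = ax.imshow(grid.to_numpy(), origin="lower", cmap="viridis")
ax.set_xlabel("Away goals")
ax.set_ylabel("Home goals")
fig.colorbar(image, ax=ax, label="Frequency")
# Call fig.savefig("heatmap.png") or plt.show() to view the result.
```

Using the same `max_goals` and the same color scale for every season keeps heatmaps comparable from one season to the next.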

Animated heatmaps

To investigate changes over time, I animated the heatmap for the top league and the fifth league.

For my money, this is a better representation than the animated 3D bar chart. It’s clear that scores have become concentrated around 1-1 since 1888, but there are exceptions: big scores still happen.

Of course, animations are nice, but maybe a slider control would be a bit better? I decided to try out a different plotting package to see if a slider makes things easier. On the chart below, click on the slider and move it around to see the changes over time.

The problem with heatmaps is it can be hard to see smallish changes, and although the animation and sliders are helpful, they don't really let you see changes over time very clearly.

Line chart grids

One obvious data representation is to draw a grid, with the axes representing home and away goals. Each box in the grid represents a score. In each box in the grid, put a line chart showing the change in frequency over time for that score. That’s what I’ve done in the chart below.

All of these charts have the same x-axis and y-axis scales: the x-axis is the season from 1888 to the present, and the y-axis is frequency. Some scores don't occur at all in the data set, so their charts are empty. I've labeled each of the charts with the score it represents. Here's a zoom in on the 1-1 score chart.

On this chart, the x-axis is the year and the y-axis is the frequency. To keep clutter down, I haven't labeled the axes.
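
The grid of line charts described above can be sketched with matplotlib's shared-axis subplots. This is an illustration under assumptions, not the code behind the charts in this post: the plotting library, the column names, the `plot_score_grid` helper, and the choice to put home goals on the vertical axis are all assumptions, and the frequencies are made up.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up frequencies for two scores over three seasons.
freq = pd.DataFrame(
    {
        "season": ["1888-1889", "1889-1890", "1890-1891"] * 2,
        "tier": [1] * 6,
        "home_goals": [1, 1, 1, 2, 2, 2],
        "away_goals": [1, 1, 1, 1, 1, 1],
        "frequency": [0.05, 0.07, 0.06, 0.08, 0.09, 0.10],
    }
)


def plot_score_grid(freq, tier, max_goals):
    """Draw one small line chart per score, all sharing the same axes."""
    one = freq[freq["tier"] == tier].copy()
    # Use the season's starting year as a numeric x value.
    one["year"] = one["season"].str[:4].astype(int)
    size = max_goals + 1
    fig, axes = plt.subplots(
        size, size, sharex=True, sharey=True,
        figsize=(2 * size, 2 * size), squeeze=False,
    )
    for home in range(size):
        for away in range(size):
            # Row 0 is the top of the figure, so flip rows to put 0-0 bottom left.
            ax = axes[max_goals - home][away]
            series = one[(one["home_goals"] == home) & (one["away_goals"] == away)]
            series = series.sort_values("year")
            # A score that never occurs leaves its box empty.
            ax.plot(series["year"], series["frequency"])
            ax.set_title(f"{home}-{away}", fontsize=8)
    return fig, axes


fig, axes = plot_score_grid(freq, tier=1, max_goals=2)
# Call fig.savefig("score_grid.png") or plt.show() to view the result.
```

Sharing both axes across every box is what makes the small charts directly comparable, and a smaller `max_goals` gives the bottom-left subset of the grid.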

The big grid chart is very squished-up here on this blog post, but on my big screen, I can make sense of it. I can see some trends:

‘Extreme’ scores are rare and there’s been little change since 1888, but they do occur.
Low scores (e.g., 5 or fewer goals per match) are much more common than extreme scores, and that’s been the way mostly since 1888.
Home advantage exists, but it’s getting smaller.
All of the 'action' is in the bottom left corner.

The obvious downside is how incredibly busy the big chart is. It’s helpful to just look at a subset (the bottom left) as I’ve shown below.

It would be possible to show multiple leagues on this type of plot. Maybe not on the full grid, but certainly on a subset.

Where I’ve ended up

Reviewing all of these charts, here are my takeaways about chart choice for multi-dimensional data.

The best approach is the line chart grid. The downside is that the grid can get huge, but the upside is that you get information on where to focus your analysis. Notably, the line chart grid took less coding effort than any of the other methods I used.
Heatmap animations are helpful, but they’re really more something that looks good rather than gives you a lot of insight. Heatmaps, whether animated or not, are better than 3D bar charts.
3D bar charts look pretty, but they’re not very useful.

And what did I discover about scores? The frequency of different scores hasn’t changed much over time. Low scores are much more frequent than high scores, but high scoring games still occur. In the top tier, there’s a noticeable drop in home wins.

The role of AI code generation

I used AI code generation to help with this investigation. I used Cursor as my editor, and used code completion extensively. Notably, Cursor did really badly with generating animations, so I used Claude to generate example code to get me started. Claude gave me a good starting place, but I had to extensively modify what it gave me. Code generation for some of the more complex Pandas operations wasn’t good in Cursor.

I found getting started with an example to be far more helpful than using documentation or StackOverflow. In fact, a big waste of time for me was trying to get examples sourced from websites to work. Once I told Claude what I wanted, things went much more quickly.

The bottom line is, AI was a very helpful tool, but not magic fairy dust.

Do clubs playing at home win more?

Home field advantage is well-known to fans of different sports and is well-supported by the data. How prevalent is it in English football? That’s what I’m going to explore in this blog post.

A few years ago, I did a similar analysis, but with a much smaller data set (see https://blog.engora.com/2021/01/covid-and-soccer-home-team-advantage.html). This time, I have complete English football league data from 1888 to the 2024-2025 season, so my analysis is going to be much more thorough. Frankly, I was surprised by what I found.

What are the metrics?

The goal metric is simple: it’s the mean, over all the matches in a season, of home goals – away goals. I’ll call it the per-match home goal advantage. A value of 0.25 would mean home clubs score 0.25 more goals per game than the away clubs over the course of a season. If there were no home advantage, we would expect the per-match home goal advantage to be 0.
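
As a small worked example of this metric, here is a sketch in pandas. It assumes a match-level table with one row per match. The column names and the `home_goal_advantage` helper are hypothetical, and the four matches are made up.

```python
import pandas as pd

# Hypothetical match-level data: one row per match (values are made up).
matches = pd.DataFrame(
    {
        "season": ["2023-2024"] * 4,
        "tier": [1] * 4,
        "home_goals": [2, 1, 0, 3],
        "away_goals": [1, 1, 1, 1],
    }
)


def home_goal_advantage(matches: pd.DataFrame) -> pd.Series:
    """Mean of (home goals - away goals) per season and tier."""
    goal_diff = matches["home_goals"] - matches["away_goals"]
    per_match = matches.assign(goal_diff=goal_diff)
    return per_match.groupby(["season", "tier"])["goal_diff"].mean()


advantage = home_goal_advantage(matches)
print(advantage)
```

Here the per-match differences are 1, 0, -1, and 2, so the home goal advantage for this made-up season is 0.5 goals per match.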

What about winning games? Do home teams win more often? I chose a simple metric, which is the fraction of all wins which were home wins. I’ll call this the home win fraction.

\[home\ win\ fraction = \dfrac{count\ of\ home\ wins\ in\ a\ league\ and\ season}{count\ of\ all\ wins\ in\ a\ league\ and\ season}\]

If there were no home field advantage, we would expect this number to be 0.5. If every win was at home, it would be 1 and if every win was away, it would be 0. Of course, there’s an argument that home advantage might turn what would otherwise be a defeat into a draw. That’s true, but it’s a lot harder to quantify, so for now, I’ve gone with the simpler metric.
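
The home win fraction can be sketched the same way. Draws are dropped because the denominator counts wins only. As before, this assumes a match-level table; the column names and the `home_win_fraction` helper are hypothetical, and the matches are made up.

```python
import pandas as pd

# Hypothetical match-level data: one row per match (values are made up).
matches = pd.DataFrame(
    {
        "season": ["2023-2024"] * 4,
        "tier": [1] * 4,
        "home_goals": [2, 1, 0, 3],
        "away_goals": [1, 1, 1, 1],
    }
)


def home_win_fraction(matches: pd.DataFrame) -> pd.Series:
    """Home wins divided by all wins (home plus away) per season and tier."""
    # Draws have no winner, so they are excluded from the count of wins.
    decided = matches[matches["home_goals"] != matches["away_goals"]]
    home_win = decided["home_goals"] > decided["away_goals"]
    # The mean of a boolean column is the fraction of True values.
    return home_win.groupby([decided["season"], decided["tier"]]).mean()


fraction = home_win_fraction(matches)
print(fraction)
```

With two home wins, one away win, and one draw, the home win fraction for this made-up season is 2/3.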

Do home clubs win more than away clubs?

Here’s the home win fraction for every season for every league. The chart is interactive; click on the legend to turn leagues on and off. I have a version of this chart that shows the standard deviation, but it’s not very helpful, so I haven’t shown it in this blog post.

The obvious feature is the downward trend in home advantage across all leagues. I have a lot more to say about that later.

A little less obvious is the COVID dip, shown below for the Premier League (you can see it for other leagues in the chart above by zooming in). What exactly happened during COVID is complex, but here’s a summary:

2019-2020: matches were played behind closed doors (i.e., no fans in the stadium) after 13 March 2020. This is a good way into the season so the effect of no fans on the season will be muted.
2020-2021: most matches played behind closed doors. In essence, the entire season was affected.

In the lower leagues (tiers 3, 4, and 5), the 2019-2020 season stopped on 13th March 2020 and wasn’t resumed; outstanding matches were canceled.

Look at what happened during COVID: the home field advantage completely disappeared. That's so significant, I'm going to repeat it: during the 2020-2021 season, matches were played without spectators and during this season, and only this season, there was no home field advantage.

How big is the home field advantage?

Here’s the home goal advantage for all seasons for all leagues. The chart is interactive; click on the legend to turn leagues on and off.

Once again, the decline in home field advantage is obvious, as is its disappearance during the COVID season. Remember, the COVID season was played without spectators.

In 2025, the home advantage is about 0.25 goals.

What does this mean?

(I've updated this section based on some feedback and the fact I was never happy with the writing.)

There are two very striking facts about the decline in home field advantage:

It's consistent and persistent. The decline is happening at a steady rate and has been going on since WWII.
It's happening in all leagues.

These facts have some implications for causal factors:

It can't be a one-off change like a rule change.
It can't be something that "trickles down" from the top league.
It has to be something that's present over the whole time period.

I’m going to repeat some of the analysis Sean Elvidge has done in his blog post (https://seanelvidge.com/articles/2025/Home_advantage_in_English_football/).

Researchers have theorized that home advantage is caused by a number of factors:

Home supporters cheering the club on.
Referee bias.
No travel for the home club.
Familiar surroundings.
Tactics and training.

COVID knocks out some of these explanations. If travel and familiar surroundings were the causes, they would still be present during COVID and we would see no change in home advantage. The fact that home advantage disappeared suggests that these are not causes.

Travel as an explanation suffers from other problems too. Yes, team travel in 2025 is easier than in 1950, but it's hard to argue it's got easier since 2005 (say). Remember, England (not the UK as a whole, just England), is a small country. One way of testing if travel is an issue would be to see if there's a correlation with distance traveled.

Tactics and training is another problematic explanation. It would require teams to play differently home and away. While I can buy that happens some of the time, I can't buy that it happens all of the time. It's also hard to see how training would provide an advantage to the home team alone. There's a further problem with testing this explanation. I can't think of any easily available data set that would let you do it.

Referee bias is a plausible explanation: it's possible that the home crowd may intimidate referees into a home club bias. This would be testable by examining the difference in home and away red and yellow cards. The problem here is the data set. I have some red and yellow card data, but not a full set. Red cards were introduced in England in 1976 (and went away for a while), but there should be enough data for an analysis (if you can get the data).

Because of COVID, we know supporters have an effect on team performance. Why might it be declining? My favored explanation is away supporters. If more away fans are traveling to matches, then the away team gets a lift too. Fan support has some issues as an explanation though. As I'll show later (in another blog post), the growth in attendance is uneven, but the decrease in home advantage is consistent. There's also very little easily available data on away supporter attendance.

My favorite explanations are referee bias and away supporters, but the lack of data is a real challenge.

Where is home advantage going?

Given current trends, it seems likely that home advantage will continue to dwindle and may effectively disappear in the years to come. It'll be interesting to see where this goes over the next few seasons.

Engora Data Blog