Thursday, July 31, 2025

Attendance at English football: a tale of tragedy and recovery

Sport and on-going tragedy

Analyzing sports data is normally a harmless activity, but sometimes it takes you to much darker places, and analyzing attendance in English football does so very quickly.

This blog post is about trends in attendance in English football and what you can tell from the data. For this analysis, I’m going to flip the script: I’ll talk about causes first and then show you the data.

(Steenbergs from Ripon, United Kingdom, CC BY 2.0, via Wikimedia Commons. Newcastle United vs. Chelsea, 2010-11-28. Note everyone seated and almost all seats filled.)

Hooliganism, racism, antisemitism, and tragedy

Football in the UK has suffered from hooliganism almost since the beginning of the professional game. There are reports of riots dating back to 1909 and vandalism from 1934. After the Second World War, hooliganism arrived in earnest, and by the 1980s the English game was in deep trouble. Notoriously, violence followed English teams abroad. Innocent people were caught up in the mayhem and were killed.

The UK government acted to crack down on offenders and the football authorities worked to make the game safer. Things improved slowly through the 1990s: the game’s battered reputation gradually recovered and matches became safer to attend.

By the 1970s, black players had started to appear in English football teams, and so had racism. Some spectators threw banana peels on the pitch and made monkey noises when a black player got the ball. England ‘fans’ abused black English players at international matches. Perhaps unsurprisingly, hooligans had links to far-right groups and were extremely racist.

The football governing bodies cracked down hard on racism and banned people for life from all stadiums, but of course, it’s still present.

As you might expect, antisemitism is also a problem. Tottenham Hotspur has a long and well-known connection with Jewish communities in London. So, opposing fans would regularly shout antisemitic abuse.

Sadly, to complete the picture, I need to point out some significant English horrors, only one of which (Heysel Stadium) was related to hooliganism.

  • 1946 Burnden Park, Bolton Wanderers vs. Stoke City. 33 fans killed by crush injuries.
  • 1985 Valley Parade, Bradford. Bradford City vs. Lincoln City. 56 spectators killed by fire.
  • 1985 Heysel Stadium, Brussels. Juventus vs. Liverpool. 39 fans killed by a collapsing wall. 
  • 1989 Hillsborough, Sheffield. Liverpool vs. Nottingham Forest. 97 fans killed by crush injuries. 
The Hillsborough disaster in particular triggered a series of wide-ranging changes, for example, the introduction of all-seater stadiums.

Given all this, would you have taken your children to a football match in the late 1980s or early 1990s?


To sum it all up, the 1980s were the nadir of English football. Things have got a lot better since then, but there’s still work to be done.

Now, let’s look at the data.

Attendance numbers

The chart below shows total attendance by league for each year since the start of the league system in 1888. Total attendance is the sum of the attendance for each match held that season. The data is for English league matches only. The salmon-colored bands are World War I and World War II.
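The aggregation behind the chart is simple; here’s a rough sketch in pandas, with the column names `season`, `tier`, and `attendance` as my assumptions standing in for the real schema (the figures are made up for illustration):

```python
import pandas as pd

# Hypothetical per-match records; the real data set's schema may differ.
matches = pd.DataFrame({
    "season":     ["1988-89", "1988-89", "1988-89", "1989-90"],
    "tier":       [1, 1, 2, 1],
    "attendance": [25000, 31000, 8000, 27000],
})

# Total attendance = sum of per-match attendance for each season and tier.
totals = matches.groupby(["season", "tier"])["attendance"].sum().reset_index()
print(totals)
```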

The chart is interactive: you can click on the legend to turn the leagues on and off, and you can use the toolbar on the right to zoom in and move around the data.

The chart shows the growth in attendance up to the immediate post-war period, followed by a decline; the nadir was 1989. I’ve explained above what was going on post-war, and given those issues, it’s no surprise attendance fell off.

The post-1989 recovery is probably due to a number of factors. The authorities have acted decisively to stamp out hooliganism, racism, and antisemitism. In the wake of the Hillsborough disaster, owners have invested in new stadiums that offer fans a much more pleasant experience. Fan culture has changed too, with more families attending matches and clubs actively trying to attract them. Notably, these changes are at all levels of the game.

COVID

COVID impacted the 2019-2020 season: part of the season was played behind closed doors, and the lower leagues (tiers 3-5) cancelled their remaining games. The 2020-2021 season, however, was played almost entirely without spectators. This is very clear in the attendance figures on the chart.

Home advantage

In a previous blog post, I used the chart below, which shows the decline in home advantage (again, it’s interactive). One of my favorite explanations for home advantage is the effect of spectators. The key insight was that during COVID, home advantage disappeared along with the spectators.

Sadly, there’s a problem. Compare the shape of this graph to the attendance graph above. The decline in home advantage has been steady since the Second World War, but the attendance figures have not followed the same path. If fans make a difference, we might expect more fans = more difference, but that doesn’t seem to be the case. Whatever the relationship between attendance and home advantage, it’s more subtle than just numbers.

Attendance distributions

The total numbers tell a story, but not the complete story. In any given season and league, there's a distribution of attendance, with some matches well-attended, while others are not. The change in distribution over time can tell us some very useful things.

Violin plots are very helpful for visualizing distributions. In previous blog posts, I've talked about them in some depth, but for now, all you need to know is that they represent the distribution of the underlying data.

The charts below show violin (distribution) plots for the top four leagues. You can move the slider to see different years. The x-axis is attendance and you should note two things:

  • The x-axis range is different for each league.
  • The x-axis range changes year to year.

Slide the slider back through time and watch the shape of the distributions change. Compare the top tier (1) to the other tiers.
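Under the hood, a violin plot is essentially a mirrored kernel density estimate (KDE). A minimal sketch of a Gaussian KDE on synthetic attendance data (the numbers and bandwidth are illustrative assumptions, not the blog's real data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic attendance samples (illustrative only).
attendance = rng.normal(30000, 8000, 500).clip(min=0)

def gaussian_kde(samples, grid, bandwidth):
    # Density at each grid point: the average of Gaussian bumps
    # centred on the samples, normalized by the bandwidth.
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * diffs**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(0, 60000, 200)
density = gaussian_kde(attendance, grid, bandwidth=3000)

# The violin is this density curve mirrored about its axis;
# its widest point sits at the densest attendance value.
peak = grid[np.argmax(density)]
print(round(peak))
```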

There were no games during World Wars I and II, and the 2020-2021 season was largely played behind closed doors because of COVID. The 2019-2020 season was also affected by COVID, but the story here is more subtle. Partway through the season, tiers 1 and 2 continued playing behind closed doors, while tiers 3 and 4 stopped and played no more games. As a result, tiers 1 and 2 played matches with no spectators, while tiers 3 and 4 did not. This shows up strongly in the data: you can see a significant cluster at zero attendance for tiers 1 and 2 but not for tiers 3 and 4.

Strikingly, prior to about 1993, the distributions for all leagues are approximately unimodal with a fat tail. That's still mostly the case for the lower leagues, but not for the top tier, the Premier League, whose distribution is now bimodal. To explain this, we need to know about stadium capacity and how full stadiums are. I'll call this the sold-out fraction: a sold-out fraction of 100% means the stadium is full to capacity, and 0% means it's completely empty.

Let's look at capacity first. The charts below show the capacity of the stadiums for the top four tiers. Note the Premier League has 'groupings' around 60,000 and 30,000. The Championship (tier 2) has a more linear distribution. Tiers 3 (League One) and 4 (League Two) also show groupings. The stadium size grouping is clearly visible in the Premier League attendance violin charts: it's what produces the bimodal distribution. But we don't see the stadium distribution for League One and League Two. Why?

The answer lies in the sold-out fraction numbers. In the table below, I show the sold-out fraction by league-tier for 2024-2025.

League name      League tier   Sold-out fraction 2024-2025
Premier League   1             98.9%
Championship     2             81.4%
League One       3             68.1%
League Two       4             56.5%

At 98.9% sold-out, Premier League attendance is clearly limited by stadium size, so you would expect the stadium size groupings to show up clearly in the data, which they do. For the lower leagues, the sold-out fraction is lower, meaning stadium size isn't a limiting factor and doesn't show up as strongly in the attendance data.
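The sold-out fraction calculation itself is just attendance divided by capacity. A sketch with invented totals chosen only so the percentages match the table above (the real totals are different):

```python
import pandas as pd

# Invented attendance/capacity totals per tier; only the resulting
# percentages are taken from the 2024-2025 table, not these raw numbers.
df = pd.DataFrame({
    "league":     ["Premier League", "Championship", "League One", "League Two"],
    "attendance": [989_000, 814_000, 681_000, 565_000],
    "capacity":   [1_000_000, 1_000_000, 1_000_000, 1_000_000],
})

# Sold-out fraction: attendance as a percentage of available capacity.
df["sold_out_pct"] = 100 * df["attendance"] / df["capacity"]
print(df[["league", "sold_out_pct"]])
```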

There are a couple of points to make about stadium size. In the Premier League, the stadium size groupings support the idea of a league-within-a-league. Building 60,000+ capacity stadiums is hugely expensive, but if you can fill them, you get more revenue; Man Utd's stadium has a capacity of 74,197 compared to Bournemouth's 11,307: a vast difference in size and, of course, in ticket sales. In the lower leagues, some stadiums have capacity far in excess of attendance, which must be a financial drag. Clubs across all leagues are still expanding their stadiums, which is a striking vote of confidence in the future.

Attendance figures tell us a lot about changes in support and the structure of the game.

What of the future?

I’m hopeful for the future. I like the initiatives clubs are taking to make themselves family-friendly, and I’m pleased to see hatred and violence being stamped out. I'd love to see attendance rise in the lower leagues, and I'm very happy to see the rise of the women's game. Of course, some problems will persist, with trouble occurring sporadically. My expectation is that attendances will rise as the game-day experience becomes better for everyone.

Similar posts you might like

Saturday, July 26, 2025

Police and sharks: how not knowing how your data is processed can lead you badly astray

Inland shark attacks and police raids

I saw a post on LinkedIn that purported to show the frequency of shark attacks in the United States. There were little red dots along the coastline in all the areas you might expect. But there was one dot way inland, roughly in South Dakota, over 1,000 miles from the nearest ocean. People on LinkedIn were chiming in with explanations, but I knew why that dot was there: it was a warning that you need to know how your data's been processed if you're to understand it properly. I also knew why the same problem had led to a couple going through the living nightmare of repeated police raids for no reason. Let me tell you a couple of related stories about data processing problems and what they mean for analysis, and, of course, about shark attacks in South Dakota.

(Canva)

Latitude and longitude of IP addresses

Company X (I'm withholding the real name) collected and sold data on the geographic location of IP addresses. For example, it might determine that an IP address is located in Arlington, Virginia or Lyon, France. The company used a number of methods to determine location, but despite its best efforts, there were some IP addresses it could only resolve to the country level or to a rough geographic area.

The company's location data gave the latitude and longitude of the IP's geographic location. That's great for IPs it could nail down to a specific place, like Moscow, Russia, but what about IPs it could only locate at a coarser level? In those cases, it gave the geographical center of the region or country. For IPs it could only locate to the US, it assigned the latitude and longitude of the geographical center of the US (in South Dakota) or of the contiguous US (in Kansas). That's where the trouble started.
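The fallback behavior can be sketched like this. The function, field names, and exact centroid coordinates below are my illustrative assumptions, not Company X's actual system:

```python
# Approximate centroids (illustrative; not Company X's actual values).
US_CENTROID = (44.967, -103.771)            # near the center of the whole US
CONTIGUOUS_US_CENTROID = (39.828, -98.580)  # near Lebanon, Kansas

def geolocate(record):
    """Return (lat, lon) for an IP record, falling back to a country
    centroid when only country-level resolution is available."""
    if record.get("lat") is not None and record.get("lon") is not None:
        return (record["lat"], record["lon"])   # precise location known
    if record.get("country") == "US":
        return CONTIGUOUS_US_CENTROID           # country-level fallback
    return None

precise = {"country": "US", "lat": 38.88, "lon": -77.09}  # Arlington, VA
vague   = {"country": "US", "lat": None, "lon": None}

print(geolocate(precise))  # the real coordinates
print(geolocate(vague))    # the centroid, indistinguishable from a real farm
```

The trouble is that the fallback output has exactly the same shape as a precise answer: a consumer of this data can't tell a genuine location from a centroid.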

(Geographical center of the contiguous US - Wikipedia.)

Police raids

There's a lot of criminal activity on the internet. The police can sometimes track the IP addresses criminals use; if they can get a latitude and longitude for the IP address, they can locate the criminals. It sounds good, doesn't it? It's a staple of TV shows, where the bad guy is tracked to a specific address. But, as you've probably guessed, there's a problem.

An innocent couple bought a farm near the geographic center of the contiguous US. This happened to correspond to the latitude and longitude Company X used for IP addresses it could only resolve to the contiguous US. 

Various law enforcement agencies had the IP addresses criminals were using, and using Company X's data, they thought they'd found the location of the criminals. Of course, some of these IPs were only resolved at the country level, so the latitude and longitude pointed to the couple's farm. Law enforcement raided their farm numerous times (see for example https://www.bbc.com/news/technology-37048521), making their life a misery. 

After the couple took legal action, Company X solved the problem by moving the center latitude and longitude to the middle of a lake and taking other steps you can read about online. They reached a settlement with the couple.

Huge IP clusters in the middle of nowhere

I have some personal experience of this geo-IP problem. I was analyzing geo-IP data for the UK, and it showed a huge concentration of IP addresses in the north of England, in a forest. I knew this was wrong. After some digging, I found it was the same issue: the data supplier was geolocating IP addresses it couldn't precisely locate to the geographical center of the UK. I spotted the problem and changed my data processing to properly handle this kind of low-resolution data. I was running the project, so making changes was easy. Had I not known the UK, though, I might have thought there was a data center there, or theorized some other explanation.

Inland shark attacks

The internet shark attack data was geo data. You might expect that you would know for sure the location of a shark attack, but sometimes data is partially or badly entered. If shark attack data in the US is incompletely entered and the location isn't specified, then the most precise location will be the US (country-level location, nothing more), and the latitude and longitude of the attack will be the geographical center of the US, which is in South Dakota.

(LinkedIn image - I couldn't find the original image or the copyright holder)

I've replicated the LinkedIn image above. The original poster(s) didn't link back to the source of the chart or of the data set, so it's impossible to do any checking. The data set is supposedly shark attacks in the US, but other countries are shown too (except Canada, for some reason), and this "USA" chart leaves off Alaska and Hawaii, where there are certainly shark attacks. I did my own sleuthing, but I couldn't find the origin of the chart.

There were a couple of notable features about the LinkedIn discussion:

  • No one challenged the poster on the origin, consistency, or validity of the data set and chart.
  • The discussion about causes was very ill-informed, with almost no one suggesting data collection issues as the cause.

The minute I saw this chart, I knew data collection and entry was the likely cause. The only surprise for me was the lack of more inland dots: for example, there should have been dots at the center of New York State and other states, representing attacks where only the state is known.
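One defensive check before theorizing about inland sharks is to flag points that coincide with known fallback centroids. A minimal sketch, where the centroid coordinates are approximate and the tolerance is an arbitrary assumption:

```python
# Known fallback centroids (approximate coordinates).
KNOWN_CENTROIDS = {
    "US center": (44.967, -103.771),
    "contiguous US center": (39.828, -98.580),
}

def flag_centroid_points(points, tol=0.05):
    """Return points lying within tol degrees of a known fallback centroid,
    a common signature of missing-location data."""
    flagged = []
    for lat, lon in points:
        for name, (clat, clon) in KNOWN_CENTROIDS.items():
            if abs(lat - clat) <= tol and abs(lon - clon) <= tol:
                flagged.append(((lat, lon), name))
    return flagged

attacks = [(27.0, -80.1),          # Florida coast: plausible
           (44.967, -103.771),     # "South Dakota shark": fallback artifact
           (34.0, -118.4)]         # California coast: plausible
print(flag_centroid_points(attacks))
```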

COVID data

A similar sort of data collection issue happened during COVID. In many places, coroners' offices are closed over the weekend, which means a death that happened on a Friday night might be registered as happening on Monday. There can also be reporting delays (see https://ourworldindata.org/excess-mortality-covid). The net effect is that COVID death figures updated unevenly, with corrections. People who don't normally watch or handle data didn't understand what was going on, and conspiracy theorists seized on it as evidence of malfeasance.

Malware data

It's not just sharks, police raids, and COVID. I was analyzing some malware data and noticed some oddities: attacks were bunched around certain times in a very weird pattern, and I had trouble matching the timestamps of attacks with other data sources I had on the same attacks. It turned out the data collection process reassigned the timestamp; it was a processing timestamp, not an observation timestamp, which was why the times were so weird: different attack data were being processed at the same time. The bottom line for me was that I couldn't trust the malware data because of the way it had been prepared.

Provenance, chain of custody, and false accuracy

You should never entirely trust your data. All kinds of processing issues can give strange results. You can end up trying to explain shark attacks in South Dakota rather than knowing you're looking at data entry errors or processing issues.

In the geo-IP case, location was given by country/region/town fields plus latitude and longitude. If the region and/or town wasn't known, those fields were null, so the lack of certainty was correctly coded. However, the latitude and longitude were still given with full precision. This story points out the need to properly understand your data, how it's coded, and its level of accuracy.

I've seen similar issues with numbers of digits. People assume the time of an event is known very precisely because an exact timestamp is given, when in fact the timestamp is driven by the needs of the data processing or storage system. For example, the timing of a modern volcanic eruption might be known and stored in a database whose timestamp field requires a specific date and time. That's fine for modern eruptions, but if the database is extended to include ancient eruptions, you either change the database format or give specific dates and times for eruptions thousands or millions of years ago. In these kinds of systems, the year is often given in units of 10 (so 1600 CE, or 10,000 BCE) and the time as 12:00 or 0:00. If you're doing date and time analysis on volcanic data, you could get some very weird results.
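A simple heuristic can flag this kind of false precision: timestamps whose time-of-day is a placeholder (midnight or noon) and whose year is a round multiple of 10 deserve suspicion. The thresholds here are my illustrative assumptions:

```python
from datetime import datetime

def looks_like_placeholder(ts: datetime) -> bool:
    """Heuristic: flag timestamps that look like storage-format
    placeholders rather than real observations."""
    round_year = ts.year % 10 == 0
    placeholder_time = (ts.hour, ts.minute, ts.second) in [(0, 0, 0), (12, 0, 0)]
    return round_year and placeholder_time

eruptions = [
    datetime(2010, 4, 14, 9, 28),  # modern, precisely recorded time
    datetime(1600, 1, 1, 12, 0),   # suspicious: round year, noon placeholder
]
print([looks_like_placeholder(t) for t in eruptions])
```

A flagged timestamp isn't proof of a problem, but it's a prompt to go back and check how the field was populated.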

If you're analyzing data, here's what you need to do:

  • Understand where it comes from (provenance). Do you trust the supplier?
  • Understand how it was collected and processed (chain of custody). 
    • Do the fields in the data mean what you think they mean?
    • Do data storage requirements affect the data?
  • Check that the accuracy of the data is consistent across fields and across observations.
Know your data!