Predicting goals from data
I've been researching features that have some predictive power for goal scoring in English league football and I looked at number of seasons in a league as a possible factor. What I found surprised me. My results clearly show the Premier League is different and has become increasingly so over time. What I've found also provides very bad news for newly promoted teams, as we'll see.
I'll start with a few necessary definitions and the key charts, then I'll tell you what to look for in the charts so you can see for yourself. I've included a short briefing on English football at the end if you don't already have the background.
Analysis
For each club in each league, I calculate the number of goals they scored per season ("For goals"), the number of goals their opponents scored ("Against goals"), and the goal difference ("Net goals").
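A minimal sketch of how these three numbers could be calculated from raw match results. The input layout (a tuple of season, home club, away club, home goals, away goals), the function name, and the example results are all assumptions made for illustration, not the data or code behind the charts.

```python
from collections import defaultdict


def season_goal_totals(matches):
    """Aggregate match results into per-club, per-season goal totals.

    `matches` is an iterable of
    (season, home_club, away_club, home_goals, away_goals) tuples.
    This layout is an assumption made for the sketch.
    """
    totals = defaultdict(lambda: {"for": 0, "against": 0})
    for season, home, away, home_goals, away_goals in matches:
        # Each match contributes to both clubs' totals.
        totals[(season, home)]["for"] += home_goals
        totals[(season, home)]["against"] += away_goals
        totals[(season, away)]["for"] += away_goals
        totals[(season, away)]["against"] += home_goals
    # Net goals is simply "For goals" minus "Against goals".
    return {
        key: {"for": t["for"], "against": t["against"], "net": t["for"] - t["against"]}
        for key, t in totals.items()
    }


# Made-up results, purely for illustration.
example_matches = [
    (2025, "Club A", "Club B", 2, 1),
    (2025, "Club B", "Club A", 0, 0),
    (2025, "Club A", "Club C", 1, 3),
]
example_totals = season_goal_totals(example_matches)
```

Keying the result by (season, club) keeps one row per club per season, which is the unit plotted in the charts.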
I also calculated the number of contiguous seasons each club has been in its current league. Promotion and relegation reset the clock, which means that a newly promoted or relegated club has only been in its current league for one season. Here's an example: for the 2024-2025 season, Fulham have been in the Premier League for 3 seasons. They were in the Premier League before that, but their relegation and subsequent promotion reset the clock, so those earlier years don't count towards their current seasons in league.
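The "reset the clock" rule can be sketched as a running counter that restarts whenever a club's tier changes. The function name, the (season, tier) input layout, and the convention of labeling a season by its end year are assumptions for illustration. The sketch also treats consecutive rows as contiguous, which fits ignoring the war years but would not notice a club dropping out of the data entirely.

```python
def seasons_in_league(history):
    """Return the running count of contiguous seasons in the current league.

    `history` is a list of (season, tier) pairs for one club, sorted by
    season. Any change of tier (promotion or relegation) resets the
    count to 1.
    """
    counts = []
    previous_tier = None
    run = 0
    for season, tier in history:
        # Same tier as last season: extend the run. Otherwise start again at 1.
        run = run + 1 if tier == previous_tier else 1
        counts.append((season, tier, run))
        previous_tier = tier
    return counts


# Tiers entered by hand to mirror the Fulham example in the text;
# seasons are labeled by their end year (an assumption of this sketch).
fulham_history = [
    (2019, 1), (2020, 2), (2021, 1), (2022, 2),
    (2023, 1), (2024, 1), (2025, 1),
]
fulham_counts = seasons_in_league(fulham_history)
```

With this input, the last entry gives 3 seasons in tier 1 for the 2024-2025 season, and the earlier top-flight seasons don't count because each one was followed by a change of tier.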
I did this analysis for the top five leagues for every team for every year. The results are in the charts below.
The charts
The charts show "For goals", "Against goals", and "Net goals" vs seasons in league. Promoted teams are shown with green dots, relegated teams with red dots, and teams with no change are blue. Note the seasons in league axis is a log scale.
The black line is a log-linear fit to "guide the eye". I haven't provided goodness-of-fit data because the fit is non-linear and all I want is a general indication of the trend in the data. The fit excludes newly relegated teams.
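For readers who want to reproduce a similar guide line, here is a minimal sketch. It assumes "log-linear" means an ordinary least-squares straight line against the base-10 logarithm of seasons in league; the function name, the status labels, and the example points are hypothetical, and the fit used for the charts may differ in detail.

```python
import math


def log_linear_fit(points):
    """Least-squares fit of goals = intercept + slope * log10(seasons_in_league).

    `points` is a list of (seasons_in_league, goals, status) tuples, where
    status is "promoted", "relegated", or "no change" (labels assumed here).
    Newly relegated teams are left out of the fit, as described in the text.
    """
    kept = [(math.log10(s), g) for s, g, status in points if status != "relegated"]
    n = len(kept)
    mean_x = sum(x for x, _ in kept) / n
    mean_y = sum(y for _, y in kept) / n
    # Standard simple linear regression on the log-transformed x values.
    sxx = sum((x - mean_x) ** 2 for x, _ in kept)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in kept)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points, purely for illustration; the relegated point is ignored.
example_points = [
    (1, 40, "promoted"),
    (10, 50, "no change"),
    (100, 60, "no change"),
    (1, 90, "relegated"),
]
intercept, slope = log_linear_fit(example_points)
```

On a chart with a log-scaled seasons axis, a fit of this form appears as a straight line, which is why it works as a simple guide to the trend.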
The charts are interactive. You can:
- Hover over the points and you'll see the underlying data.
- Use the menu on the left to zoom in and out on the charts, select regions, download the charts, etc.
- Use the league tier widget to change the league (currently, League Tier 1 is the "Premier League" etc.).
- Change the year using the slider. Note there are no results for the war years and the leagues were formed in different years.
There's a lot going on here, so play with the charts and I'll break down the significance in the next section.
What the charts tell us
Longevity and clusters
Look at the Premier League (league tier 1) over the last twenty years. The data shows the emergence of three clusters or cohorts I'll label "veterans", "seniors", and "new boys".
The veterans have a tenure of 23 years or more in the top tier. Astonishingly, Arsenal have been in the top flight for 99 continuous years (excluding war time)! Think for a minute about companies in competitive industries: how many do you know that have stayed at the top for this long? Of course, it's not difficult to find English clubs that have been around for over a hundred years, so the age of the club is immaterial. What marks Arsenal out is how long they've been at the top without dropping down. How Arsenal (and the "veterans" cluster in general) have stayed at the top for so long must be worth several business school case studies.
Go back in time with the year slider and you'll notice this clustering becoming less obvious. It really only emerges after the formation of the Premier League, whose first season was 1992-1993. By 2025, it's become very strong.
Now look at the lower leagues (tiers 2 to 5). You'll notice three things:
- There's no strong clustering.
- There's no strong relationship between seasons in league and goals.
- The maximum seasons in league drops substantially.
Relegation and promotion
Over recent years, teams newly promoted into the Premier League have often been relegated at the end of their first season, and we can see this clearly in the data. Newly promoted teams ("new boys") just haven't performed as well as the "seniors" and "veterans". Importantly, this hasn't always been the case. Go back to 1992 and you'll see a different pattern: newly promoted teams performed roughly on par with other clubs in the league.
Compare the Premier League (tier 1) to the lower tiers and you'll see a contrasting story. The "revolving door" isn't there in the lower leagues. In fact, newly promoted clubs tend to perform roughly the same as clubs that have been in the league longer. This suggests more equality within each league, and also a smaller gap between tier 2 and the tiers below it.
Clubs relegated to the EFL Championship from the Premier League tend to do well, as you might expect. There's a small group of clubs that seem to revolve between the Premier League and the Championship.
The data shows a widening gap between the Premier League and the EFL Championship (tier 2), but not between leagues lower down.
Conclusions
In the Premier League in recent years, three cohorts have emerged: "veterans", "seniors", and "new boys". There's an increasing gap between the established Premier League teams and the "new boys" coming up from the Championship; the "new boys" face an uphill struggle to survive and may well be relegated quickly. If I were doing any kind of cluster analysis, this would be a good place to start. However, I care about goal scoring features, and over the last few years, tenure in the Premier League has been a useful predictor of goals.
The lower leagues are much more equal. Newly promoted teams fare better and tenure in the league isn't a useful predictor of goals.
As a throwaway thought, would it be "easier" to predict Premier League match results than, say, EFL Championship results? The Premier League seems to "favor" established clubs in a way the other leagues don't.
All my analysis so far points to the same conclusion: the Premier League is different and has been getting more so over time.
English league football background
English league football started in 1888 with a single division. The game was successful and there were a number of non-league teams that wanted to join, so in 1892, a new second division was formed. A third division followed in 1920, run as regional North and South sections from 1921, and in 1958 the league reorganized these into national third and fourth divisions. Very helpfully, these leagues were named "First Division", "Second Division", etc. In 1979, a fifth tier called the "Alliance Premier League" was created. There are leagues below the fifth tier and there have been since very early on. The number of clubs in each league varied over time for various reasons.
Since the formation of the Second Division in 1892, there's been promotion and relegation between leagues. The top teams go up to the league above and the bottom teams go down to the league below. How many teams get relegated/promoted varies from year to year, as do the rules.
The Premier League was created in 1992 as a breakaway league by the then First Division clubs, with its first season running from 1992 to 1993. This set off a chain reaction of league name changes over time. Instead of the old, easy to understand "First Division", "Second Division" etc., we now have "Premier League", "EFL Championship", "League One", "League Two", and the "National League". These leagues have had other names since the breakaway.
During World Wars I and II, the leagues were suspended in their usual form. For the purposes of analysis, I've ignored these years.