Showing posts with label football. Show all posts

Monday, December 15, 2025

Poisson to predict football results?

Goals are Poisson distributed?

I've read a lot of literature that suggests that goals in games like football (soccer) and hockey (ice hockey) are Poisson distributed. But are they? I've found out that it's not as simple as some of the papers and articles out there suggest. To dig into it, I'm going to define some terms and show you some analysis.

The Poisson distribution

The Poisson distribution is a discrete distribution that shows the probability distribution of the number of independent events occurring over a fixed time period or interval. Examples of its use include: the number of calls in a call center per hour, website visits per day, and manufacturing defects per batch. Here's what it looks like:

If this were a chart of defects per batch, the x-axis would be the number of defects and the y-axis would be the probability of that number of defects, so the probability of 2 defects per batch would be about 0.27 (or 27%).

Here's its probability mass function, where \(k\) is the number of events and \(\lambda\) is the mean number of events per interval:

\[P(X = k) = \frac{ \lambda^{k}e^{-\lambda}}{k!} \]
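
To make the formula concrete, here's a minimal Python sketch that evaluates it directly. The function name `poisson_pmf` and the mean of 2 defects per batch are my own choices for illustration; they aren't taken from the chart above.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the mean number of events is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical mean of 2 defects per batch, chosen only for illustration.
lam = 2.0
for k in range(7):
    print(k, round(poisson_pmf(k, lam), 4))
```

The probabilities over all values of k sum to 1, which is a quick sanity check on any implementation.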

Modeling football goals - leagues and seasons

A lot of articles, blogs, and papers suggest that football scores are well-modeled by the Poisson distribution. This is despite the fact that goals are not wholly independent of one another; it's well-known that scoring a goal changes a game's dynamics. 

To check if the Poisson distribution models scores well, here's what I did.

  1. Collected all English football league match results from 1888 to the present. This data includes the following fields: league_tier, season, home_club, home_goals, away_club, away_goals.
  2. Calculated a field total_goals (away_goals + home_goals).
  3. For each league_tier and each season, calculated relative frequency for total_goals, away_goals, and home_goals.
  4. Curve fit a Poisson distribution to the data.
  5. Calculated \(\chi^2\) and the associated p-value.

This gives me a dataframe of \(\chi^2\)  and p for each league_tier and season. In other words, I know how good a model the Poisson distribution is for goals scored in English league football.
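
The fit-and-test steps can be sketched roughly as follows. This is not my actual analysis code; it's a simplified, standard-library-only version that makes a few assumptions. It estimates \(\lambda\) from the sample mean instead of curve fitting, works on raw counts rather than relative frequencies, and lumps the top of the distribution into a single "max or more" bin. The goal counts are made up for illustration.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the mean is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def poisson_chi_square(goals):
    """Return (lam, chi2, dof) for a list of per-match goal counts."""
    n = len(goals)
    lam = sum(goals) / n                 # sample mean as the estimate of lambda
    observed = Counter(goals)
    max_k = max(goals)
    # Bins are 0, 1, ..., max_k - 1, plus a final bin for "max_k or more".
    probs = [poisson_pmf(k, lam) for k in range(max_k)]
    probs.append(1.0 - sum(probs))
    chi2 = sum(
        (observed.get(k, 0) - n * p) ** 2 / (n * p)
        for k, p in enumerate(probs)
    )
    dof = len(probs) - 2                 # bins - 1, minus 1 for the fitted lambda
    return lam, chi2, dof

# Made-up total_goals values for 40 matches; not real league data.
total_goals = [0] * 3 + [1] * 8 + [2] * 11 + [3] * 9 + [4] * 5 + [5] * 3 + [6] * 1
lam, chi2, dof = poisson_chi_square(total_goals)
print(f"lambda={lam:.2f}  chi2={chi2:.2f}  dof={dof}")
```

If SciPy is available, `scipy.stats.chi2.sf(chi2, dof)` would turn the statistic into a p-value. In practice you'd also want to merge bins with very small expected counts before trusting the result.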

This is the best fit (lowest \(\chi^2\) for total_goals). It's for league_tier 2 (the EFL Championship) and season 2022-2023. The Poisson fit here is very good. There are a lot of league_tiers and seasons with pretty similar fits.

Here's the worst fit (highest \(\chi^2\) for total_goals). It's for league_tier 2 (the Second Division) and the 1919-1920 season (the first one after the First World War). By eye, it's still a reasonable approximation. It's an outlier though; there aren't many league_tiers and seasons with fits this bad.


Overall, it's apparent that the Poisson distribution is a very good way of modeling football results at the league_tier and season level. The papers and articles are right. But what about at the team level?

Modeling goals at the club level

Each season, a club faces a new set of opponents. If they change league tier (promotion, relegation), their opponents will be pretty much all new. If they stay in the same league, some opponents will be different (again due to promotion and relegation). If we want to test how good the Poisson distribution is at modeling results at the club level, we need to look season-by-season. This immediately introduces a noise problem; there are many more matches played in a league tier in a season than an individual club will play.
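
To put a rough number on the noise problem, here's a small sketch using the standard error of a relative frequency, \(\sqrt{p(1-p)/n}\). It assumes a 20-club league where every club plays every other club home and away, and a hypothetical probability of 0.25 for some particular scoreline; both are illustrative assumptions, not values from my dataset.

```python
import math

def relative_frequency_se(p: float, n_matches: int) -> float:
    """Standard error of an observed relative frequency when the true probability is p."""
    return math.sqrt(p * (1 - p) / n_matches)

clubs = 20                                # assumed league size
matches_per_club = 2 * (clubs - 1)        # each club plays the others home and away
matches_per_tier = clubs * (clubs - 1)    # every pairing is played twice
p = 0.25                                  # hypothetical probability, for illustration only

se_club = relative_frequency_se(p, matches_per_club)
se_tier = relative_frequency_se(p, matches_per_tier)
print(f"club: n={matches_per_club}, standard error={se_club:.3f}")
print(f"tier: n={matches_per_tier}, standard error={se_tier:.3f}")
print(f"ratio: {se_club / se_tier:.2f}")
```

With ten times fewer matches, each point on a club's relative frequency chart is about three times noisier than the same point on the league tier chart.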

Following the same sort of process as before, I looked at how well the Poisson models goals at the club level. The answer is: not well.

The best-performing fit has a low \(\chi^2\) of 0.05; the worst has a value of 98643. This range is a bit misleading though, because a lot of the fits are bad. Rather than show you the best and the worst, I'll just show you the results for one team and one season: Liverpool in 2024-2025.

(To clarify, total goals is the total number of goals scored in a season by a club; it's the sum of their home goals and their away goals.)

I checked the literature for club results modeling and I found that some authors found a Poisson distribution at the club level if they modeled the data over several seasons. I have mixed feelings about this. Although conditions vary within a season, they're more consistent than across different seasons. Over a period of several years, a majority of the players might have changed and of course, the remaining players will have aged. Is the Arsenal 2019 team the same as the Arsenal 2024 team? Where do you draw the line? On the other hand, the authors did find the Poisson distribution fit team results when aggregating over multiple seasons. As with all things in modeling sports results, there are deeper waters here and more thought and experimentation is required.

Although my season-by-season club fit \(\chi^2\) values aren't crazy, I think you'll agree with me that the fit isn't great and not particularly useful. Sadly, this is the consistent story with this data set. The bottom line is, I'm not sure how useful the Poisson distribution is for predicting scores at the club level for a single season.

Some theory that didn't work

It could be noise driving the poor fit at the club level, which is a variant of the "law of small numbers", but it could be something else. Looking at these results, I'm wondering if this is a case of the Poisson Limit Theorem. The Poisson Limit Theorem is simple: it states that as the number of trials in a Binomial distribution increases towards infinity, while the probability of success per trial shrinks so that the expected number of successes stays fixed, the distribution tends to the Poisson distribution. In other words, a Binomial distribution with many trials and a small success probability looks like a Poisson distribution.
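
Here's a quick numerical sketch of the Poisson Limit Theorem. In the usual statement of the theorem, the success probability p shrinks as the number of trials n grows so that n times p stays fixed at \(\lambda\). The mean of 2.5 and the function names are hypothetical choices for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def max_gap(n, lam, max_k=10):
    """Largest pointwise difference between Binomial(n, lam/n) and Poisson(lam) for k = 0..max_k."""
    p = lam / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
        for k in range(max_k + 1)
    )

lam = 2.5  # hypothetical mean, for illustration only
for n in (10, 100, 1000, 10000):
    print(f"n={n:>5}  max |Binomial - Poisson| = {max_gap(n, lam):.6f}")
```

The gap shrinks as n grows, which is the theorem at work. Note that n here is the number of trials inside the Binomial model, not the number of matches in the sample.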

The obvious thing to do is to try fitting the data using the Binomial distribution instead. If the Binomial doesn't fit any better, it's not the Poisson Limit Theorem. 

I tried fitting the club data using the Binomial distribution and I got fractionally better results, but not enough that I would use the Binomial distribution for any real predictions. In other words, this isn't the Poisson Limit Theorem at work.

I went back to all the sources that spoke about using the Poisson distribution to predict goals. All of them used data aggregated to the league or season level. One or two used the Poisson to try and predict who would end up at the top of a league at the end of the season. No one showed results at the club level for a single season or wrote about club-level predictions. I guess I know why now.

Some thoughts on next steps

There are four things I'm mulling over:

  • The Poisson distribution is a good fit for a league tier for a season.
  • I don't see the Poisson distribution as a good fit for a club for a season.
  • Some authors report the Poisson distribution is a fit for a club over several (5 or more) seasons. But clubs change over time, sometimes radically over short periods.
  • The Poisson Limit Theorem kicks in if you have enough data.

A league tier consists of several clubs; right now, there are 20 clubs in the Premier League. By aggregating the results over a season for 20 unrelated clubs, I get data that's well-fitted by the Poisson distribution. I'm wondering if the authors who modeled club data over five or more seasons got it right for the wrong reason. What if they aggregated the results of 5 unrelated clubs in the same season or even different seasons? In other words, did they see a fit to multi-season club data because of aggregation alone?

Implications for predicting results

The Poisson distribution is a great distribution to use to model the goals scored at the league and season level, but not so much at the club level. The Binomial distribution doesn't really work at the club level either. It may well be each team plays too few matches in a season for us to fit their results using an off-the-shelf distribution. Or put another way, randomness is too big an element of the game to let us make quick and easy predictions.

Friday, September 26, 2025

More money means more goals

Winner takes all?

Do clubs with the most expensive players score more goals in English league football? The answer is a strong yes.

In this blog post, I'll show an analysis of goals scored vs. club transfer value and you'll clearly see a strong correlation. Of course, it's not the only factor that affects goals scored, but it's a strong signal.

(Google Gemini. Note the Euro has three legs!)

The data

The data comes from TransferMarkt (https://www.transfermarkt.com/), which publishes market values for clubs. The market value is the estimated transfer value of all the players in the club squad. Obviously, transfer values change over time when players are bought, sold, or are injured. TransferMarkt have club transfer values at the start of each season and they also provide biweekly values. For this analysis, I've used the season start values. The dataset starts properly in 2010 for the top four tiers.

The charts

The charts below show goals for, against, and net (for - against) vs. total club transfer value for each club for each season for each league. The slider lets you change the year and the buttons let you change the league tier. The points on the charts are individual clubs and the line is a linear regression fit. The \(r^2\) and p-value for the fit are in the chart title. The blue band is the 95% confidence interval on the fit.
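
For readers who want to see what the regression line and \(r^2\) involve, here's a minimal ordinary least-squares sketch in plain Python. It isn't the code behind the charts, and the club values and goal counts are made up for illustration; they are not TransferMarkt data.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus r squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)   # square of the correlation coefficient
    return slope, intercept, r_squared

# Made-up values for six hypothetical clubs in one season.
club_value_millions = [12, 25, 40, 80, 150, 300]
goals_for = [38, 45, 44, 58, 63, 85]

slope, intercept, r_squared = linear_fit(club_value_millions, goals_for)
print(f"slope={slope:.3f}  intercept={intercept:.1f}  r^2={r_squared:.2f}")
```

If SciPy is available, `scipy.stats.linregress` returns the slope, intercept, correlation coefficient, and p-value in one call.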

In addition to the buttons and slider, the charts are interactive:

  • You can hover over points and see their values.
  • You can zoom-in or zoom-out using the tool menu on the left.
  • You can save the charts using the tools menu on the left.

Take a while to play with the charts.

What the charts show

All leagues show the following trends:

  • Higher club value = more goals for
  • Higher club value = fewer goals against
  • Higher club value = more net goals

The strength of this correlation varies by league and by time, but it's there.

The \(r^2\) value varies in the range 0.4 to 0.91, suggesting a good correlation, but it's not the only factor; there are other factors we need to consider to fully model goals. The p-values are close to 0, indicating this correlation is very unlikely to have happened by chance.

Take a look at league tier 3 for 2024 (this is currently called "League One"). There's a huge outlier and it's Birmingham City. These guys were in the Premier League not so long ago, but suffered a number of problems on and off the pitch which led to their relegation. They've recently had a big cash injection and are now owned (in part) by Tom Brady. Part of this big cash injection was new management and new players. As a result, they were promoted back to the EFL Championship (tier 2) in 2025. In other words, they're a big club temporarily fallen on hard times; they're an outlier.

If you take a look at tier 2, you'll see the top valued clubs are pretty much all clubs recently relegated from the Premier League. To play in the Premier League, you need top-quality talent, and that's expensive. On the flip side, you get more gate revenue and TV money. Relegated teams face a number of issues: star players may leave and revenues drop precipitously. To stand any chance of being promoted, clubs need to retain top talent at the same time as their revenue has fallen. These conflicting requirements can lead, and have led, to financial instability. To ease the relegation transition, the Premier League provides "parachute" payments to relegated clubs. The upshot is, newly relegated teams are in a better place than the other clubs in the league; they have parachute money and good players.

Children's fiction, Ted Lasso, and Wrexham 

Growing up in England, there was a lot of football fiction aimed at kids. A staple of the genre was a struggling team that somehow make it to the top, out-playing bigger and more expensive teams. Sadly, this just isn't the reality and probably never was; money is pretty much the only way up. Looking back, I'm not sure the financial underdog fantasy was helpful.

Both the fictional Ted Lasso and the real Wrexham are in the news. Notably, neither Ted Lasso nor Wrexham is a rags-to-riches tale.

In Ted Lasso, the fictional Richmond team owner brought in Ted Lasso to tank the team performance to spite her ex-husband. The team had plenty of money (lack of money was never a major story line). Perhaps the writers felt that having a cheap team rise to the top would be too unrealistic. 

Wrexham's upward path has been paid for by Hollywood money, and in fact Wrexham's club value is pretty typical of a League One team; they're very much not the financial underdog.

The rags-to-riches fantasy, or maybe, the financial underdog-wins-all fantasy, is just a fantasy.

The bottom line

The bottom line is the bottom line. Money talks, and if you want to score the goals, you've got to spend the cash.