Showing posts with label soccer. Show all posts

Monday, November 8, 2021

Football crazy: predicting Premier League football match results

I can get a qualification and be rich!

A long time ago, I was part of a gambling syndicate. A friend of mine had some software that predicted the results of English football (soccer) matches and at the time, betting companies offered fixed-price odds for certain types of bets. My friend noticed his software predicted 3-2 away wins more often than the betting company's odds would suggest. Over the course of a season, we had a 20% return on our gambling investment. 

During the COVID lockdown, I took the opportunity to learn R and did a long course that included a capstone project. I decided to see if I could forecast English Premier League (EPL) matches. If I succeeded, I could get a qualification and get rich too! What's not to like? Here's the story of what I did and what happened.

Premier League data

There's an eighteenth-century recipe for a hare dish that supposedly includes the instructions "First, catch your hare." The first step in any project like this is getting your data.

I got match results going back to the start of the league (1993) from football-data. The early data is only match results, but later data includes red cards and some other measurements.

TransferMarkt has data on transfer fees, foreign-born players, and team age, but the data's only available from 2011.

At the time of the project, I couldn't find any other free data sources. There were and are paid-for sources, but they were way beyond what I was willing to pay.

I knew going into the next phase of the project that this wasn't a very big data set and that it didn't have many fields. As it turned out, data was a severely limiting factor.

What factors are important?

I had a set of initial hypotheses for factors that might be important for final match scores. Here are most of them:

  • team cost - more expensive teams should win more games
  • team age - younger teams perform better
  • prior points - teams with more points win against teams with fewer points
  • foreign-born players - the more non-English players on the team, the more the team will win
  • previous match results - successful (winning) teams win more matches
  • home-field advantage
  • disciplinary record - red and yellow card history might be an indicator of risk-taking
  • season effects - as the season wears on, teams take more risks to win matches

I found evidence that most of these did in fact have an impact.

Here's strong evidence of home-field advantage. Note how it goes away during the 2020-2021 season when matches were played without fans.

Here's goal difference vs. team cost difference. The more expensive team tends to win.

Here's goal difference vs. mean prior goal difference. Teams that scored more goals before tend to score more goals in their current match.
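To make "mean prior goal difference" concrete, here's a minimal Python sketch of how such a feature could be built (the project itself used R; the function name, the tuple layout, and the sample results are made up for illustration). The key point is that each match only sees results from earlier matches.

```python
from collections import defaultdict


def mean_or_none(values):
    """Mean of a list, or None when the list is empty."""
    return sum(values) / len(values) if values else None


def mean_prior_goal_difference(matches):
    """For each match, return (home_mean, away_mean): each team's mean goal
    difference over its earlier matches, or None if it has none yet.

    `matches` is a chronologically ordered list of
    (home_team, away_team, home_goals, away_goals) tuples.
    """
    history = defaultdict(list)  # team -> goal differences in earlier matches
    features = []
    for home, away, home_goals, away_goals in matches:
        features.append((mean_or_none(history[home]), mean_or_none(history[away])))
        # Update the histories only after computing the feature, so a match
        # never contributes to its own predictor.
        history[home].append(home_goals - away_goals)
        history[away].append(away_goals - home_goals)
    return features


# Made-up results, purely illustrative.
matches = [
    ("A", "B", 2, 0),
    ("B", "C", 1, 1),
    ("A", "C", 0, 1),
]
features = mean_prior_goal_difference(matches)
print(features)
```

Teams with no earlier matches get None, which is the situation for newly promoted teams whose pre-league history isn't in the data.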

I found more relationships you can read about if you're interested.

Machine learning

Thinking back to my gambling syndicate, I decided to forecast the score of each match rather than just win/lose/draw. My loss function was the RMSE of the goal difference between the predicted score and the actual score. To avoid COVID oddities, I removed the 2020-2021 season (the price being a smaller data set). Of course, I used a training and holdout dataset and cross-validation. 
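For reference, here's a minimal Python sketch of an RMSE calculation for one target, such as away goals (the project itself used R; the function name and the numbers are made up for illustration).

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: a model that always predicts one away goal.
actual_away_goals = [0, 1, 2, 1]
predicted_away_goals = [1.0, 1.0, 1.0, 1.0]
error = rmse(predicted_away_goals, actual_away_goals)
print(round(error, 3))
```

Because the errors are squared before averaging, a few badly missed scorelines raise the RMSE more than many near misses do.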

The obvious question is, which machine learning models work? I decided to try a whole bunch of them:

  • Naive mean score model. A simple model that’s just the mean scores of the (training) data set.
  • Generalized Linear Model. A generalization of ordinary linear regression.
  • Glmnet. Fits lasso and elastic-net regularized generalized linear models.
  • SVM. Support Vector Machines - boundary-based regression. After some experimentation, I selected the svmRadial form of SVM, which uses a non-linear kernel function.
  • KNN. K-nearest neighbors. Given that EPL scores are all in close proximity to one another, we might expect this model to return good results.
  • Neural nets.
  • XGB Linear. This is linear modeling with extreme gradient boosting. Extreme gradient boosting has gathered a lot of attention over the last few years and may be one of the most used machine learning models today.
  • XGB Tree. This is a decision tree model with extreme gradient boosting.
  • Random Forest.

The model results weren't great. For the KNN model, here's how the RMSE for full-time away goals varied with n.

Note the RMSE scale - the lowest it gets is 1.1 goals, and it's plain that increasing n will only take us a little closer to 1.1. Bear in mind, football is a low-scoring game, and being off by 1 goal is a big miss.
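Here's a minimal Python sketch of this kind of sweep, assuming n is the number of neighbors (the project itself used R; the single made-up feature, the data, and the function names are illustrative only). It also shows why the curve flattens: once the neighbor count reaches the size of the training set, KNN regression predicts the training mean for every match, which is exactly the naive mean model.

```python
import math


def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean y of the k nearest training points (1-D feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Made-up data: one feature (say, a cost difference) and away goals as the target.
train_x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
train_y = [0, 1, 1, 2, 2, 3]
test_x = [-1.5, 0.5, 2.5]
test_y = [1, 1, 2]

rmse_by_k = {}
for k in (1, 3, 6):
    predictions = [knn_predict(train_x, train_y, x, k) for x in test_x]
    rmse_by_k[k] = rmse(predictions, test_y)

for k, value in rmse_by_k.items():
    print(k, round(value, 3))
```

With k equal to 6 (the whole training set), every prediction is 1.5, the mean of the training targets.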

It was the same story for random forest.

In fact, it was the same story for all of the models. Here are my final results. My model forecast home goals and away goals.

The naive means model is the simplest, and all my sophisticated models could do was give me an improvement of a few percentage points.

Improving the results

Perhaps the most obvious way forward is combining models to improve RMSE. I'm reluctant to do that until I can get better individual model results. There's a philosophical issue at play; for me, the ensemble approach feels a bit "spray and pray".

In my view, data shortage is the main problem:

  • My data set was only in the low thousands of matches. 
  • Some teams join the Premier League for just a season and then get relegated - I don't model their history prior to joining the league. 
  • I removed the COVID season of 2020-2021. 
  • I only had team value and disciplinary data for ten or so seasons. 
  • Of course, I only modeled the Premier League.

Football is a low-scoring game, famous for its upsets. It may well be that it's just too random underneath to make useful predictions at the individual match level. 

What's next?

I wasn't able to predict EPL results with any great accuracy, but I submitted my report and got my grade. If you want to read my report, you can read it here.

At the end of the 2021 season, I saw some papers published on the COVID effect on match results. I had similar results months before. Perhaps I should have submitted a paper myself.

At some point, I might revive this project if I can get new data. I still occasionally hunt for new data sources, but sadly, I haven't found any. My dreams of retiring to a yacht on the Mediterranean will have to wait.

Monday, January 4, 2021

COVID and soccer home team advantage - winning less often

Home advantage

Is it easier for a sports team to win at home? The evidence from sports as diverse as soccer [Pollard], American football [Vergina], rugby [Thomas], and ice hockey [Leard] strongly suggests there is a home advantage and it might be quite large. But what causes it? Is it the crowd cheering the home team, or closeness to home, or playing on familiar turf? One of the weirder side-effects of COVID is the insight it's providing into the origins of home advantage, as we'll see.

(Premier League teams playing in happier times. Image source: Wikimedia Commons, License: Creative Commons, Author: Brian Minkoff)

The EPL - lots of data makes analysis easier

The English Premier League is the world's wealthiest sports league [Robinson]. There's worldwide interest in the league and there has been for a long time, so there's a lot of data available, which makes it ideal for investigating home advantage. One of the nice features of the league is that each team plays every other team twice, once at home and once away.

Expectation and metric

If there were no home team advantage, we would expect the number of home wins and away wins to be roughly equal for the whole league in a season. To investigate home advantage, the metric I'll use is:
\[\text{home win proportion} = \frac{\text{number of home wins}}{\text{total number of wins}}\]
If there were no home team advantage, we would expect this number to be close to 0.5.
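As a minimal Python sketch of the metric (the function name and the scores are made up for illustration), note that draws drop out of both the numerator and the denominator:

```python
def home_win_proportion(results):
    """Home wins divided by all wins (home plus away); draws are excluded.

    `results` is an iterable of (home_goals, away_goals) pairs.
    """
    results = list(results)
    home_wins = sum(1 for home, away in results if home > away)
    away_wins = sum(1 for home, away in results if home < away)
    total_wins = home_wins + away_wins
    if total_wins == 0:
        raise ValueError("no matches with a winner")
    return home_wins / total_wins


# Made-up scores: three home wins, two away wins, one draw.
results = [(2, 1), (0, 0), (1, 3), (3, 0), (1, 0), (0, 2)]
proportion = home_win_proportion(results)
print(proportion)
```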

EPL home team advantage

Let's look at the mean home win proportion per season for the EPL. In the chart, the error bars are the 95% confidence interval.
For most seasons, the home win proportion is about 0.6 and it's significantly above 0.5 (in the statistical sense). In other words, there's a strong home-field advantage in the EPL.
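Here's a minimal Python sketch of one common way to put a 95% confidence interval on a proportion, the normal approximation. This is an assumption for illustration: it isn't necessarily how the error bars in the chart were calculated, and the counts below are made up.

```python
import math


def proportion_confidence_interval(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    z=1.96 corresponds to a 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    margin = z * math.sqrt(p * (1 - p) / trials)
    return p - margin, p + margin


# Made-up season: 180 home wins out of 300 matches that had a winner.
low, high = proportion_confidence_interval(180, 300)
print(round(low, 3), round(high, 3))
```

If the whole interval sits above 0.5, as it does here, the home win proportion is significantly above 0.5 at that confidence level. With fewer matches the margin grows, which is why a partial season has wide error bars.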

But look at the point on the right. What's going on in 2020-2021?

COVID and home wins

Like everything else in the world, the EPL has been affected by COVID. Teams are playing behind closed doors for the 2020-2021 season. There are no fans singing and chanting on the terraces, there are no fans 'oohing' over near misses, and there are no fans cheering goals. Teams are still playing matches home and away but in empty and silent stadiums.

So how has this affected home team advantage?

Take a look at the chart above. The 2020-2021 season is the season on the right. Obviously, we're still partway through the season, which is why the error bars are so big, but look at the mean value. If there were no home team advantage, we would expect a mean of 0.5. For 2020-2021, the mean is currently 0.491. 

Let me put this simply. When there are fans in the stadiums, there's a home team advantage. When there are no fans in the stadiums, the home team advantage disappears.

COVID and goals

What about goals? It's possible that a team that might have lost is so encouraged by their fans that they reach a draw instead. Do teams playing at home score more goals?

I worked out the mean goal difference between the home team and the away team and I've plotted it for every season from 2000-2001 onwards.
If there were no home team advantage, you would expect the goal difference to be 0. But it isn't. It mostly hovers around 0.35. Except of course for 2020-2021. For 2020-2021, the goal difference is about zero. The home-field advantage has gone.
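The calculation is just home goals minus away goals, averaged over each season's matches. Here's a minimal Python sketch (the function name, the tuple layout, and the scores are made up for illustration):

```python
from collections import defaultdict


def mean_goal_difference_by_season(matches):
    """Mean of (home goals - away goals) per season.

    `matches` is an iterable of (season, home_goals, away_goals) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # season -> [sum of differences, match count]
    for season, home_goals, away_goals in matches:
        totals[season][0] += home_goals - away_goals
        totals[season][1] += 1
    return {season: total / count for season, (total, count) in totals.items()}


# Made-up scores, purely illustrative.
matches = [
    ("2018-2019", 2, 1),
    ("2018-2019", 1, 1),
    ("2018-2019", 3, 1),
    ("2020-2021", 0, 1),
    ("2020-2021", 2, 1),
]
means = mean_goal_difference_by_season(matches)
print(means)
```

Unlike the home win proportion, this measure includes draws, since a drawn match contributes a goal difference of zero.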

What this means

Despite the roll-out of the vaccine, it's almost certain the rest of the 2020-2021 season will be played behind closed doors (assuming the season isn't abandoned). My results are for a partial season, but it's a good bet the final results will be similar. If this is the case, then it will be very strong evidence that fans cheering their team really do make a difference.

If you want your team to win, you need to go to their games and cheer them on. 

References

[Leard] Leard B, Doyle JM. The Effect of Home Advantage, Momentum, and Fighting on Winning in the National Hockey League. Journal of Sports Economics. 2011;12(5):538-560.

[Pollard] Richard Pollard and Gregory Pollard, Home advantage in soccer: a review of its existence and causes, International Journal of Soccer and Science Journal Vol. 3 No 1 2005, pp28-44

[Robinson] Joshua Robinson, Jonathan Clegg, The Club: How the English Premier League Became the Wildest, Richest, Most Disruptive Force in Sports, Mariner Books, 2019

[Thomas] Thomas S, Reeves C, Bell A. Home Advantage in the Six Nations Rugby Union Tournament. Perceptual and Motor Skills. 2008;106(1):113-116

[Vergina] Roger C. Vergin, John J. Sosik, No place like home: an examination of the home field advantage in gambling strategies in NFL football, Journal of Economics and Business, Volume 51, Issue 1, January–February 1999, Pages 21-31