I can get a qualification and be rich!

A long time ago, I was part of a gambling syndicate. A friend of mine had some software that predicted the results of English football (soccer) matches and at the time, betting companies offered fixed-price odds for certain types of bets. My friend noticed his software predicted 3-2 away wins more often than the betting company's odds would suggest. Over the course of a season, we had a 20% return on our gambling investment.

(Ronnie Macdonald from Chelmsford, United Kingdom, CC BY 2.0, via Wikimedia Commons)

During the COVID lockdown, I took the opportunity to learn R and did a long course that included a capstone project. I decided to see if I could forecast English Premier League (EPL) matches. If I succeeded, I could get a qualification and get rich too! What's not to like? Here's the story of what I did and what happened.

Premier League data

There's an eighteenth-century recipe for a hare dish that supposedly includes the instructions "First, catch your hare." The first step in any project like this is getting your data.

(Robert Scoble from Half Moon Bay, USA, CC BY 2.0, via Wikimedia Commons)

I got match results going back to the start of the league (1993) from football-data. The early data is only match results, but later data includes red cards and some other measurements.

TransferMarkt has data on transfer fees, foreign-born players, and team age, but the data's only available from 2011.

At the time of the project, I couldn't find any other free data sources. There were and are paid-for sources, but they were way beyond what I was willing to pay.

I knew going into the next phase of the project that this wasn't a very big data set with not that many fields. As it turned out, data was a severely limiting factor.

What factors are important?

I had a set of initial hypotheses for factors that might be important for final match scores, here are most of them:

team cost - more expensive teams should win more games
team age - younger teams perform better
prior points - teams with more points win against teams with fewer points
foreign-born players - the more non-English players on the team, the more the team will win
previous match results - successful (winning) teams win more matches
home-field advantage
disciplinary record - red and yellow card history might be an indicator of risk-taking
season effects - as the season wears on, teams take more risks to win matches

I found evidence that most of these did in fact have an impact.

Here's strong evidence of home-field advantage. Note how it goes away during the 2020-2021 season when matches were played without fans.

Here's goal difference vs. team cost difference. The more expensive team tends to win.

Here's goal difference vs. mean prior goal difference. Teams that scored more goals before tend to score more goals in their current match.

I found more relationships you can read about if you're interested.

Machine learning

Thinking back to my gambling syndicate, I decided to forecast the score of each match rather than just win/lose/draw. My loss function was the RMSE of the goal difference between the predicted score and the actual score. To avoid COVID oddities, I removed the 2020-2021 season (the price being a smaller data set). Of course, I used a training and holdout dataset and cross-validation.

The obvious question is, which model machine learning models work? I decided to try a whole bunch of them:

Naive mean score model. A simple model that’s just the mean scores of the (training) data set.
Generalized Linear Model. A form of ordinary linear regression.
Glmnet. Fits lasso and elastic-net regularized generalized linear models.
SVM. Support Vector Machines - boundary-based regression. After some experimentation, I selected the svmRadial form of SVM, which uses a non-linear kernel function.
KNN. K-nearest neighbors. Given that EPL scores are all in close proximity to one another, we might expect this model to return good results.
Neural nets.
XGB Linear. This is linear modeling with extreme gradient boosting. Extreme gradient boosting has gathered a lot of attention over the last few years and may be one of the most used machine learning models today.
XGB Tree. This is a decision tree model with extreme gradient boosting.
Random Forest.

The model results weren't great. For the KNN model, here's how the RMSE for full-time away goals varied with n.

Note the RMSE scale - the lowest it goes to is 1.1 goals and it's plain that adding more n will only take us a little closer to 1.1. Bear in mind, football is a low-scoring game, and being off by 1 goal is a big miss.

It was the same story for random forest.

In fact, it was the same story for all of the models. Here are my final results. My model forecast home goals and away goals.

The naive means model is the simplest and all my sophisticated models could do is give me a few percentage points improvement.

Improving the results

Perhaps the most obvious way forward is combining models to improve RMSE. I'm reluctant to do that until I can get better individual model results. There's a philosophical issue at play; for me, the ensemble approach feels a bit "spray and pray".

In my view, data shortage is the main problem:

My data set was only in the low thousands of matches.
Some teams join the Premier League for just a season and then get relegated - I don't model their history prior to joining the league.
I removed the COVID season of 2020-2021.
I only had team value and disciplinary data for ten or so seasons.
Of course, I only modeled the Premier League.

Football is a low-scoring game, famous for its upsets. It may well be that it's just too random underneath to make useful predictions at the individual match level.

What's next?

I wasn't able to predict EPL results with any great accuracy, but I submitted my report and got my grade. If you want to read my report, you can read it here.

At the end of the 2021 season, I saw some papers published on the COVID effect on match results. I had similar results months before. Perhaps I should have submitted a paper myself.

At some point, I might revive this project if I can get new data. I still occasionally hunt for new data sources, but sadly, I haven't found any. My dreams of retiring to a yacht on the Mediterranean will have to wait.

Engora Data Blog

Monday, November 8, 2021

Football crazy: predicting English Premier League football match results