Showing posts with label machine learning. Show all posts
Showing posts with label machine learning. Show all posts

Monday, November 8, 2021

Football crazy: predicting Premier League football match results

I can get a qualification and be rich!

A long time ago, I was part of a gambling syndicate. A friend of mine had some software that predicted the results of English football (soccer) matches and at the time, betting companies offered fixed-price odds for certain types of bets. My friend noticed his software predicted 3-2 away wins more often than the betting company's odds would suggest. Over the course of a season, we had a 20% return on our gambling investment. 

During the COVID lockdown, I took the opportunity to learn R and did a long course that included a capstone project. I decided to see if I could forecast English Premier League (EPL) matches. If I succeeded, I could get a qualification and get rich too! What's not to like? Here's the story of what I did and what happened.

Premier League data

There's an eighteenth-century recipe for a hare dish that supposedly includes the instructions "First, catch your hare." The first step in any project like this is getting your data.

I got match results going back to the start of the league (1993) from football-data. The early data is only match results, but later data includes red cards and some other measurements.

TransferMarkt has data on transfer fees, foreign-born players, and team age, but the data's only available from 2011.

At the time of the project, I couldn't find any other free data sources. There were and are paid-for sources, but they were way beyond what I was willing to pay.

I knew going into the next phase of the project that this wasn't a very big data set with not that many fields. As it turned out, data was a severely limiting factor.

What factors are important?

I had a set of initial hypotheses for factors that might be important for final match scores, here are most of them:

  • team cost - more expensive teams should win more games
  • team age - younger teams perform better
  • prior points - teams with more points win against teams with fewer points
  • foreign-born players - the more non-English players on the team, the more the team will win
  • previous match results - successful (winning) teams win more matches
  • home-field advantage
  • disciplinary record - red and yellow card history might be an indicator of risk-taking
  • season effects - as the season wears on, teams take more risks to win matches

I found evidence that most of these did in fact have an impact.

Here's strong evidence of home-field advantage. Note how it goes away during the 2020-2021 season when matches were played without fans.

Here's goal difference vs. team cost difference. The more expensive team tends to win.

Here's goal difference vs. mean prior goal difference. Teams that scored more goals before tend to score more goals in their current match.

I found more relationships you can read about if you're interested.

Machine learning

Thinking back to my gambling syndicate, I decided to forecast the score of each match rather than just win/lose/draw. My loss function was the RMSE of the goal difference between the predicted score and the actual score. To avoid COVID oddities, I removed the 2020-2021 season (the price being a smaller data set). Of course, I used a training and holdout dataset and cross-validation. 

The obvious question is, which model machine learning models work? I decided to try a whole bunch of them:

  • Naive mean score model. A simple model that’s just the mean scores of the (training) data set.
  • Generalized Linear Model. A form of ordinary linear regression.
  • Glmnet. Fits lasso and elastic-net regularized generalized linear models.
  • SVM. Support Vector Machines - boundary-based regression. After some experimentation, I selected the svmRadial form of SVM, which uses a non-linear kernel function.
  • KNN. K-nearest neighbors. Given that EPL scores are all in close proximity to one another, we might expect this model to return good results.
  • Neural nets.
  • XGB Linear. This is linear modeling with extreme gradient boosting. Extreme gradient boosting has gathered a lot of attention over the last few years and may be one of the most used machine learning models today.
  • XGB Tree. This is a decision tree model with extreme gradient boosting.
  • Random Forest.

The model results weren't great. For the KNN model, here's how the RMSE for full-time away goals varied with n.

Note the RMSE scale - the lowest it goes to is 1.1 goals and it's plain that adding more n will only take us a little closer to 1.1. Bear in mind, football is a low-scoring game, and being off by 1 goal is a big miss.

It was the same story for random forest.

In fact, it was the same story for all of the models. Here are my final results. My model forecast home goals and away goals.

The naive means model is the simplest and all my sophisticated models could do is give me a few percentage points improvement.

Improving the results

Perhaps the most obvious way forward is combining models to improve RMSE. I'm reluctant to do that until I can get better individual model results. There's a philosophical issue at play; for me, the ensemble approach feels a bit "spray and pray".

In my view, data shortage is the main problem:

  • My data set was only in the low thousands of matches. 
  • Some teams join the Premier League for just a season and then get relegated - I don't model their history prior to joining the league. 
  • I removed the COVID season of 2020-2021. 
  • I only had team value and disciplinary data for ten or so seasons. 
  • Of course, I only modeled the Premier League.

Football is a low-scoring game, famous for its upsets. It may well be that it's just too random underneath to make useful predictions at the individual match level. 

What's next?

I wasn't able to predict EPL results with any great accuracy, but I submitted my report and got my grade. If you want to read my report, you can read it here.

At the end of the 2021 season, I saw some papers published on the COVID effect on match results. I had similar results months before. Perhaps I should have submitted a paper myself.

At some point, I might revive this project if I can get new data. I still occasionally hunt for new data sources, but sadly, I haven't found any. My dreams of retiring to a yacht on the Mediterranean will have to wait.

Monday, November 16, 2020

Geese or enemy aircraft? Receiver Operating Characteristic curves in machine learning

In a strange quirk of history, one of the ways of evaluating machine learning algorithms has its roots in World War II and was subsequently used in a range of disciplines, including psychiatry. Only much later was it used in machine learning, but it kept its original name: receiver operating characteristic (ROC). I'm going to look at the history of this technique and explain what it is and why it's so important.

Is it geese or is it enemy planes?

In 1940, the situation in Britain was dire; the country was engaged in a desperate stand against Hitler.  To weaken the country, and break the will of the people, Nazi aircraft heavily bombed British cities, which was the infamous blitz. I've seen estimates of over 43,000 people killed and of course, there was huge damage to Britain's industrial and cultural infrastructure. Newsreel pictures and propaganda of the time give a view of the devastation. Britain stood alone against the Nazi threat; the Battle of Britain was an existential one.

(Office workers in London going to work through bomb damage. Image source: Wikimedia Commons, License: Public Domain.)

It was vital therefore to detect enemy aircraft as quickly as possible, so the British government used a new technology called radar. Radar receivers had a number of settings, for example, you could turn the gain (amplification) up, but what should the correct settings be? Obviously, you want to correctly identify enemy aircraft, but you don't want to identify a flock of geese as aircraft. If you divert limited resources to chasing wild geese, those resources aren't available to pursue the real threat. This is where the receiver operating characteristic curve comes in. It was a way of deciding the best operating point and/or deciding the best receiver.

Ways of being right and wrong

I've covered this before in a previous blog post about the confusion matrix, so I'll just briefly recap here. There are two ways to be right and two ways to be wrong if we're doing a binary classification (geese/enemy aircraft).


Actual
enemy aircraft geese
Prediction enemy aircraft True Positive False Positive
geese False Negative True Negative

From the counts of the True Positives, False Negatives, etc. we can define two quantities:
\[TPR = \frac{TP}{TP + FN} = 1 - FNR\]
\[= True \ Positive \ Rate, sensitivity, recall, hit rate\]
\[FPR = \frac{FP}{FP + TN} = False \ Positive \ Rate, fall out\]
There are an overly large number of other quantities we can define to help us evaluate classification. But these quantities and numbers are points: they allow us to evaluate an algorithm at a point, or under a single operating condition.

A picture is worth a thousand words

The receiver operating characteristic is a plot of the True Positive Rate vs. the False Positive Rate for different settings. Generically, it looks something like this. 

We get a curve by varying a parameter and measuring FNR and TPR at each of the parameter values. In the case of our World War II radar receiver, the parameter could be gain; increasing the gain changes the trade-off between TPR and FPR. 

Let's imaging a receiver that was just a random selector - choosing geese or enemy aircraft based on the toss of a coin. We would expect it to give us a straight line at \(45^o\). Over time, the random selector would tend to the 50-50 point on the straight line. A real receiver has to do better than chance, so it has to be above the random line. In the chart below, the chance line is the black dotted line.

An ideal receiver has very different properties from the chance line. I've indicated an ideal operating curve in red on the chart below - it always gives a 100% True Positive Rate.

The ROC chart allows us to compare the behavior of different algorithms or different receivers. We could draw out the ROC curve for two receivers for example and choose the best one (the highest line). Here's a graphical representation.

A more mathematical way of doing the same thing is to use the ROC curves, but work out an area under the curve (AUC). An ideal receiver has an AUC of 1 (the red line), but obviously, the higher the AUC, the better.

Machine learning

Classifiers enable us to make categorical decisions based on input data. For example, if a user types 'evening wear' into a shopping site, do you show them cocktail dresses or tuxedos? A machine learning algorithm might use the users' browsing behavior to make a guess about male or female clothing. But how correct is the algorithm? This is where ROC curves can be used to understand the degree of correctness and the appropriate algorithmic settings to use.

Uses of ROC curves outside of machine learning

ROC curves are used in a wide range of disciplines:

More tongue-in-cheek, a group of medical researchers in Sydney, Australia used a ROC to find the optimal walking speed for men over 70 to avoid death. If you're interested, the optimal speed is 0.82m/s. 

Limitations of ROC curves - precision-recall

In a previous blog post, I looked at the confusion matrix and talked about prevalence. The idea is simple: a biased data set can give you a false sense of the accuracy of your data. If your data is biased, a precision-recall plot may be more appropriate.

Going back to the confusion matrix, here's how we define precision and recall.

\[Precision = \frac{TP}{TP + FP}\]
\[Recall = \frac{TP}{TP + FN}\]

Here's a typical precision-recall curve.

Because precision gives us an indication of how relevant the results are, precision-recall curves are often used to evaluate information retrieval algorithms.

Despite the long track record for receiver operating characteristic curves, precision-recall curves may be a better evaluation method. However, old habits die hard and ROC curves still reign.

Don't lose sight of the end goal

ROC and precision-recall curves are all about the same thing: figuring out how useful an algorithm is. There are lots of different ways an algorithm can be wrong, which means different ways of investigating correctness. Don't lose sight of the fact that under the hood, machine learning algorithms are probabilistic.

Reading more

https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S1481803500013336

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466.7628&rep=rep1&type=pdf

Monday, November 9, 2020

Dazed and confused: the confusion matrix and getting it right and wrong

How correct are my (machine learning) algorithms?

In machine learning, we're using algorithms to make predictions about outcomes based on input data. For example, given that a consumer at an online store views dog collars and dog leads, you might show them dog food if they search for 'pet food'. This is fairly obvious, but what if they then searched for evening wear, would you show them cocktail dresses or tuxedos? 

(The confusion matrix can be confusing. Image source: Pixabay. Author: Erika Wittlieb. License: Pixabay license)

The confusion matrix is about quantifying the correctness of algorithms, but it's not sufficient of itself. Fortunately, there are quantities we can derive from the confusion matrix that will show up certain types of error as we'll see.

The confusion matrix

I'm going to use the example of an online store that sells pet products. Imagine an algorithm that tries to decide if a consumer has a cat or not. There are two ways the algorithm can be right and two ways the algorithm can be wrong. I'll draw it out as a matrix so you can see it a bit more easily. In reality, we might put counts of false negatives, etc. in the matrix.


Actual
cat not cat
Prediction cat True Positive False Positive
not cat False Negative True Negative

All of this sounds great. It looks like we can define some rates and be done.  Let's start with some definitions and see where we get to.

We might want to know often we said it was a cat when it actually was a cat, in other words, when it actually was positive, how often did we say it was positive. This is called the True Positive Rate (TPR), which is defined like this (where FNR is the False Negative Rate and is similarly defined):

\[TPR = \frac{TP}{TP + FN} = 1 - FNR = sensitivity, recall, hit rate\]

On the flip side, how often did we say not cat when it really was not cat (how often did we say negative when it really was negative):

\[TNR = \frac{TN}{TN + FP} = 1 - FPR = specificity, selectivity\]

There are a whole bunch of other metrics we can similarly define and I won't belabor the point by defining them all here (it seems as if every possible combination of true/false positive/negative has a name). I'm just going to show some of them in this table to give you a flavor.


Actual Parameter
cat not cat
Prediction cat True Positive False Positive Precision (positive predictive value)
\[\frac{TP}{FP + TP}\]
False Discovery Rate
\[\frac{FP}{FP + TP}\]
not cat False Negative True Negative False Omission Rate
\[\frac{FN}{FN + TN}\]
Parameter True Positive Rate (Recall, Sensitivity)
\[\frac{TP}{TP + FN}\]
True Negative Rate (specificity)
\[\frac{TN}{TN + FP}\]
False Positive Rate \[\frac{FP}{TN + FP}\]

Be careful here; it's easy to get caught up on the names and definitions. You should focus on what this means for the correctness of your results.

We can use these metrics to help decide if our algorithms are good or not - but there are other things we need to consider.

Prevalence

One of the major issues in algorithmic bias has been prevalence. It's possible to get what seems like highly accurate results but for the results to be deeply biased by the underlying data. Again, the confusion matrix can help.

We can define the accuracy of an algorithm using this formula:

\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

Let's imagine we're getting a really great accuracy. We're really good at saying it's a cat when it really is a cat. Doesn't this sound like a really great algorithm? Think about your answer before moving on.

The trouble is, it could be because almost all the underlying data is cat data. Imagine 95% of the data was cats and we said cat 100% of the time. Some of the metrics in the table would look wonderful. We'd get a 95% accuracy for example!

A version of this has happened in real life with awful consequences. Some of the human datasets that machine learning algorithms are trained on are biased: for example, they are disproportionally images of white people, or even worse, white males. In 2015, Google released a photo app that classified images. It misclassified pictures of black people as Gorillas. This is just horrendous on multiple levels. The problem here might be that their training data set didn't include many pictures of non-white people. The labeling algorithms were accurate, just so long as you're white.

To test for bias in the dataset, we look at a number called prevalence which represents the fraction of the data set that's in a category. In our example, the prevalence of cats would be 0.95 and non-cats 0.05, which reveals a huge bias towards cats. This might be OK if the site was aimed at cat-lovers, but not so great if the site was trying to grow non-cat sales.

If you're doing any machine learning work for public consumption, you must consider prevalence.

One number to bind them all

Precision, recall, false discovery rate... there are lots of numbers here and it gets confusing. Why don't we create one metric that binds them all together? We would like a score of 1 for this metric to represent perfection, and 0 to represent total failure. Fortunately, there is such a metric and it's called the \(F_1\) score.

I won't go into the derivation here, but I will give you the formula:

\[F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}\]

(for those of you who want a bit more, it's the harmonic mean of precision and recall). 

Even the \(F_1\) score isn't the end of it. It weighs precision and recall equally, but in reality, that might not be what we want. For example, we might consider a false positive much worse than a false negative (sending an innocent person to jail rather than setting a guilty person free for example). In these kinds of cases, there's a weighting factor \(\beta\) we can apply.

We can define \(\beta\) as:

\[\beta = \frac{TP + FP}{TP + FN}\] and we can create a revised F score as:

\[F_\beta =  \frac{(1 + \beta^2) TP}{(1 + \beta^2) TP + \beta^2FN + FP}\]

All this looks a bit familiar

By the way, there are very obvious parallels here to statistics, specifically, \(\alpha\), \(\beta\), Type I, and Type II errors. We're getting quite close to statistical tests with some of these processes, which probably isn't surprising. Sadly, similar things are called by different names in different disciplines, a nice way to keep barriers to entry high.

Snakes and pirates

Both Python and R have libraries you can use that will give you the confusion matrix and quantities derived from it. In Python, you should look at confusion_matrix in scikit-learn. In R, you need confusionMatrix from the caret package.

What's next?

The confusion matrix is just the start. There are a several techniques based on it that you can use to effectively evaluate algorithms. In a future blog post, I'm going to look at something called Receiver Operating Characteristic which has a very interesting history.  The thought I want to leave you with is a simple one: the confusion matrix is a means of representing different ways of being right and wrong. You can use quantities derived from the matrix to indicate bias and to indicate correctness.