How correct are my (machine learning) algorithms?

In machine learning, we're using algorithms to make predictions about outcomes based on input data. For example, given that a consumer at an online store views dog collars and dog leads, you might show them dog food if they search for 'pet food'. This is fairly obvious, but what if they then searched for evening wear, would you show them cocktail dresses or tuxedos?

(The confusion matrix can be confusing. Image source: Pixabay. Author: Erika Wittlieb. License: Pixabay license)

The confusion matrix is about quantifying the correctness of algorithms, but it's not sufficient of itself. Fortunately, there are quantities we can derive from the confusion matrix that will show up certain types of error as we'll see.

The confusion matrix

I'm going to use the example of an online store that sells pet products. Imagine an algorithm that tries to decide if a consumer has a cat or not. There are two ways the algorithm can be right and two ways the algorithm can be wrong. I'll draw it out as a matrix so you can see it a bit more easily. In reality, we might put counts of false negatives, etc. in the matrix.

		Actual
		cat	not cat
Prediction	cat	True Positive	False Positive
Prediction	not cat	False Negative	True Negative

All of this sounds great. It looks like we can define some rates and be done. Let's start with some definitions and see where we get to.

We might want to know often we said it was a cat when it actually was a cat, in other words, when it actually was positive, how often did we say it was positive. This is called the True Positive Rate (TPR), which is defined like this (where FNR is the False Negative Rate and is similarly defined):

\[TPR = \frac{TP}{TP + FN} = 1 - FNR = sensitivity, recall, hit rate\]

On the flip side, how often did we say not cat when it really was not cat (how often did we say negative when it really was negative):

\[TNR = \frac{TN}{TN + FP} = 1 - FPR = specificity, selectivity\]

There are a whole bunch of other metrics we can similarly define and I won't belabor the point by defining them all here (it seems as if every possible combination of true/false positive/negative has a name). I'm just going to show some of them in this table to give you a flavor.

		Actual		Parameter
		cat	not cat	Parameter
Prediction	cat	True Positive	False Positive	Precision (positive predictive value) \[\frac{TP}{FP + TP}\] False Discovery Rate \[\frac{FP}{FP + TP}\]
Prediction	not cat	False Negative	True Negative	False Omission Rate \[\frac{FN}{FN + TN}\]
Parameter		True Positive Rate (Recall, Sensitivity) \[\frac{TP}{TP + FN}\]	True Negative Rate (specificity) \[\frac{TN}{TN + FP}\] False Positive Rate \[\frac{FP}{TN + FP}\]

Be careful here; it's easy to get caught up on the names and definitions. You should focus on what this means for the correctness of your results.

We can use these metrics to help decide if our algorithms are good or not - but there are other things we need to consider.

Prevalence

One of the major issues in algorithmic bias has been prevalence. It's possible to get what seems like highly accurate results but for the results to be deeply biased by the underlying data. Again, the confusion matrix can help.

We can define the accuracy of an algorithm using this formula:

\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

Let's imagine we're getting a really great accuracy. We're really good at saying it's a cat when it really is a cat. Doesn't this sound like a really great algorithm? Think about your answer before moving on.

The trouble is, it could be because almost all the underlying data is cat data. Imagine 95% of the data was cats and we said cat 100% of the time. Some of the metrics in the table would look wonderful. We'd get a 95% accuracy for example!

A version of this has happened in real life with awful consequences. Some of the human datasets that machine learning algorithms are trained on are biased: for example, they are disproportionally images of white people, or even worse, white males. In 2015, Google released a photo app that classified images. It misclassified pictures of black people as Gorillas. This is just horrendous on multiple levels. The problem here might be that their training data set didn't include many pictures of non-white people. The labeling algorithms were accurate, just so long as you're white.

To test for bias in the dataset, we look at a number called prevalence which represents the fraction of the data set that's in a category. In our example, the prevalence of cats would be 0.95 and non-cats 0.05, which reveals a huge bias towards cats. This might be OK if the site was aimed at cat lovers, but not so great if the site was trying to grow non-cat sales.

If you're doing any machine learning work for public consumption, you must consider prevalence.

One number to bind them all

Precision, recall, false discovery rate... there are lots of numbers here and it gets confusing. Why don't we create one metric that binds them all together? We would like a score of 1 for this metric to represent perfection, and 0 to represent total failure. Fortunately, there is such a metric and it's called the \(F_1\) score.

I won't go into the derivation here, but I will give you the formula:

\[F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}\]

(for those of you who want a bit more, it's the harmonic mean of precision and recall).

Even the \(F_1\) score isn't the end of it. It weighs precision and recall equally, but in reality, that might not be what we want. For example, we might consider a false positive much worse than a false negative (sending an innocent person to jail rather than setting a guilty person free for example). In these kinds of cases, there's a weighting factor \(\beta\) we can apply.

We can define \(\beta\) as:

\[\beta = \frac{TP + FP}{TP + FN}\] and we can create a revised F score as:

\[F_\beta = \frac{(1 + \beta^2) TP}{(1 + \beta^2) TP + \beta^2FN + FP}\]

All this looks a bit familiar

By the way, there are very obvious parallels here to statistics, specifically, \(\alpha\), \(\beta\), Type I, and Type II errors. We're getting quite close to statistical tests with some of these processes, which probably isn't surprising. Sadly, similar things are called by different names in different disciplines, a nice way to keep barriers to entry high.

Snakes and pirates

Both Python and R have libraries you can use that will give you the confusion matrix and quantities derived from it. In Python, you should look at confusion_matrix in scikit-learn. In R, you need confusionMatrix from the caret package.

What's next?

The confusion matrix is just the start. There are several techniques based on it that you can use to effectively evaluate algorithms. In a future blog post, I'm going to look at something called Receiver Operating Characteristic which has a very interesting history. The thought I want to leave you with is a simple one: the confusion matrix is a means of representing different ways of being right and wrong. You can use quantities derived from the matrix to indicate bias and to indicate correctness.

Engora Data Blog

Monday, November 9, 2020

Dazed and confused: the confusion matrix and getting it right and wrong