In a strange quirk of history, one of the ways of evaluating machine learning algorithms has its roots in World War II and was subsequently used in a range of disciplines, including psychiatry. Only much later was it used in machine learning, but it kept its original name: receiver operating characteristic (ROC). I'm going to look at the history of this technique and explain what it is and why it's so important.
Is it geese or is it enemy planes?
In 1940, the situation in Britain was dire; the country was engaged in a desperate stand against Hitler. To weaken the country, and break the will of the people, Nazi aircraft heavily bombed British cities, which was the infamous blitz. I've seen estimates of over 43,000 people killed and of course, there was huge damage to Britain's industrial and cultural infrastructure. Newsreel pictures and propaganda of the time give a view of the devastation. Britain stood alone against the Nazi threat; the Battle of Britain was an existential one.
It was vital therefore to detect enemy aircraft as quickly as possible, so the British government used a new technology called radar. Radar receivers had a number of settings, for example, you could turn the gain (amplification) up, but what should the correct settings be? Obviously, you want to correctly identify enemy aircraft, but you don't want to identify a flock of geese as aircraft. If you divert limited resources to chasing wild geese, those resources aren't available to pursue the real threat. This is where the receiver operating characteristic curve comes in. It was a way of deciding the best operating point and/or deciding the best receiver.
Ways of being right and wrong
I've covered this before in a previous blog post about the confusion matrix, so I'll just briefly recap here. There are two ways to be right and two ways to be wrong if we're doing a binary classification (geese/enemy aircraft).
|Prediction||enemy aircraft||True Positive||False Positive|
|geese||False Negative||True Negative|
From the counts of the True Positives, False Negatives, etc. we can define two quantities:
There are an overly large number of other quantities we can define to help us evaluate classification. But these quantities and numbers are points: they allow us to evaluate an algorithm at a point, or under a single operating condition.
A picture is worth a thousand words
The receiver operating characteristic is a plot of the True Positive Rate vs. the False Positive Rate for different settings. Generically, it looks something like this.
We get a curve by varying a parameter and measuring FNR and TPR at each of the parameter values. In the case of our World War II radar receiver, the parameter could be gain; increasing the gain changes the trade-off between TPR and FPR.
Let's imagine a receiver that was just a random selector - choosing geese or enemy aircraft based on the toss of a coin. We would expect it to give us a straight line at \(45^o\). Over time, the random selector would tend to the 50-50 point on the straight line. A real receiver has to do better than chance, so it has to be above the random line. In the chart below, the chance line is the black dotted line.
An ideal receiver has very different properties from the chance line. I've indicated an ideal operating curve in red on the chart below - it always gives a 100% True Positive Rate.
The ROC chart allows us to compare the behavior of different algorithms or different receivers. We could draw out the ROC curve for two receivers for example and choose the best one (the highest line). Here's a graphical representation.
A more mathematical way of doing the same thing is to use the ROC curves, but work out an area under the curve (AUC). An ideal receiver has an AUC of 1 (the red line), but obviously, the higher the AUC, the better.
Classifiers enable us to make categorical decisions based on input data. For example, if a user types 'evening wear' into a shopping site, do you show them cocktail dresses or tuxedos? A machine learning algorithm might use the users' browsing behavior to make a guess about male or female clothing. But how correct is the algorithm? This is where ROC curves can be used to understand the degree of correctness and the appropriate algorithmic settings to use.
Uses of ROC curves outside of machine learning
ROC curves are used in a wide range of disciplines:
- Medical test diagnosis - for example, studying the effect of BMI on health
- Psychiatry - placing people into normal and abnormal groups
- Chemical tests
- Dementia screening
- Radar development
More tongue-in-cheek, a group of medical researchers in Sydney, Australia used a ROC to find the optimal walking speed for men over 70 to avoid death. If you're interested, the optimal speed is 0.82m/s.
Limitations of ROC curves - precision-recall
In a previous blog post, I looked at the confusion matrix and talked about prevalence. The idea is simple: a biased data set can give you a false sense of the accuracy of your data. If your data is biased, a precision-recall plot may be more appropriate.
Going back to the confusion matrix, here's how we define precision and recall.
Here's a typical precision-recall curve.
Because precision gives us an indication of how relevant the results are, precision-recall curves are often used to evaluate information retrieval algorithms.
Despite the long track record for receiver operating characteristic curves, precision-recall curves may be a better evaluation method. However, old habits die hard and ROC curves still reign.
Don't lose sight of the end goal
ROC and precision-recall curves are all about the same thing: figuring out how useful an algorithm is. There are lots of different ways an algorithm can be wrong, which means different ways of investigating correctness. Don't lose sight of the fact that under the hood, machine learning algorithms are probabilistic.