Monday, November 16, 2020

Geese or enemy aircraft? Receiver Operating Characteristic curves in machine learning

In a strange quirk of history, one of the ways of evaluating machine learning algorithms has its roots in World War II and was subsequently used in a range of disciplines, including psychiatry. Only much later was it used in machine learning, but it kept its original name: receiver operating characteristic (ROC). I'm going to look at the history of this technique and explain what it is and why it's so important.

Is it geese or is it enemy planes?

In 1940, the situation in Britain was dire; the country was engaged in a desperate stand against Hitler. To weaken the country, and break the will of the people, Nazi aircraft heavily bombed British cities in what became known as the Blitz. I've seen estimates of over 43,000 people killed and of course, there was huge damage to Britain's industrial and cultural infrastructure. Newsreel pictures and propaganda of the time give a view of the devastation. Britain stood alone against the Nazi threat; the Battle of Britain was an existential one.

(Office workers in London going to work through bomb damage. Image source: Wikimedia Commons, License: Public Domain.)

It was vital therefore to detect enemy aircraft as quickly as possible, so the British government used a new technology called radar. Radar receivers had a number of settings; for example, you could turn the gain (amplification) up. But what should the correct settings be? Obviously, you want to correctly identify enemy aircraft, but you don't want to identify a flock of geese as aircraft. If you divert limited resources to chasing wild geese, those resources aren't available to pursue the real threat. This is where the receiver operating characteristic curve comes in. It was a way of deciding the best operating point and/or deciding the best receiver.

Ways of being right and wrong

I've covered this before in a previous blog post about the confusion matrix, so I'll just briefly recap here. There are two ways to be right and two ways to be wrong if we're doing a binary classification (geese/enemy aircraft).

                           | Actual: enemy aircraft | Actual: geese
Prediction: enemy aircraft | True Positive          | False Positive
Prediction: geese          | False Negative         | True Negative

From the counts of the True Positives, False Negatives, etc. we can define two quantities:

\[TPR = \frac{TP}{TP + FN} = 1 - FNR\]
\[= True \ Positive \ Rate, sensitivity, recall, hit rate\]
\[FPR = \frac{FP}{FP + TN} = False \ Positive \ Rate, fall out\]

There are an overly large number of other quantities we can define to help us evaluate classification. But these quantities and numbers are points: they allow us to evaluate an algorithm at a point, or under a single operating condition.

A picture is worth a thousand words

The receiver operating characteristic is a plot of the True Positive Rate vs. the False Positive Rate for different settings. Generically, it looks something like this. 

We get a curve by varying a parameter and measuring FPR and TPR at each of the parameter values. In the case of our World War II radar receiver, the parameter could be gain; increasing the gain changes the trade-off between TPR and FPR.
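To make the idea of "varying a parameter" concrete, here's a minimal sketch in Python. It assumes the classifier gives each observation a score and that the parameter we vary is a threshold on that score (standing in for the radar gain). The function name `roc_points` and the scores and labels are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, starting at (0, 0).

    scores: classifier scores, higher means "more likely enemy aircraft"
    labels: 1 for a real enemy aircraft, 0 for geese
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    # Lower the threshold step by step; each step flags more observations.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical data: eight radar contacts, four of them real aircraft.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Each printed pair is one point on the ROC curve. The curve starts at (0, 0), where nothing is flagged, and ends at (1, 1), where everything is flagged.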

Let's imagine a receiver that was just a random selector - choosing geese or enemy aircraft based on the toss of a coin. We would expect it to give us a straight line at \(45^\circ\). Over time, the random selector would tend to the 50-50 point on the straight line. A real receiver has to do better than chance, so it has to be above the random line. In the chart below, the chance line is the black dotted line.

An ideal receiver has very different properties from the chance line. I've indicated an ideal operating curve in red on the chart below - it always gives a 100% True Positive Rate, even when the False Positive Rate is zero.

The ROC chart allows us to compare the behavior of different algorithms or different receivers. We could draw out the ROC curve for two receivers for example and choose the best one (the highest line). Here's a graphical representation.

A more mathematical way of doing the same thing is to use the ROC curves, but work out an area under the curve (AUC). An ideal receiver has an AUC of 1 (the red line), but obviously, the higher the AUC, the better.
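Here's a rough sketch of how the area could be worked out, using the trapezoidal rule on a list of (FPR, TPR) points. The function name `auc` and the three sets of points (chance, ideal, and a made-up example receiver) are my own illustrations.

```python
def auc(points):
    """Area under a ROC curve given as (FPR, TPR) points, by the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Width of the strip times its average height.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


chance = [(0.0, 0.0), (1.0, 1.0)]              # the 45-degree line
ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # 100% TPR at every FPR
example = [(0.0, 0.0), (0.0, 0.5), (0.25, 0.5), (0.25, 0.75),
           (0.5, 0.75), (0.5, 1.0), (1.0, 1.0)]  # hypothetical receiver

print("chance :", auc(chance))
print("ideal  :", auc(ideal))
print("example:", auc(example))
```

The chance line gives an area of 0.5 and the ideal receiver gives 1, so a real receiver should land somewhere in between.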

Machine learning

Classifiers enable us to make categorical decisions based on input data. For example, if a user types 'evening wear' into a shopping site, do you show them cocktail dresses or tuxedos? A machine learning algorithm might use the user's browsing behavior to make a guess about male or female clothing. But how correct is the algorithm? This is where ROC curves can be used to understand the degree of correctness and the appropriate algorithmic settings to use.

Uses of ROC curves outside of machine learning

ROC curves are used in a wide range of disciplines:

More tongue-in-cheek, a group of medical researchers in Sydney, Australia used a ROC to find the optimal walking speed for men over 70 to avoid death. If you're interested, the optimal speed is 0.82m/s. 

Limitations of ROC curves - precision-recall

In a previous blog post, I looked at the confusion matrix and talked about prevalence. The idea is simple: a biased data set can give you a false sense of the accuracy of your data. If your data is biased, a precision-recall plot may be more appropriate.

Going back to the confusion matrix, here's how we define precision and recall.

\[Precision = \frac{TP}{TP + FP}\]
\[Recall = \frac{TP}{TP + FN}\]

Here's a typical precision-recall curve.

Because precision gives us an indication of how relevant the results are, precision-recall curves are often used to evaluate information retrieval algorithms.
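To show why a biased data set can flatter a ROC-style view, here's a small sketch with made-up counts: 10 real positives hidden among 990 negatives. The helper `rates` and all the numbers are hypothetical.

```python
def rates(tp, fp, fn, tn):
    """Precision, recall (TPR), and False Positive Rate from confusion matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Hypothetical, heavily imbalanced data: 10 positives, 990 negatives.
rare = rates(tp=8, fp=40, fn=2, tn=950)

for name, value in rare.items():
    print(f"{name}: {value:.3f}")
```

With these counts, the recall is high and the False Positive Rate is small, so the point would sit comfortably in the top left of a ROC chart. The precision tells a different story: most of the things flagged as positive are wrong.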

Despite the long track record for receiver operating characteristic curves, precision-recall curves may be a better evaluation method. However, old habits die hard and ROC curves still reign.

Don't lose sight of the end goal

ROC and precision-recall curves are all about the same thing: figuring out how useful an algorithm is. There are lots of different ways an algorithm can be wrong, which means different ways of investigating correctness. Don't lose sight of the fact that under the hood, machine learning algorithms are probabilistic.


Monday, November 9, 2020

Dazed and confused: the confusion matrix and getting it right and wrong

How correct are my (machine learning) algorithms?

In machine learning, we're using algorithms to make predictions about outcomes based on input data. For example, given that a consumer at an online store views dog collars and dog leads, you might show them dog food if they search for 'pet food'. This is fairly obvious, but what if they then searched for evening wear, would you show them cocktail dresses or tuxedos? 

(The confusion matrix can be confusing. Image source: Pixabay. Author: Erika Wittlieb. License: Pixabay license)

The confusion matrix is about quantifying the correctness of algorithms, but it's not sufficient of itself. Fortunately, there are quantities we can derive from the confusion matrix that will show up certain types of error as we'll see.

The confusion matrix

I'm going to use the example of an online store that sells pet products. Imagine an algorithm that tries to decide if a consumer has a cat or not. There are two ways the algorithm can be right and two ways the algorithm can be wrong. I'll draw it out as a matrix so you can see it a bit more easily. In reality, we might put counts of false negatives, etc. in the matrix.

                    | Actual: cat    | Actual: not cat
Prediction: cat     | True Positive  | False Positive
Prediction: not cat | False Negative | True Negative

All of this sounds great. It looks like we can define some rates and be done.  Let's start with some definitions and see where we get to.

We might want to know how often we said it was a cat when it actually was a cat; in other words, when it actually was positive, how often did we say it was positive? This is called the True Positive Rate (TPR), which is defined like this (where FNR is the False Negative Rate and is similarly defined):

\[TPR = \frac{TP}{TP + FN} = 1 - FNR = sensitivity, recall, hit rate\]

On the flip side, how often did we say not cat when it really was not cat (how often did we say negative when it really was negative):

\[TNR = \frac{TN}{TN + FP} = 1 - FPR = specificity, selectivity\]
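As a minimal sketch of how these counts and rates might be worked out from raw predictions, here's some plain Python. The function name `confusion_counts` and the eight example customers are hypothetical.

```python
def confusion_counts(actual, predicted, positive="cat"):
    """Count true/false positives/negatives for a binary classification."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # said cat, was cat
        elif p == positive:
            fp += 1   # said cat, was not cat
        elif a == positive:
            fn += 1   # said not cat, was cat
        else:
            tn += 1   # said not cat, was not cat
    return tp, fp, fn, tn


# Hypothetical data for eight customers.
actual    = ["cat", "cat", "cat", "not cat", "not cat", "cat", "not cat", "not cat"]
predicted = ["cat", "cat", "not cat", "cat", "not cat", "cat", "not cat", "not cat"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
tpr = tp / (tp + fn)  # sensitivity, recall, hit rate
tnr = tn / (tn + fp)  # specificity, selectivity
print(f"TP={tp} FP={fp} FN={fn} TN={tn} TPR={tpr:.2f} TNR={tnr:.2f}")
```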

There are a whole bunch of other metrics we can similarly define and I won't belabor the point by defining them all here (it seems as if every possible combination of true/false positive/negative has a name). I'm just going to show some of them in this table to give you a flavor.

                    | Actual: cat | Actual: not cat | Parameter
Prediction: cat     | True Positive | False Positive | Precision (positive predictive value) \(\frac{TP}{FP + TP}\); False Discovery Rate \(\frac{FP}{FP + TP}\)
Prediction: not cat | False Negative | True Negative | False Omission Rate \(\frac{FN}{FN + TN}\)
Parameter           | True Positive Rate (Recall, Sensitivity) \(\frac{TP}{TP + FN}\) | True Negative Rate (specificity) \(\frac{TN}{TN + FP}\); False Positive Rate \(\frac{FP}{TN + FP}\) |

Be careful here; it's easy to get caught up on the names and definitions. You should focus on what this means for the correctness of your results.

We can use these metrics to help decide if our algorithms are good or not - but there are other things we need to consider.


One of the major issues in algorithmic bias has been prevalence. It's possible to get what seems like highly accurate results but for the results to be deeply biased by the underlying data. Again, the confusion matrix can help.

We can define the accuracy of an algorithm using this formula:

\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

Let's imagine we're getting a really great accuracy. We're really good at saying it's a cat when it really is a cat. Doesn't this sound like a really great algorithm? Think about your answer before moving on.

The trouble is, it could be because almost all the underlying data is cat data. Imagine 95% of the data was cats and we said cat 100% of the time. Some of the metrics in the table would look wonderful. We'd get a 95% accuracy for example!

A version of this has happened in real life with awful consequences. Some of the human datasets that machine learning algorithms are trained on are biased: for example, they are disproportionately images of white people, or even worse, white males. In 2015, Google released a photo app that classified images. It misclassified pictures of black people as gorillas. This is just horrendous on multiple levels. The problem here might be that their training data set didn't include many pictures of non-white people. The labeling algorithms were accurate, just so long as you were white.

To test for bias in the dataset, we look at a number called prevalence which represents the fraction of the data set that's in a category. In our example, the prevalence of cats would be 0.95 and non-cats 0.05, which reveals a huge bias towards cats. This might be OK if the site was aimed at cat lovers, but not so great if the site was trying to grow non-cat sales.
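Here's the cat example worked through as a short sketch. It assumes 100 customers, 95 of whom have cats, and an "algorithm" that says cat every single time; the counts follow from that assumption.

```python
# Always predict "cat" on data that is 95% cats (hypothetical counts).
tp, fp, fn, tn = 95, 5, 0, 0
total = tp + fp + fn + tn

accuracy = (tp + tn) / total      # looks great
prevalence = (tp + fn) / total    # fraction of the data that really is cat
tpr = tp / (tp + fn)              # we never miss a cat
tnr = tn / (tn + fp)              # we never correctly say "not cat"

print(f"accuracy   = {accuracy:.2f}")
print(f"prevalence = {prevalence:.2f}")
print(f"TPR        = {tpr:.2f}")
print(f"TNR        = {tnr:.2f}")
```

The accuracy and True Positive Rate look wonderful, but the True Negative Rate is zero. The prevalence is the warning sign that the accuracy is mostly telling us about the data, not the algorithm.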

If you're doing any machine learning work for public consumption, you must consider prevalence.

One number to bind them all

Precision, recall, false discovery rate... there are lots of numbers here and it gets confusing. Why don't we create one metric that binds them all together? We would like a score of 1 for this metric to represent perfection, and 0 to represent total failure. Fortunately, there is such a metric and it's called the \(F_1\) score.

I won't go into the derivation here, but I will give you the formula:

\[F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}\]

(for those of you who want a bit more, it's the harmonic mean of precision and recall). 

Even the \(F_1\) score isn't the end of it. It weighs precision and recall equally, but in reality, that might not be what we want. For example, we might consider a false positive much worse than a false negative (sending an innocent person to jail rather than setting a guilty person free for example). In these kinds of cases, there's a weighting factor \(\beta\) we can apply.

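\(\beta\) is a weight we choose, not something we calculate from the data: it says recall is \(\beta\) times as important as precision. A \(\beta\) below 1 favors precision (it penalizes false positives more), and a \(\beta\) above 1 favors recall (it penalizes false negatives more). With \(\beta\) chosen, we can create a revised F score as: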

\[F_\beta =  \frac{(1 + \beta^2) TP}{(1 + \beta^2) TP + \beta^2FN + FP}\]
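Here's a short sketch of the two F score formulas in code. It treats \(\beta\) as a weight picked by the analyst (that's an assumption of this sketch), with \(\beta = 1\) giving the \(F_1\) score. The function name `f_beta` and the counts are made up.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F score from confusion matrix counts; beta=1 gives the F1 score."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical counts.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(tp, fp, fn)
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f"F1            = {f1:.4f}")
print(f"harmonic mean = {harmonic_mean:.4f}")
print(f"F0.5          = {f_beta(tp, fp, fn, beta=0.5):.4f}")  # leans towards precision
print(f"F2            = {f_beta(tp, fp, fn, beta=2.0):.4f}")  # leans towards recall
```

In this example precision is higher than recall, so the score goes up when \(\beta\) is below 1 and down when \(\beta\) is above 1.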

All this looks a bit familiar

By the way, there are very obvious parallels here to statistics, specifically, \(\alpha\), \(\beta\), Type I, and Type II errors. We're getting quite close to statistical tests with some of these processes, which probably isn't surprising. Sadly, similar things are called by different names in different disciplines, a nice way to keep barriers to entry high.

Snakes and pirates

Both Python and R have libraries you can use that will give you the confusion matrix and quantities derived from it. In Python, you should look at confusion_matrix in scikit-learn. In R, you need confusionMatrix from the caret package.

What's next?

The confusion matrix is just the start. There are several techniques based on it that you can use to effectively evaluate algorithms. In a future blog post, I'm going to look at something called Receiver Operating Characteristic which has a very interesting history.  The thought I want to leave you with is a simple one: the confusion matrix is a means of representing different ways of being right and wrong. You can use quantities derived from the matrix to indicate bias and to indicate correctness.

Monday, November 2, 2020

The null hypothesis test

What's null hypothesis testing?

In business, as in many other fields, we have to make decisions in the face of uncertainty. Does this technology improve conversion? Is the new sales process working? Is the new machine tool improving quality? Almost never are the answers to these questions absolutely certain; there will be probabilities we have to trade off to make our decision.

(Two hypotheses battling it out for supremacy. Image source: Wikimedia Commons. Author: Pierdante Romei. License: Creative Commons.)

Null hypothesis tests are a set of techniques that enable us to reach probabilistic conclusions in an unbiased way. They provide a level playing field to decide if an effect is there or not.

Although null hypothesis tests are widely taught in statistics classes, many people who've come into data science from other disciplines aren't familiar with the core ideas. Similarly, people with business backgrounds sometimes end up evaluating A/B tests where the correct interpretation of null hypothesis tests is critical to understanding what's going on. 

I’m going to explain to you what null hypothesis testing is and some of the concepts needed to implement and understand it.

What result are you testing for?

To put it simply, a null hypothesis test is a test of whether there is an effect of a certain size present or not. The null hypothesis is that there is no effect, and the alternate hypothesis is that there is an effect. 

At its heart, the test is about probability and not certainty. We can’t say for sure if there is an effect or not, what we can say is the probability of there being an effect.  But probabilities are limited and we have to make binary go/no-go decisions - so null hypothesis tests include the idea of probability thresholds for deciding whether something is there or not.

To illustrate the use of a null hypothesis test, I’m going to use a famous example, that of the lady tasting tea. 

In a research lab, there was a woman who claimed she could tell the difference between cups of tea prepared in one of two ways:

  • The milk poured into the cup first and then the tea poured in
  • The tea poured first and then the milk poured in.
(Image source: Wikimedia Commons Artist: Ian Smith License: Creative Commons)

The researcher decided to do a test of her abilities by asking her to taste multiple cups of tea and state how she thought each cup had been prepared. Of course, it’s possible she could be 100% successful by chance alone. 

We can set up a null hypothesis test using these hypotheses:

  • The null hypothesis is the most conservative option. Here it’s that she can’t taste the difference. More specifically, her success rate is indistinguishable from random chance.
  • The alternative hypothesis is that she can tell the difference. More specifically, her success rate is significantly different from random chance.
Let's define some quantities:
  • \( p_T \) - the proportion of cups of tea she correctly got
  • \( p_C \)  - the proportion of cups of tea she would be expected to get by chance alone (by guessing)
We can write the null and alternative hypotheses as:

  • \( H_0: p_T = p_C\) 
  • \( H_1: p_T \neq p_C\) 

But – the hypotheses in this form aren't enough. Will we insist she has to be correct every single time? Is there some threshold we expect her to reach before we accept her claim?

The null hypothesis is the first step in setting up a statistical test, but to make it useful, we have to go a step further and set up thresholds. To do this, we have to understand different types of errors.

Error types

To make things easy, we’ll call 'milk first' a positive and 'milk second' a negative.

For our lady tasting tea, there are four possibilities:

  • She can say ‘milk first’ when it was 'milk first' – a true positive
  • She can say ‘milk first’ when it wasn’t 'milk first' – a false positive (also known as a Type I error)
  • She can say ‘milk second’ when it was 'milk second' – a true negative
  • She can say ‘milk second’ when it wasn’t 'milk second' – a false negative (also known as a Type II error)

This is usually expressed as a table like the one below.

                                               | Null hypothesis is true | Null hypothesis is false
Decision about null hypothesis: fail to reject | True negative; correct inference; probability threshold = 1 - \( \alpha \) | False negative; Type II error; probability threshold = \( \beta \)
Decision about null hypothesis: reject         | False positive; Type I error; probability threshold = \( \alpha \) | True positive; correct inference; probability threshold = power = 1 - \( \beta \)

We can assign probabilities to each of these outcomes. As you can see, there are two numbers that are important here, \(\alpha\) and \(\beta\); however, in practice, we consider \(\alpha\) and 1-\(\beta\) as the numbers of importance. \(\alpha\) is called significance, and 1-\(\beta\) is called power. We can set values for each of them prior to the test. By convention, \(\alpha\) is usually 0.05, and 1-\(\beta \geq \) 0.80.

Test results, test size, and p-values

Our lady could guess correctly by chance alone. We have to set up the test so a positive conclusion due to randomness is unlikely, hence the use of thresholds. The easiest way to do this is to set the test size correctly, i.e. set the number of cups of tea. Through some math I won't go into, we can use \(\alpha\), (1-\(\beta\)), and the effect size to set the sample size. The effect size, in this case, is her ability to detect how the cup of tea was prepared above and beyond what would be expected by chance. For example, we might run a test to see if she was 20% better than chance.

To evaluate the test, we calculate a p-value from the test results. The p-value is the probability of getting a result at least as extreme as the one we observed if only chance were at work (that is, if the null hypothesis were true). Because this is so important, I'm going to explain it again using other words. Let's imagine the lady tasting tea was guessing. By guessing alone, she could get between 0% and 100% correct. We know the probability for each percentage. We know it's very unlikely she'll get 100% or 0% by guesswork, but more likely she'll get 50%. For the score she got, we can work out the probability of her getting this score (or higher) through chance alone. Let's say there was a 3% chance she could have gotten her score by guessing alone. Is this proof she's not guessing?

We compare the p-value to our \( \alpha\) threshold to decide whether to reject the null hypothesis. Let’s say our p-value was 0.03 and our \( \alpha \) value was 0.05; because 0.03 < 0.05, we reject the null hypothesis. In other words, we would accept that the lady was not guessing.
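To make the p-value calculation less abstract, here's a minimal sketch. It assumes each cup is an independent guess that's right half the time, so the number of correct answers follows a binomial distribution; that's a simplification I'm making for illustration, and the 20 cups and 15 correct answers are made-up numbers.

```python
from math import comb


def p_value_at_least(correct, cups, p_chance=0.5):
    """P(X >= correct) when each of `cups` guesses is right with probability p_chance."""
    return sum(
        comb(cups, k) * p_chance**k * (1 - p_chance) ** (cups - k)
        for k in range(correct, cups + 1)
    )


# Hypothetical test: 20 cups, 15 identified correctly.
cups, correct, alpha = 20, 15, 0.05

one_tail = p_value_at_least(correct, cups)
# Doubling gives the two-tail value here because the p = 0.5 binomial is symmetric.
two_tail = min(1.0, 2 * one_tail)

print(f"one-tail p-value = {one_tail:.4f}")
print(f"two-tail p-value = {two_tail:.4f}")
print("reject the null" if two_tail < alpha else "fail to reject the null")
```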

False negatives, false positives

Using \(\alpha\) and a p-value, we can work out the chance of us saying there's an effect when there is none (a false positive). But what about a false negative? We could say there's no effect when there really is one. That might be as damaging to a business as a false positive. The quantity \(\beta\) gives us the probability of a false negative. By convention, statisticians talk about the power (1-\(\beta\)) of a test which is the probability of detecting an effect of the size you think is there.

Single tail or two-tail tests

Technically, the way the null hypothesis is set up in the case of the lady tasting tea is a two-tailed test. To ‘succeed’ she has to do a lot better than chance or she has to do a lot worse. That’s appropriate in this case because we’re trying to understand if she’s doing something other than guessing.

We could set up the test differently so she has to only be right more often than chance suggests. This would be a one-tail test. One-tail tests need fewer samples than two-tail tests, so they're shorter, but they’re more limited.

In business, we tend to do two-tailed tests rather than one-tailed tests.

Fail to reject the null or rejecting the null

Remember, we’re talking about probabilities and not certainties. Even if we gave our lady 100 cups to taste, there’s still a possibility she gets them all right due to chance alone. So we can’t say either the null or the alternate is true; all we can do is reject the null at some threshold, or fail to reject it. In the case of a p-value of 0.03, a statistician wouldn’t say the alternate is true (the lady can taste the difference), but they would say ‘we reject the null hypothesis’. If the p-value was 0.1, it would be higher than the \( \alpha \) value and we would ‘fail to reject the null hypothesis’. This language is complex, but statisticians are trying to capture the idea that results are about probabilities, not certainties.

Choice of significance and power

Significance and power affect test size, so maybe we should choose them to make the test short? If you want to do a valid test, you're not free to choose any values of \(\alpha\) and (1-\(\beta\)) you like. Convention dictates that you stick to these ranges:

  • \(\alpha \leq 0.05\) - anything greater than this is usually considered a junk test.
  • \((1-\beta) \geq 0.8\) - anything less than this is not worth doing. 

The why behind these values is the subject of another blog post.

The null hypothesis test summarized

This has been a very high-level summary of what happens in a null hypothesis test. For the sake of simplicity, there are several steps I've left out and I've greatly summarized some ideas. Here's a simple summary of the steps I've discussed.

  1. Decide if the test is one-tail or two-tail.
  2. Create a null and alternate hypothesis.
  3. Set values for \(\alpha\) and (1-\(\beta\)) prior to the test.
  4. After the test, calculate a p-value.
  5. Compare the p-value to \(\alpha\) to decide whether to reject the null hypothesis.
  6. Check \(\beta\) to figure out the probability of a false negative.

I've left out topics like the z-test and the t-test and a bunch of other important ideas. 

Your takeaway should be that this process is complex and there are no shortcuts. At its heart, hypothesis testing is about deciding what's true when the data is uncertain and you need to do it without bias.

(Justice is supposed to be blind and balanced - like a null hypothesis test. Image source: Wikimedia Commons. License:  GNU Free Documentation License.)

Problems with the null hypothesis test

Mathematically, there's controversy about the fundamentals of the procedure, but frankly, the controversy is too complex to discuss here - in any case, the controversy isn't over whether the procedures work or not.

A more serious problem is baked into the approach. At its heart, null hypothesis testing is about making a binary yes/no decision based on probabilistic data. The results are never certain. Unfortunately, test results are often taken as certain. For example, if we can't detect an effect in a test, it's often assumed there is no effect, but that's not true. This assumption that no detection = no effect has had tragic consequences in medical trials; there are high-profile cases where the negative side effects of a drug have been just below the threshold levels. Sadly, once the drugs have been released, the negative effects become well known with disastrous consequences, a good example being Vioxx.

You must be aware that a test failure doesn't mean there isn't an effect. It could mean there's an effect hovering just below your acceptance threshold.

Using the null hypothesis in business

This is all a bit abstract, so let's bring it back to business. What are some examples of null hypothesis tests in the business world?

A/B testing

Most of the time, we choose a two-tail test because we're interested in the possibility a change might make conversion or other metrics worse. The hypothesis test we use is usually of this form:

\(H_0 : CR_B = CR_A\)

\(H_1 : CR_B \neq CR_A\)

where CR is the conversion rate, or revenue per user per branch, or add to bag etc.
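As a rough sketch of what evaluating these hypotheses might look like for conversion rates, here's a two-tail, two-proportion z-test. I haven't covered the z-test in this post, so treat this as one possible approach and not the only one; the function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tail z-test of H0: CR_B = CR_A using a pooled standard error."""
    cr_a = conv_a / n_a
    cr_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (cr_b - cr_a) / std_err
    # Two-tail p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical A/B test: 5,000 visitors per branch.
alpha = 0.05
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject the null" if p_value < alpha else "fail to reject the null")
```

The sample sizes here are invented; in a real test, they would come from the \(\alpha\), power, and effect size chosen before the test starts.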

Manufacturing defects

Typically, these tests are one-tailed because we're only interested in an improvement. Here, the test might be:

\(H_0 : DR_B = DR_A\)

\(H_1 : DR_B < DR_A\)

where DR is the defect rate.

Closing thoughts

If all this seems a bit complex, arbitrary, and dependent on conventions, you're not alone. As it turns out, null hypothesis techniques are based on the shotgun marriage of two separate approaches to statistics. In a future blog post, I'll delve into this some more. 

For now, here's what you should take away:
  • You should understand that you need education and training to run these kinds of tests. A good grounding in statistics is vital.
  • The results are probabilistic and not certain. A negative test doesn't mean an effect isn't there, it might just be hovering underneath the threshold of detection.


Saturday, October 24, 2020

Frankenstein, vampire, and volcano: dinner at Lake Geneva

Sometimes, there are events that ripple through history and have effects hundreds of years later. I'm sure you're thinking of battles, or assassinations, or elections, or something noisy or violent. But smaller and more peaceful events can have big impacts; even something as innocuous as a single dinner party can change the world. We're approaching Halloween, so I'm going to tell you how a dinner party over two hundred years ago gave us two iconic horror legends and how brilliant people can have an impact on the world that outlives them. Let's start with who was at this dinner party.

(Frankenstein's Monster and Dracula. Image credit: Wikimedia Commons. License: public domain)

The players

Lord Byron. 'Mad, bad, and dangerous to know.' Lord Byron is regarded as one of the leading English poets and his poetry is still widely read today. To say his life was full is something of an understatement; he was at times a theater director, poet, revolutionary, political radical, and a sexual adventurer. As we'll see, some of his behavior was quite shocking, even by modern standards. By 1816, his 'relationship' escapades made England too uncomfortable for him, so he left.

Percy Bysshe Shelley. An English romantic poet, still regarded as one of the country's finest. Shelley was a political radical and didn't follow the social codes of the day. Although his behavior was considered scandalous, it wasn't at Byron's level of hedonism. By 1816, Shelley had left his wife and was in a relationship with Mary Godwin (later, Mary Shelley).

Mary Shelley. The daughter of the early feminist and radical Mary Wollstonecraft and the political philosopher William Godwin. Despite this illustrious heritage, she received little in the way of formal education. When she was 17, she fell in love with Shelley and ran off with him to Europe. In February 1815, Mary gave birth to a baby girl (Shelley's daughter), but the child died soon after birth.

Claire Clairmont. Mary Shelley's stepsister. She had a more formal education than Mary, including the ability to speak French (which she used to aid Mary and Percy Shelley's initial trip to Europe). She was pursuing an affair with Byron which was cooling by the time of the events I'm going to describe.

John Polidori. Byron's personal physician and only 20 years old during the trip and dinner party. By all accounts, Byron treated Polidori with contempt and constantly belittled him. Polidori had large gambling debts and was secretly in the pay of Byron's publisher, who gave him a £500 advance to keep notes on what went on; they were hoping for some salacious gossip. Polidori was also interested in Mary, who was not interested in him.

The volcano and its aftermath - the 'year without a summer'

In 1815, the volcano at Mount Tambora erupted. This was the largest volcanic eruption in modern times; it was heard 2,600 km away and pumped 41 cubic km of dust 43 km high into the atmosphere.

The huge amount of atmospheric dust had a dramatic impact; it reflected sunlight which resulted in global cooling. The loss of sunlight led to people calling 1816, 'the year without a summer', but it was worse than just bad summer holidays; crop failures led to famines worldwide which in turn led to political upheaval. Atmospheric dust gave spectacular sunsets which were captured by artists at the time.

The Napoleonic wars

Europe in the early part of the 19th century had been convulsed by war. Napoleon had run hugely successful military campaigns across the continent, leaving devastation behind him. In 1815, there was a final battle for supremacy, with Napoleon on one side and a 'coalition of the willing' on the other - this was the famous battle of Waterloo. Napoleon was finally defeated, but at a huge cost. The people and infrastructure of continental Europe suffered the ravages of war.

Europe needed to recover, and that takes some good fortune and time. Unfortunately, the after-effects of the volcano led to crop failures and famines just at the time when good conditions were needed. 

Culturally, there was a strong feeling of the end of days; war, crop failures, wild weather, and outstanding sunsets. These were not normal times.

The Lake Geneva holiday

Our five adventurers had decided on taking a European vacation together. They'd traveled across Europe and met up at Lake Geneva, Switzerland, where they rented adjoining properties. The original intent was some pleasurable diversions like boating and sightseeing, but the miserable conditions meant they had to stay inside. Instead of warm, bright summer evenings, they had conditions more like winter.

Tensions were high between the five of them. Claire was still pursuing Byron, who wasn't interested except when he was. Byron was busy demeaning and belittling Polidori. Polidori was chasing Mary who wasn't interested in being chased. To add to the fun, Shelley was trying hard to impress Byron. Of course, a general dread hung heavy in the air: everyone knew about war, crop failures, and political upheaval.

As you might expect with such a group of people, the conversation was wide-ranging, varying from folklore to politics to science. Shelley and Byron were prone to flights of fancy in their discussions; once, when Byron was reading a ghost story, Shelley imagined a woman with eyes in place of nipples and ran screaming from the room. Polidori was more scientific, and with him they spoke of Galvani's experiments on frogs' legs, making them kick by applying electricity; had Galvani discovered the life force? Of course, there were also discussions of the latest political ideas and the concept of free will. All in all, a heady atmosphere.

One night, after a reading of ghost stories, they decided on a contest: who could write the best horror story? Of course, the expectation was that Byron and Shelley would win, but that's not what happened. The following morning, Byron had a so-so effort, John Polidori had something better, and the best of all came from the least experienced writer: Mary Shelley.

The birth of Frankenstein

The ideas coalesced in Mary's head: free will, animating life force, the desire for love, ghost stories, wild human behavior, and the gothic feel of central Europe. She created a scientist, Frankenstein, who ignores society's moral code to play God and create life itself. The monster he created was never named in the book, but it's telling that our sympathies are with the poor, mistreated person; a monster in appearance, but an articulate feeling being.  Shelley tells a story of the creator's neglect of his creation and its effect on the monster, of how the monster has to educate himself, and how he later comes looking for his maker to create a partner for him; the monster is looking for love. We might even say the monster wants life, liberty, and the pursuit of happiness. 

The group declared Mary the winner and encouraged her to publish her story, which she did after the Shelleys' return to England.

Mary Shelley published the novel anonymously; it became a best-seller but received mixed reviews from literary critics. Once she was known as the author, some reviewers speculated that it might have been written by Percy Shelley rather than Mary. This seems like typical nineteenth-century sexism at first, but bear in mind that Percy Shelley was a well-known writer at the time and Mary was not. It does seem likely that she had some help with editing and maybe with writing suggestions, and why not when she had world-class writing talent on tap? The consensus today is that she was indeed the author.

The birth of vampires

Polidori's story was altogether different. He imagined a vampire, but not the vampire of old, which was an ugly, inhuman creature. His vampire was a man, but a man who was physically attractive, deeply manipulative, and preyed on women - and a lord as well. His role model was obviously Byron himself ('mad, bad, and dangerous to know'). This story, 'Vampyre', is credited with creating the elements of the modern vampire legend and was one of the inspirations for Bram Stoker's 'Dracula' almost 80 years later.

'Vampyre' was published without Polidori's permission and was initially credited to Byron, though both Byron and Polidori later claimed it was Polidori's work.

The aftermath

What happened after that fateful evening?

Claire Clairmont gave birth to Byron's daughter but, as a single mother at that time, she needed Byron to acknowledge and protect her child. Byron did acknowledge the child as his and took their daughter into his 'care'; he had Claire hand the baby over to him. He then placed his daughter with nuns in Italy and ignored her for the rest of her life. Byron didn't see her again, and he forbade Claire from seeing her too. Their child died in Italy at the age of 5. Claire later said of Byron that he gave her a few minutes of pleasure, but a lifetime of trouble.

Byron never returned to England; he moved around Europe in search of entertainment and engagement, eventually joining the Greek war of independence (from the Ottoman empire). He died at age 36 from sepsis while preparing to fight for Greece.

(As an aside, Byron already had a daughter from his wife (he was still married to her during the events of 1816) - whom he also ignored. His wife wanted nothing to do with Byron's wild extravagant ways and educated their daughter in science and mathematics.  Their daughter's name was Ada Lovelace, of computing fame.)

A little while after the dinner party, Byron fired Polidori, who returned to London. Polidori didn't enjoy the success he thought he deserved and it all became too much for him. At the age of 25, he committed suicide by drinking cyanide.

Shelley continued writing poetry. In December 1816, the body of Shelley's wife was discovered floating in the Serpentine in London. Now free to marry, he married Mary just a few weeks later. At the age of 29, he went sailing on the Gulf of La Spezia, Italy and died during a storm.

After Percy Shelley's death, Mary Shelley became a professional writer to support herself and her son.  Notably, she wrote more horror fiction, including The Last Man, the first dystopian science fiction novel. She died of a brain tumor at age 53.

The echoes of history

Of course, vampires and Frankenstein's monster live on to this day. There have been numerous books, plays, comics, TV series, and movies featuring one or both of them. In a few days' time, children impersonating them will knock on my door, I'll give them candy, and I'll think of how it all started at Lake Geneva over two hundred years ago.

Monday, October 19, 2020

Stylish Pandas in the frame

The data can't be right, it's so ugly

Despite what many technical people want to believe, well-presented data is more convincing than badly presented data. Unfortunately, the default way Pandas outputs dataframes as tables is ugly. I'm going to show you how to make Pandas dataframes (tables) very pretty and hopefully more convincing.

(A very attractive panda. Image source: Wikimedia Commons. Author: Christian Mehlführer. License: Creative Commons.)

Ugly Betty

My dataset is the results of the 2019 UK general election: the number of MPs and voters per party. Here's my Pandas dataframe (I've called it parliament for some reason):

                           party  MPs     votes  MPs frac  votes frac
0                   Conservative  365  13966565  0.561538    0.452447
1                         Labour  202  10269076  0.310769    0.332667
2           Scottish Nationalist   48   1242380  0.073846    0.040247
3              Liberal Democrats   11   3696423  0.016923    0.119746
4           Democratic Unionists    8    244128  0.012308    0.007909
5                      Sinn Fein    7    181853  0.010769    0.005891
6                    Plaid Cymru    4    153265  0.006154    0.004965
7   Social Democratic and Labour    2    118737  0.003077    0.003846
8                          Green    1    835579  0.001538    0.027069
9                       Alliance    1    134115  0.001538    0.004345
10                       Speaker    1     26831  0.001538    0.000869
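If you want to follow along, a dataframe like this can be rebuilt from the figures in the table. This is only a sketch, not necessarily how the data was originally loaded: it assumes each fraction is simply the column divided by the column total, which does reproduce the values shown.

```python
import pandas as pd

# Seats and votes copied from the table above
parliament = pd.DataFrame({
    "party": ["Conservative", "Labour", "Scottish Nationalist",
              "Liberal Democrats", "Democratic Unionists", "Sinn Fein",
              "Plaid Cymru", "Social Democratic and Labour", "Green",
              "Alliance", "Speaker"],
    "MPs": [365, 202, 48, 11, 8, 7, 4, 2, 1, 1, 1],
    "votes": [13966565, 10269076, 1242380, 3696423, 244128, 181853,
              153265, 118737, 835579, 134115, 26831],
})

# Assumption: each fraction is the column divided by its total
parliament["MPs frac"] = parliament["MPs"] / parliament["MPs"].sum()
parliament["votes frac"] = parliament["votes"] / parliament["votes"].sum()

print(parliament)
```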

If we output the dataframe to HTML using parliament.to_html(), here's what we get by default. It looks amateurish. Let's make it nicer.

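    party                         MPs  votes     MPs frac  votes frac
0   Conservative                  365  13966565  0.561538  0.452447
1   Labour                        202  10269076  0.310769  0.332667
2   Scottish Nationalist          48   1242380   0.073846  0.040247
3   Liberal Democrats             11   3696423   0.016923  0.119746
4   Democratic Unionists          8    244128    0.012308  0.007909
5   Sinn Fein                     7    181853    0.010769  0.005891
6   Plaid Cymru                   4    153265    0.006154  0.004965
7   Social Democratic and Labour  2    118737    0.003077  0.003846
8   Green                         1    835579    0.001538  0.027069
9   Alliance                      1    134115    0.001538  0.004345
10  Speaker                       1    26831     0.001538  0.000869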

Adding style

Pandas dataframes have a style property we can use to customize the appearance of the dataframe and its HTML rendering too. The style property returns a Styler object we can use to change the way the data is rendered as an HTML table. I'm going to add style and show you what the rendered HTML looks like.

Precision, thousands, and hiding the index

The fraction of votes and MPs has six decimal places, which is the default for Python formatting. Let's change the fractional numbers to three decimal places, introduce thousand separators for the number of votes, and hide the index. Here's the code to do it:
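    # format the columns, hide the index, render to HTML (Pandas 1.x API)
    parliament.style.format(
        {"MPs frac": "{:.3f}",
         "votes frac": "{:.3f}",
         "votes": "{:,}"}
    ).hide_index().render()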

In this case, the style.format code takes a dict argument. The dict keys are the dataframe column names and the dict values are the formatting instructions. Most Python formatters work with this method but some don't; for example, the alignment Python formatters don't work. Here's what the rest of the code means:

  • {:.3f} rounds the floating point numbers to three decimal places
  • {:,} introduces the thousand separator
  • hide_index hides the index (newer Pandas versions use hide(axis="index") instead)
  • render renders the table using HTML - it produces a string output of HTML text (newer Pandas versions use to_html() instead)
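The format strings are ordinary Python format specifications, so you can try them out without Pandas at all. Here's a small sketch using values from the Conservative row of the table; note that .3f rounds to the nearest value rather than cutting digits off.

```python
# Values taken from the Conservative row of the table
mps_frac = 0.561538
votes = 13966565

rounded = "{:.3f}".format(mps_frac)   # fixed-point, three decimal places
separated = "{:,}".format(votes)      # comma as the thousands separator

print(rounded)
print(separated)
```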

The arguments to format don't have to be a dict, but using a dict makes it easier if you're changing several columns at once.

Here's the HTML output from the code above. It's a big improvement, but not quite what we want.

party MPs votes MPs frac votes frac
Conservative 365 13,966,565 0.562 0.452
Labour 202 10,269,076 0.311 0.333
Scottish Nationalist 48 1,242,380 0.074 0.040
Liberal Democrats 11 3,696,423 0.017 0.120
Democratic Unionists 8 244,128 0.012 0.008
Sinn Fein 7 181,853 0.011 0.006
Plaid Cymru 4 153,265 0.006 0.005
Social Democratic and Labour 2 118,737 0.003 0.004
Green 1 835,579 0.002 0.027
Alliance 1 134,115 0.002 0.004
Speaker 1 26,831 0.002 0.001

Column alignment and spacing

Let's right-align the columns and add a bit more spacing between columns.
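    # format, hide the index, set cell CSS, then render (Pandas 1.x API)
    (parliament.style
        .format({"MPs frac": "{:.3f}",
                 "votes frac": "{:.3f}",
                 "votes": "{:,}"})
        .hide_index()
        .set_properties(**{'text-align': 'right',
                           'padding': '0 15px'})
        .render())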

The set_properties method sets CSS properties on the data cells of the table; here it applies to every cell, but you can limit it to particular columns with the subset argument.

Here's the output:

party MPs votes MPs frac votes frac
Conservative 365 13,966,565 0.562 0.452
Labour 202 10,269,076 0.311 0.333
Scottish Nationalist 48 1,242,380 0.074 0.040
Liberal Democrats 11 3,696,423 0.017 0.120
Democratic Unionists 8 244,128 0.012 0.008
Sinn Fein 7 181,853 0.011 0.006
Plaid Cymru 4 153,265 0.006 0.005
Social Democratic and Labour 2 118,737 0.003 0.004
Green 1 835,579 0.002 0.027
Alliance 1 134,115 0.002 0.004
Speaker 1 26,831 0.002 0.001


The political parties have colors, so it would be nice to show their party colors as a background to their names, meaning we should change the background colors of the party column. It might also be nice to highlight the maximum results in a light gray. Maybe we can get really clever and add a bar chart representing the number of seats won. Here's the code to do all that:

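# Restrict the width of the MPs column (col1) so the bar chart fits the page
styles = [dict(selector='.col1',
               props=[('width', '50px')])]

def colors(value):
    """Return the CSS background color rule for a party name."""
    partymap = {'Conservative': 'lightblue',
                'Labour': 'salmon',
                'Scottish Nationalist' : 'yellow',
                'Liberal Democrats': 'orange',
                'Democratic Unionists': 'orange',
                'Sinn Fein': 'lightgreen' ,
                'Plaid Cymru': 'lightgreen',
                'Social Democratic and Labour': 'salmon',
                'Green' : 'lightgreen',
                'Alliance': 'orange',
                'Speaker': 'lightgray'}
    return """background-color: {0}""".format(partymap[value])

# Call chain reconstructed from the description that follows (Pandas 1.x API).
# The exact highlight color is an assumption: the text only says
# "a very light gray".
(parliament.style
    .format({"MPs frac": "{:.3f}",
             "votes frac": "{:.3f}",
             "votes": "{:,}"})
    .hide_index()
    .set_properties(**{'text-align': 'right',
                       'padding': '0 15px'})
    .bar(subset=['MPs'], color='lightgray')
    .applymap(colors, subset=['party'])
    .highlight_max(color='#F0F0F0')
    .set_table_styles(styles)
    .render())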

Here's what this code does:

  • bar takes the data in the column and draws a bar chart based on it. It uses the full width of the column and expands the column if necessary, hence my need to restrict the column width to get the table to fit on the Blogger page correctly.
  • The party background colors are created with the applymap method, using the colors function applied to just the party column via the subset argument. (From Pandas 2.1, the same method is called map.)
  • The maximum value highlighting uses the built-in highlight_max method; I highlight the cells in a very light gray.
  • The method set_table_styles restricts the width of the MPs column so the page renders on Blogger; it uses a CSS selector to do it, and of course you could use the same approach for fine-grained formatting using CSS.
  • The subset argument restricts formatting to just the specified columns.

Here's what the final results look like:

party MPs votes MPs frac votes frac
Conservative 365 13,966,565 0.562 0.452
Labour 202 10,269,076 0.311 0.333
Scottish Nationalist 48 1,242,380 0.074 0.040
Liberal Democrats 11 3,696,423 0.017 0.120
Democratic Unionists 8 244,128 0.012 0.008
Sinn Fein 7 181,853 0.011 0.006
Plaid Cymru 4 153,265 0.006 0.005
Social Democratic and Labour 2 118,737 0.003 0.004
Green 1 835,579 0.002 0.027
Alliance 1 134,115 0.002 0.004
Speaker 1 26,831 0.002 0.001


It's nice that Pandas has this functionality, and it's nice that it's as extensive as it is, but there's a problem. The way style is implemented is inconsistent and hard to understand: for example, some but not all of the string formatters work, and there are two methods that do very similar things (set_table_styles and set_properties). In practice, getting good results takes more time and is harder than it needs to be. The code looks ungainly too. It is what it is for now.

Next steps

You can do some other clever things with style, like applying heatmaps or conditional table formatting. You can really make your data output stand out, but be careful, you can go overboard! To find out more, read the Pandas dataframe style documentation.