Showing posts with label data visualization. Show all posts

Monday, January 25, 2021

3D plotting: how hard can it be?

Why aren't 2D plots good enough?

Most data visualization problems involve some form of two-dimensional plotting, for example plotting sales by month. Over the last two hundred years, analysts have developed several different types of 2D plots, including scatter charts, line charts, and bar charts, so we have all the chart types we need for 2D data. But what happens if we have a 3D dataset? 

The dataset I'm looking at is English Premier League (EPL) results. I want to know how the full-time scores are distributed; for example, are there more 1-1 results than 2-1 results? I have three numbers: the full-time home goals (FTHG), the full-time away goals (FTAG), and the number of games that had that score. How can I present this 3D data in a meaningful way?

(You can't rely on 3D glasses to visualize 3D data. Image source: Wikimedia Commons, License: Creative Commons, Author: Oliver Olschewski)

Just the text

The easiest way to view the data is to create a table, so here it is. The columns are the away goals, the rows are the home goals, and the cell values are the number of matches with that result, so 778 is the number of matches with a score of 0-1.


This presentation is easy to do and relatively easy to interpret. I can see that 1-1 is the most popular score, followed by 1-0. I can also see that some scores just don't occur (9-9, for example) and that results with more than a handful of goals are very uncommon.
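A table like this comes from counting how often each (home goals, away goals) pair occurs. Here's a minimal Python sketch of that counting step. The `results` list is made-up sample data rather than real EPL results, and `score_table` is a hypothetical helper name.

```python
from collections import Counter

# Made-up sample results as (FTHG, FTAG) pairs, not real EPL data.
results = [(1, 1), (1, 0), (2, 1), (1, 1), (0, 0), (1, 0), (0, 1), (1, 1)]


def score_table(results):
    """Count matches per score and lay the counts out as a grid.

    Rows are home goals, columns are away goals, and each cell is the
    number of matches that finished with that score.
    """
    counts = Counter(results)
    max_home = max(home for home, _ in results)
    max_away = max(away for _, away in results)
    return [
        [counts[(home, away)] for away in range(max_away + 1)]
        for home in range(max_home + 1)
    ]


table = score_table(results)
for home, row in enumerate(table):
    print(home, row)
```

Scores that never occur show up as zeros because a `Counter` returns 0 for keys it hasn't seen.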

This is OK for a smallish dataset like this, but if there are hundreds of rows and/or columns, it's not really viable. So what can we do?

Heatmaps

A heatmap is a 2D map where the 3rd dimension is represented as color. The more intense (or lighter) the color, the higher the value. For this kind of plot to work, you do have to be careful about your color map. Usually, it's best to choose the intensity of just one color (e.g. shades of blue). In a few cases, multiple colors can work (colors for political parties), but those are the exceptions. 

Here's the same data plotted as a heatmap using the Brewer color palette "RdPu" (red-purple).

The plot does clearly show the structure. It's obvious there's a diagonal line beyond which no results occur. It's also obvious which scores are the most common. On the other hand, it's hard to get a sense of how quickly the frequency falls off because the human eye just isn't that sensitive to variations in color, but we could probably play around with the color scale to make the most important color variation occur over the range we're interested in. 

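This is an easy plot to make because it's part of R's ggplot2 package. Here's my code:

# ggplot2 provides the plotting functions; dplyr provides the %>% pipe
library(ggplot2)
library(dplyr)

# goal_distribution has one row per score: FTHG, FTAG, and Matches
plt_goal_heatmap <- goal_distribution %>% 
  ggplot(aes(FTHG, FTAG, fill=Matches)) + 
  geom_tile() +   
  scale_fill_distiller(palette = "RdPu") +
  ggtitle("Home/Away goal heatmap")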
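The two ideas behind the heatmap can be sketched in a few lines of Python. One is mapping a count to the intensity of a single color. The other is adjusting the scale so more of the color range falls where the data is interesting. This is an illustration only and isn't how ggplot2 computes its colors: the function names, the choice of blue (darker means a higher value), and the log scaling are all my assumptions.

```python
import math


def intensity(value, max_value, log_scale=False):
    """Map a count to an intensity between 0.0 and 1.0."""
    if max_value <= 0:
        return 0.0
    if log_scale:
        # log1p keeps zero counts at zero intensity and spreads out small counts.
        return math.log1p(value) / math.log1p(max_value)
    return value / max_value


def shade_of_blue(level):
    """Turn an intensity into a hex color: 0.0 is white, 1.0 is pure blue."""
    fade = round(255 * (1 - level))
    return f"#{fade:02x}{fade:02x}ff"


for count in (0, 10, 100, 1000):
    linear = intensity(count, 1000)
    logged = intensity(count, 1000, log_scale=True)
    print(count, shade_of_blue(linear), shade_of_blue(logged))
```

On the linear scale, a count of 10 out of 1,000 is almost indistinguishable from zero. On the log scale it gets a clearly visible shade, which is the kind of adjustment that can make a rapid falloff easier to see.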

Perspective plot

Another alternative is the perspective plot, which you can create in R using the 'persp' function. This is a surface plot, as you can see below.

You can change your perspective on the plot and view it from other angles, but even from this perspective, it's easy to see the very rapid falloff in frequency as the scores increase. 

However, I found this plot harder to use than the simple heatmap, and I found changing my viewing angle was awkward and time-consuming.

Here's my code in case it's useful to you:

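# spread() comes from the tidyr package
library(tidyr)

# Reshape the long table into a grid (rows: FTHG, columns: FTAG), then plot it
persp(x = seq(0, max(goal_distribution$FTHG)), 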
      y = seq(0, max(goal_distribution$FTAG)), 
      z = as.matrix(
        unname(
          spread(
            goal_distribution, FTAG, Matches, fill=0)[,-1])), 
      xlab = "FTHG", ylab = "FTAG", zlab = "Matches", 
      main = "Distribution of matches by score",
      theta = 60, phi = 20, 
      expand = 1, 
      col = "lightblue")
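The nested `spread` call is the least obvious part of this code. It reshapes the long table (one row per score) into a grid with one row per FTHG value and one column per FTAG value, filling scores that never occurred with zero. As a sketch of that reshaping step only, here is the same idea in plain Python. The `long_rows` data is made up and `to_grid` is a hypothetical helper.

```python
# Made-up long-format rows of (FTHG, FTAG, Matches), not real EPL data.
long_rows = [(0, 0, 5), (0, 1, 3), (1, 0, 7), (2, 1, 2)]


def to_grid(rows):
    """Pivot (home, away, matches) rows into a dense grid, zero-filled."""
    n_home = max(home for home, _, _ in rows) + 1
    n_away = max(away for _, away, _ in rows) + 1
    grid = [[0] * n_away for _ in range(n_home)]
    for home, away, matches in rows:
        grid[home][away] = matches
    return grid


grid = to_grid(long_rows)
print(grid)
```

Unlike `spread`, this version also creates rows and columns for goal counts that never appear in the data, so the grid always lines up with axes that run from 0 to the maximum.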

3D scatter plot

We can go one stage further and create a 3D scatter chart. On this chart, I've plotted the x, y, and z values and color-coded them so you get a sense of the magnitude of the z values. I've also connected the points to the axis (the zero plane if you like) to emphasize the data structure a bit more.



As with the persp function, you can change your perspective on the plot and view it from another angle.

The downside of this approach is that it requires the 'plot3D' library in R and, on a Mac, it requires you to install a new graphics server (XQuartz). It's a chunk of work to get to a visualization. The function to draw the plot is 'scatter3D'. Here's my code:

# scatter3D() comes from the plot3D package
library(plot3D)

scatter3D(x=goal_distribution$FTHG, 
          y=goal_distribution$FTAG, 
          z=goal_distribution$Matches, 
          xlab = "FTHG", ylab = "FTAG", zlab = "Matches",
          phi = 5, 
          theta = 40,
          bty = "g",  
          type = "h", 
          pch = 19,
          main="Distribution of matches by score",
          cex = 0.5)

What's my choice?

My goal was to understand the distribution of goals in the EPL, so what presentations of the data were most useful to me?

The simple table worked well and was the most informative, followed by the heatmap. I found both persp and scatter3D to be awkward to use and both consumed way more time than they were worth. The nice thing about the heatmap is that it's available as part of the wonderful ggplot2 library.

Bottom line: keep it simple.

Tuesday, October 6, 2020

Faster Python BI app development through code generation

Back to the future: design like it's 1999

Back in 1999, you could build Visual Basic apps by dragging and dropping visual components (widgets) onto a canvas. The Visual Basic IDE handled all the code generation, leaving you with the task of wiring up your new GUI to your business data. It wasn't just Visual Basic, though; you could do the same thing with Visual C++ and other Microsoft versions of languages. The generated code wasn't the prettiest, but it worked, and it meant you could get the job done quickly.

(Microsoft Visual Basic. Image credit: Microsoft.)

Roll forward twenty years. Python is now very popular and people are writing all kinds of software using it, including software that needs UIs. Of course, the UI front-end is now the browser, which is another change. Sadly, nothing like the UI-building capabilities of the Microsoft Visual Studio IDE exists for Python; you can't build Python applications by dragging and dropping widgets onto a canvas.

Obviously, BI tools like Tableau and Qlik fulfill some of the need to quickly build visualization tools; they've inherited the UI building crown from Microsoft. Unfortunately, they run out of steam when the analysis is complex; they have limited statistical capabilities and they're not good as general-purpose programming languages.

If your apps are 'simple', obviously, Tableau or Qlik are the way to go. But what happens if your apps involve more complex analysis, or if you have data scientists who know Python but not Tableau?

What would it take to make a Visual Basic or Tableau-like app builder for Python? Could we build something like it?

Start with the end in mind

The end goal is to have a drag-and-drop interface that looks something like this.

(draw.io. Image credit: draw.io.)

On the left-hand side of the screenshot, there's a library of widgets the user can drag and drop onto a canvas. 

Ideally, we'd like to be able to design a multi-tabbed application and move widgets onto each tab from a library. We'd do all the visualization layout on the GUI editor and maybe set up some of the properties for the widgets from the UI too. For example, we might set up the table column names, or give a chart a title and axis titles. When we're done designing, we could press a button and generate outline code that would create an application with the (dummy) UI we want.

A step further would be to import existing Python code into the UI editor and move widgets from tab to tab, add new widgets, or delete unwanted widgets.

Conceptually, all the technology to do this exists right now, just not in one place. Unfortunately, it would take considerable effort to produce something like it. 

If we can't go all the way, can we at least go part of the way?

A journey of a thousand miles begins with a single step

A first step is code generation from a specification. The idea is simple: you define your UI in a specification file that software uses to generate code. 

For this first simple step (and the end goal), there are two things to bear in mind:

  • Almost all UI-based applications can be constructed using a Model-View-Controller architecture (pattern) or something that looks like it.
  • Python widgets are similar and follow well-known rules. For example, the widgets in Bokeh follow an API; a button follows certain rules, a dropdown menu follows certain rules and so on.
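As a reminder of what the first point means in practice, here's a minimal Python sketch of the Model-View-Controller pattern. It's deliberately framework-free: the class names, method names, and sample data are my own, and in a real Bokeh app the View would build widgets rather than strings.

```python
class Model:
    """Holds the business data; knows nothing about the UI."""

    def __init__(self):
        # Made-up sample data for illustration.
        self.sales = {"Jan": 100, "Feb": 120}

    def total_sales(self):
        return sum(self.sales.values())


class View:
    """Presents data; here it just formats text instead of drawing widgets."""

    def render(self, total):
        return f"Total sales: {total}"


class Controller:
    """Wires the two together: reads from the Model, updates the View."""

    def __init__(self, model, view):
        self.model = model
        self.view = view

    def refresh(self):
        return self.view.render(self.model.total_sales())


controller = Controller(Model(), View())
print(controller.refresh())
```

Because the three roles are so regular, most of this structure can live in templates, with only the widget-specific parts filled in at generation time.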

Given that there are big patterns and small patterns here, we could use a specification file to generate code for almost all UI-based applications.

I've created software that does this, and I'm going to tell you about it.

JSON and the argonauts

Here's an overview of how my code generation software works.

  • The Model-View-Controller code exists as a series of templates, with key features added at code generation time.
  • The application is specified in a JSON file. The JSON file contains details of each tab in the application, along with details of the widgets on the tab. The JSON file must follow certain rules; for example, no duplicate names.
  • Most of the rules for code generation are in a JSON schema file that contains details for each Bokeh widget. For example, the JSON schema has rules for how to implement a button, including how to create a callback function for a button.

Here's how it works in practice.

  1. The user creates a specification file in JSON. The JSON file has details of:
    • The overall project (name, copyright, author etc.)
    • Overall data for each tab (e.g. name of each tab and a description of what it does).
    • For each tab, there's a specification for each widget, giving its name, its arguments, and a comment on what it does.
  2. The system checks the user's JSON specification file for consistency (well-formed JSON etc.)
  3. Using a JSON schema file that contains the rules for constructing Bokeh widgets, the system generates code for each Bokeh widget in the specification.
    • For each widget that could have a callback, the system generates the callback code.
    • For complex widgets like DataTable and FileInput, the system generates skeleton example code that shows how to implement the widget. In the DataTable case, it sets up a dummy data source and table columns.
  4. The system then adds the generated code to the Model-View-Controller templates and generates code for the entire project.
    • The generated code is PEP8 compliant by design.
    • The generated code is runnable, so you can test out how the UI looks.
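To make the steps concrete, here's a heavily simplified Python sketch of the same idea: check a specification, then fill in a template for each widget. It isn't the actual generator. The spec layout is modeled loosely on the excerpts in this post, and the template text, the `check_unique_names` and `generate_widget_code` helpers, and the variable-naming rule are all assumptions.

```python
import json
from string import Template

# A tiny, made-up specification for one tab with two widgets.
SPEC = json.loads("""
{
  "name": "Manage data",
  "widgets": [
    {"name": "Load data", "type": "Button",
     "arguments": {"label": "Load"}},
    {"name": "Election results", "type": "TextInput",
     "arguments": {"title": "Results in system", "value": "None"}}
  ]
}
""")

# A single template shared by every widget; a real generator would
# need richer, per-widget templates.
WIDGET_TEMPLATE = Template("self.$var = $widget_type($args)")


def check_unique_names(tab):
    """Step 2: reject specifications that reuse a widget name."""
    names = [widget["name"] for widget in tab["widgets"]]
    duplicates = {name for name in names if names.count(name) > 1}
    if duplicates:
        raise ValueError(f"Duplicate widget names: {sorted(duplicates)}")


def generate_widget_code(widget):
    """Step 3: turn one widget specification into a line of Python."""
    var = widget["name"].lower().replace(" ", "_")
    args = ", ".join(
        f"{key}={value!r}" for key, value in widget["arguments"].items()
    )
    return WIDGET_TEMPLATE.substitute(
        var=var, widget_type=widget["type"], args=args
    )


check_unique_names(SPEC)
generated = [generate_widget_code(widget) for widget in SPEC["widgets"]]
print("\n".join(generated))
```

The output is Python source text, one line per widget, which would then be dropped into the View template in step 4.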

Here's an excerpt from the JSON schema defining the rules for building widgets:

"allOf":[
  {
    "$comment":"███ Button ███",
    "if":{
      "properties":{
        "type":{
          "const":"Button"
        }
      }
    },
    "then":{
      "properties":{
        "name":{
          "$ref":"#/definitions/string_template_short"
        },
        "description":{
          "$ref":"#/definitions/string_template_long"
        },
        "type":{
          "$ref":"#/definitions/string_template_short"
        },
        "arguments":{
          "type":"object",
          "additionalProperties":false,
          "required":[
            "label"
          ],
          "properties":{
            "label":{
              "type":"string"
            },
            "sizing_mode":{
              "type":"string",
              "default":"stretch_width"
            },
            "button_type":{
              "type":"string",
              "default":"success"
            }
          }
        },
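The rules in this excerpt say that a Button must have a `label`, may not have arguments the schema doesn't list, and has defaults for `sizing_mode` and `button_type`. A generator could apply rules of this shape along the following lines. This is a hand-rolled Python sketch for illustration, not a JSON Schema validator and not the actual generator code. `BUTTON_RULES` and `apply_argument_rules` are hypothetical names, and filling in defaults is my assumption about how the `default` values might be used.

```python
# The "arguments" rules for a Button, copied from the schema excerpt.
BUTTON_RULES = {
    "type": "object",
    "additionalProperties": False,
    "required": ["label"],
    "properties": {
        "label": {"type": "string"},
        "sizing_mode": {"type": "string", "default": "stretch_width"},
        "button_type": {"type": "string", "default": "success"},
    },
}


def apply_argument_rules(arguments, rules):
    """Check a widget's arguments against the rules and fill in defaults."""
    allowed = rules["properties"]
    missing = [name for name in rules["required"] if name not in arguments]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    if not rules["additionalProperties"]:
        unknown = [name for name in arguments if name not in allowed]
        if unknown:
            raise ValueError(f"Unknown arguments: {unknown}")
    for name, value in arguments.items():
        rule = allowed.get(name, {})
        if rule.get("type") == "string" and not isinstance(value, str):
            raise ValueError(f"Argument {name!r} must be a string")
    # Start from the defaults, then let the user's values override them.
    completed = {
        name: rule["default"]
        for name, rule in allowed.items()
        if "default" in rule
    }
    completed.update(arguments)
    return completed


print(apply_argument_rules({"label": "Load data"}, BUTTON_RULES))
```

With rules like these held as data, supporting a new widget type means adding an entry to the schema rather than writing new generator code.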

Here's an excerpt from the JSON file defining an application's UI:

{
  "name":"Manage data",
  "description":"Panel to manage data sources.",
  "widgets":[
    {
      "name":"ECV year allocations",
      "description":"Displays the Electoral College Vote allocations by year.",
      "type":"TextInput",
      "disabled":true,
      "arguments":{
        "title":"Electoral College Vote allocations by year in system",
        "value":"No allocations in system"
      }
    },
    {
      "name":"Election results",
      "description":"Displays the election result years in the system.",
      "type":"TextInput",
      "disabled":true,
      "arguments":{
        "title":"Presidential Election results in system",
        "value":"No allocations in system"
      }

What this means in practice

Using this software, I can very rapidly prototype BI-like applications. The main task left is wiring up the widgets to the business data in the Model part of the Model-View-Controller architecture. This approach reduces the tedious part of UI development but doesn't entirely eliminate it. It also helps with widgets like DataTable that require a chunk of code to get them working - this software generates most of that code for you.

How things could be better

The software works, but not as well as it could:

  • It doesn't do layout. Laying out Bokeh widgets is a major nuisance and a time suck. 
  • The stubs for Bokeh DataTable are too short - ideally, the generated code should contain more detail which would help reduce the need to write code.
  • The Model-View-Controller architecture needs some cleanup.

The roadmap

I have a long shopping list of improvements:

  • Better Model-View-Controller
  • Robust exception handling in the generated code
  • Better stubs for Bokeh widgets like DataTable
  • Automatic Sphinx documentation
  • Layout automation

Is it worth it?

Yes and no.

For straightforward apps, it will still be several times faster to write apps in Tableau or Qlik. But if the app requires more statistical firepower, or complex analysis, or linkage to other systems, then Python wins and this approach is worth taking. If you have access to Python developers, but not Tableau developers, then once again, this approach wins.

Over the longer term, regardless of my efforts, I can clearly see Python tools evolving to the state where they can compete with Qlik and Tableau for speed of application development.

Maybe in five years' time, we'll have all of the functionality we had 25 years ago. What's old is new again.

Wednesday, May 6, 2020

Florence Nightingale, data analyst

Introduction - why do I care about Florence Nightingale, data analyst?

I've used statistics and data visualizations for a long time now, and in the last few years, I've become increasingly interested in where the methods I use come from. Who were the founding figures of statistics and visualization? Why was their work important? How did their work influence the world? As I've looked back in time, I've found the data science creation stories more interesting than I thought. There were real people who struggled to achieve their goals and used early data science methods to do so. One of these pioneers was Florence Nightingale, more famous for founding modern nursing, but a key figure in analytics and data visualization. What she did and why she did it have clear lessons for analysts today.

(Simon Harriyott from Uckfield, England, CC BY 2.0, via Wikimedia Commons)

Early life

Florence was born on May 12th, 1820, near Florence in Italy. Her parents were wealthy and very well-connected, two factors that were to have a big impact on her later life. As the second daughter, she was expected to have the learning of a woman of her station and to marry well; her family, especially her mother, had a very definite expectation of the role she was to fulfill. Her upbringing was almost like that of a character from a Jane Austen novel, which was to cause Florence mental health problems.

Initially, the family lived in a fifteen-bedroom house in Derbyshire, but this was too small for them (!) and they wanted to be nearer to London, so they moved to Embley in the New Forest. They also had an apartment in London and spent a lot of time in the city. Given the family connections and their time spent in London, it’s not surprising that Florence met many influential men and women growing up, including future prime ministers and a young Queen Victoria. This was to be crucially important to her later.

Up until she was 12, Florence was educated by a governess, then her father took over her education. Unusually for the time, her father believed in equality of education for women and put considerable effort into educating his daughters [Bostridge]. Notably, she received no formal schooling and never took anything like university lectures or courses. However, she had a precocious intellect and an appetite for statistics and data. When she was 17, the family took a six-month vacation to Italy, and along the way, Florence recorded their departure and arrival times and the distances they traveled, and kept notes on local conditions and laws [Bostridge, Huxley].

Throughout her life, she was deeply religious, and in her teenage years, she felt a call from God to do something useful; she wanted ‘some regular occupation, for something worth doing instead of frittering time away on useless trifles’ [Huxley]. On the 7th of February 1837, Florence recorded “...God spoke to me and called me to His service”, but what the form of that call was, Florence didn’t note [Bostridge]. This theme of a calling from God was to come up several times in her life.

Bear in mind, Florence’s life was a round of socializing to prepare her for an appropriate marriage, nothing more. For an intellectually gifted woman wanting to make a difference in the world, the tension between the life she wanted and the life she had was immense. It’s not a surprise to hear that she was often withdrawn and on the verge of a nervous breakdown; in modern times, she may well have been diagnosed with depression. By the age of 30, Florence wasn’t married, something that wasn’t respectable - however, she was to shock her family with a very disreputable request.

Introduction to nursing

Florence decided that nursing was her calling. Unfortunately, her parents violently objected, and with good reason.

At the time, nursing was considered a disreputable profession. Hospitals were filthy and nurses were both ill-trained and poorly educated. In many cases, their role was little more than cleaning up the hospital messes, and in the worst cases, they were promiscuous with doctors and surgeons [Huxley]. It was also known that nurses were present at operations, which in the 1850s were bloody, gruesome affairs. Even Charles Dickens had a poor view of nurses. In Martin Chuzzlewit, published in 1843, Dickens created a character, Sarah Gamp, who was sloppy, a drunk, and a nurse. Dickens was playing to a well-known stereotype and adding to it.

Nursing as a profession was about as far away from a suitable occupation for Florence as you can imagine. Her family knew all about nursing’s reputation and vigorously objected to Florence having anything to do with it. Her mother in particular opposed Florence learning or practicing nursing for a very long time, going as far as actively blocking Florence’s training. However, Florence could read about nursing and health, which she did copiously.

There was one bright nursing light; the Institution of Deaconesses at Kaiserswerth (Germany) was a quasi-religious institute that sought to improve nursing standards. Florence wanted to study there, but her parents stopped her. She managed to go for two weeks in 1850, but only with some shenanigans. Perhaps because of the deception, when she came back, she anonymously published a 32-page pamphlet on her experience, which is her first known published work [Nightingale 1851]. After some blazing stand-up rows with her mother, she finally went for three months of training in 1851. Bear in mind, her family still controlled her life, even at this late age.

The discipline at Kaiserswerth was harsh and the living conditions were spartan. Days consisted of prayer and patient support; in effect, it was living a religious life while learning nursing, fulfilling two of Florence’s needs. She learned the state of nursing as it stood at the time, even witnessing amputations and other operations, which would have horrified her parents had they known. However, Florence appreciated the limitations of the Kaiserswerth system.

On her return to Britain, her appetite for nursing wasn’t diminished; in fact, she read widely about nursing, disease in general, and statistics, broadening her knowledge base. What was missing was an opportunity to practice what she’d learned, which finally arrived in April 1853.

Through her extensive family connections, she was made superintendent of a new ‘Institution for the Care of Sick Gentlewomen’ based in Harley Street in London. This was a combination of hospital and recuperation unit for sick women, with the goal of providing a better standard of care than was currently offered. With Florence, the founders thought they were getting a hands-off lady of leisure; instead, they got a human dynamo who was waiting to put into practice years of learning and preparation. Not only did Florence do nursing, she also fought on committees to get the funding she needed, became a tough people manager, and put the institution’s finances in order. Under Florence’s guidance, the institution became groundbreaking in simple but effective ways; it treated its patients well, it was clean, and its nurses were professional.

Had she continued in Harley Street, she probably would have still been a founding figure of modern nursing, but events elsewhere were conspiring to thrust her into the limelight and make her a national hero.

The Crimean War

Britain has fought almost every country in Europe many times. Sometimes with the French and sometimes against the French. By the mid-1850s, Britain and France were becoming worried about the influence of Russia in the Middle East, which resulted in the Crimean War, where Britain and France fought Russia [Britannica]. This was a disastrous war for pretty much everyone.

(Siege of Sevastopol (1854–55), Franz Roubaud)

British troops were shipped to Turkey to fight the Russians. Unfortunately, cholera, diarrhea, and dysentery ripped through the men, resulting in large numbers of casualties before the war had even started; the men were too sick to fight. Of the 30,000 British troops dispatched to Turkey, 1,000 died of disease before a single shot was fired [Bostridge].

Hospitals were squalid and poorly equipped. The main British hospital at Scutari was a national shame; men were trying to recover from their injuries in filthy conditions with poor food and limited supplies. The situation was made worse by bureaucratic blundering and blind rule-following; there were instances of supplies left to rot because committees hadn’t approved their release. By contrast, the French were well-equipped and were running effective field hospitals.

In an early example of embedded journalism, William Howard Russell provided dispatches for The Times exposing the poor treatment of the troops, incompetent management, and even worse, the superiority of the French. His reports riled up the British people, who in turn pressured politicians to do something; it became politically imperative to take action [Huxley].

Florence in Crimea

War and medicine were male preserves, but politicians needed votes, meaning change came quickly. Russell’s dispatches made it clear that troops were dying in hospital, not on the battlefield, so medical support was needed. This is where Florence’s family connections came in. Sidney Herbert, Secretary at War, wrote to Florence asking her to run nursing operations in the Crimea. The War Office needed to give Florence a title, so they called her ‘Superintendent of the Female Nursing Establishment of the English General Military Hospitals in Turkey’. Nothing like this had ever been done before - women had never been sent to support war - which would cause problems later.

Florence was asked to recruit 50 nurses, but there were no female nurses at all in the British Army, and nursing was in its infancy. She found 14 women with hospital experience and several nuns from various religious orders - 38 women in total. On October 21st, 1854, this rag-tag army set out from England to go to the war in the Crimea.

The conditions they found in the barrack hospital at Scutari were shocking. The place was filthy and vermin-infested, rats were running around in plain view, and even the kitchens weren’t clean. Bedding and clothing weren’t washed, which meant soldiers preferred to keep their existing filthy bedding and clothing rather than changing them for someone else's equally unclean items; better to have your own lice bite you than someone else’s. Basics like furniture were in short supply; there weren’t even enough tables for operations. Soldiers were left untreated for long periods of time, and there were many cases when maggots weren’t cleaned out of wounds. Unsurprisingly, cholera and dysentery were rampant. The death rate was high. As a further twist, the military wasn’t even using the whole building; the cellars had refugees living in them, and there was a prostitution ring operating there [Huxley].


(The military hospital at Scutari. Image source: The Wellcome Collection. License: Creative Commons.)

Florence wanted to make a difference, but military rules and misogyny prevented her nurses from taking up their duties. Her title was “Superintendent of the Female Nursing Establishment of the English General Military Hospitals in Turkey”, but military orders didn’t say what she was to do. This was enough of an excuse for the (male) doctors and surgeons to block her nurses. Despite being blocked, the nurses did what they could to improve things, for example by ensuring clean bedding and better quality food.

Things changed, but for the worst reason. The Battle of Balaclava brought a tidal wave of wounded into the hospital, too many for the existing system to cope with, so the military gave in and let the women in. Florence’s nurses finally got to nurse.

Given her opportunity, Florence moved quickly to establish hygiene, cleanliness, and good nutrition. The rats were dispatched, the tenants in the basement were removed, and food quality was improved. Very unusually for the time, Florence insisted on hand washing, which of itself reduced the death rate [Globalhandwashing]. Back in London, The Times had established a fund to care for wounded soldiers, so Florence had a pot of money to spend as she chose, free of military rules. She set up contracts with local suppliers to improve the food supply, she set up washrooms to clean bedding and clothes, and she provided soldiers with new, clean clothing.

Her nurses tended to the men during the daytime, treating their wounds and ensuring they were clean and cared for. Florence’s administrative work tied her up in the daytime, but she was able to walk the wards at night to check on the men. She nursed them too and stayed with them as they died. Over the winter of 1854/1855, it’s estimated she saw something like 2,000 men die.

To light her way on her nocturnal rounds, she used a Turkish lamp. This is where the legend of the ‘lady with the lamp’ came from. Under desperate conditions, men would see a beacon of hope in the darkness. This is such a strong legend in UK culture that even 170 years later, it still resonates.

(Drawing of Florence doing her rounds. Illustrated London News, 24 Feb 1855. Source: Wikimedia Commons.)

The difference Florence’s nurses made was eagerly reported back to the British public, who were desperate for a good news story. The story was perfect: a heroine making a difference under terrible conditions while being blocked by the intransigence of military bureaucracy. The ‘lady with the lamp’ image sold well, and the donations came rolling in.

(A fanciful depiction of Florence doing her rounds. Creative Commons license.)

In May 1855, Florence got closer to the Crimean War when she toured Balaclava in the Crimea itself. Unfortunately, on 13th May 1855, she collapsed through exhaustion and became gravely ill, suffering fevers and delirium. The word was, she was close to death. On hearing of her condition, it’s said the patients in the Scutari hospital turned towards the wall and wept. Florence recovered, but she continued to suffer debilitating illness for the rest of her long life.

The war finally ended on 30th March 1856, and Florence returned to England in July of the same year. She left an unknown but came back a celebrity.

Florence as a data analyst and statistician

The Crimean War was a disaster for the British military and the public was angry; the political fall-out continued after the war was over and the poor medical treatment the troops received was a hot topic. After some delay, a “Royal Commission on the Health of the Army” was formed to investigate the health of the British Army, and Florence was its powerhouse. Sadly, as a woman, she couldn't formally be appointed to the Commission, so her role was less formal. Despite the informality, she was determined to prove her points with data and to communicate clearly with the public.

In the 1850s, statistics was in its infancy, but there were some early pioneers, including William Farr at the General Register Office, who was an early epidemiologist and one of the founders of medical statistics. Of course, Florence was a friend of Farr’s. Farr had introduced the idea of comparing the mortality rates of different occupations, which Florence was to run with [Cohen]. He also had a dismal view of data visualization, which Florence disagreed with.

Florence’s stand-out piece of work is her report “Mortality of the British Army: at home and abroad, and during the Russian war, as compared with the mortality of the civil population in England”, which was appended to the Commission's main report. She knew she needed to reach the general public, who wouldn’t read a huge and dull tome; she had to make an impact quickly and clearly, and she did so through the use of tables and data visualization. Bear in mind, the use of charts was in its infancy.

Here's one of the tables from her report; it's startlingly modern in its presentation. The key column is the one on the right, the excess of deaths in the army compared to the general population. The excess deaths weren't due to warfare.

Incredibly, the excess of deaths was due to disease as we can see in the table below. The death rate for the general population for 'chest and tubercular disease' was 4.5 per 1,000, but for the army, it was 10.1. Tubercular disease isn't a disease of war, it's a disease of poor living conditions and poor sanitation.

The report is full of these kinds of tables, presented in a clear and compelling way that helped tell the terrible story: the British Army was killing its own soldiers through neglect.

Of course, tables are dry; charts make a more immediate impression and Florence used bar charts to great effect. Here's a bar chart of death by age group for the British Army (red) and the general population (black). Bear in mind, the period leading up to the Crimean War was peaceful - there were no major engagements, so the excess deaths aren't battle casualties. In fact, as Florence showed in the tables and in the charts, these excess deaths were avoidable.

In private, Florence was more forceful about the effect of poor medical treatment on the strength of the army. Salisbury Plain was (and is) a big British Army practice area, and she said: "it is as criminal to have a mortality of 17, 19, and 20 per thousand in the Line, Artillery and Guards, when in civilian life it is only 11 per thousand as it would be to take 1,100 men every year out upon Salisbury Plain and shoot them" [Kopf].

The death toll is shocking in human terms, but it also has a profound impact in terms of the army's efficiency, fighting ability, and recruitment needs. Men dying early means a loss of experience and a continued high need for recruitment. Florence illustrated the impact of early deaths with a pair of charts I've shown below.

The chart on the left showed the effect of disease at home on the army. The chart on the right showed what would happen if death rates came down to those of the general population. If people didn't care about lives, they might care about the strength of the army and do something about medical care.

The Royal Commission wasn't the end of it. A little later, Florence produced yet another report, "Notes on matters affecting the health, efficiency, and hospital administration of the British Army: founded chiefly on the experience of the late war". This report is notable because it contains the famous coxcomb plot. If you read anything about Florence and visualization online, this is what you'll find. I'm going to take some time to explain it because it's so fundamental in the history of data visualization.

(I should note that Florence never called these plots coxcomb plots; the term came into use far later and not from her. However, the internet calls these charts coxcomb plots and I'm going to follow the herd for now.)

The visualization takes its name from the comb on a rooster's head.

(Image credit: Lander. Source. License Creative Commons.)

There are two coxcomb plots in the report, appearing on the same pull-out page. To make it easier to understand them, I'm going to show you the two plots separately.

The plot is divided into twelve segments, one for each month from April 1854 to March 1855. The area of each segment represents the number of deaths. The red wedges are deaths from wounds, the blue (gray in the image) represents deaths from preventable diseases, and the black wedges are deaths from other causes. You can plainly see the battle deaths. But what's really shocking is the number of deaths from preventable diseases. Soldiers are dying in battle, but many more of them are dying from preventable diseases. In other words, the soldiers didn't have to die.

Here's the other part of the diagram, from April 1855 to March 1856 (the end of the war) - not to scale with the previous plot.

Interestingly, Florence preferred the coxcomb plots to bar charts because she felt they were more mathematically accurate.
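Because the area of each wedge (not its radius) carries the value, the radius has to grow with the square root of the number of deaths. Here's a minimal sketch of that calculation, assuming twelve equal-angle wedges as in the plots above. The death counts are made-up placeholders, not Florence's figures, and the function name is my own.

```python
import math


def wedge_radius(value: float, n_wedges: int = 12) -> float:
    """Radius of a wedge whose AREA equals `value`.

    A wedge spanning angle theta has area (theta / 2) * r**2,
    so r = sqrt(2 * value / theta).
    """
    theta = 2 * math.pi / n_wedges
    return math.sqrt(2 * value / theta)


# Hypothetical monthly death counts, NOT Nightingale's data.
monthly_deaths = [100, 400, 900]
radii = [wedge_radius(d) for d in monthly_deaths]

# Four times the deaths gives twice the radius; nine times gives three times.
print([round(r / radii[0], 2) for r in radii])  # prints [1.0, 2.0, 3.0]
```

If you scaled the radius directly with the count instead, a month with four times the deaths would cover sixteen times the area and the chart would exaggerate the difference.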

Although William Farr was an advisor to Florence and involved in building the coxcomb plots, he wasn't a fan of data visualization. He advised her that 'statistics should be as dry as possible' [Bostridge]. But Florence's aim was influencing the public, not a stone-cold presentation of data. In the introduction, I said there were lessons that modern analysts could learn from Florence, and this is the key one: you have to communicate your results clearly to a general audience to influence opinion and effect change.

The lessons from Florence's analysis are very clear: the men in the British Army were dying through poor treatment. They were dying at home, and dying after battle. The disaster in the Crimea was avoidable.

The Commission had far-reaching effects, specifically, the radical restructuring of the British Army's healthcare system, including the construction of a new army hospital. Florence had firm views on hospital design, which the new hospital didn't meet. Unfortunately, by the time she was involved in the project, it was too late to change some of the design basics, but she did manage to make it less bad. Radical reform doesn't happen overnight, and that was the case here. 

Florence's friend, Lord Herbert, carried out a series of reforms over many years. Unfortunately, he died in 1861. Two years later, Florence published a monograph in his honor, "Army Sanitary Administration, and Its Reform under the Late Lord Herbert", which included more charts and data [McDonald]. As before, Florence's goal was communication, but this time communicating the impact her friend and collaborator had on saving lives.

Florence was famous by the 1860s, famous enough to have an early photograph taken.


Florence and nursing

Quite rightly, Florence is considered one of the founding figures of modern nursing. She wrote a short book (75 pages), called "Notes on nursing: what it is and what it is not", which was by far her most widely read publication and stayed in print for a long time. In 1860, St Thomas's Hospital in London opened a nursing school with Florence as an advisor. This was the "Nightingale Training School for Nurses", which was to set the standard for nursing education.

Florence and public health

The illness she picked up in the Crimea prevented her from traveling but didn't prevent her from absorbing data and influencing public health. In 1859, she took part in a Royal Commission, the "Royal Commission on the Sanitary State of the Army in India", which aimed to do for the British Army in India what the previous Royal Commission did for the Army in Britain. Sadly, the story was the same as in the Crimea: poor health leading to premature death. Once again, Florence illustrated her work with visualizations and statistics.

This report is notable for another type of visualization: woodcut drawings. Royal Commission reports are known to be dull, worthy affairs, but Florence wanted her work to be read and she knew she had to reach a wider audience (the same lesson about communicating effectively to create change). Her relative, Hilary Bonham Carter, drew the woodcuts she included in her report. The Treasury balked at the printing costs and wanted the report without the woodcuts, but Florence knew that some people would only read the report for the woodcuts, so she insisted they be included. Her decision was the right one: by communicating clearly, she was more effective in winning reforms.

(Image source: Wikimedia Commons)

Sadly, as a woman, Florence couldn't formally be part of the Commission, despite her huge input.

To use statistics to understand what's going on requires agreement and consistency in data collection. If different authorities record illnesses differently, then there can be no comparison and no change. Florence realized the need for consistent definitions of disease and proposed a classification scheme that was endorsed by the International Statistical Congress, held in London in 1860 [Magnello]. Sadly, only a few hospitals adopted her scheme and an opportunity to improve healthcare through data was lost.

Hospital design 

In 1859, Florence's writings on hospital design were consolidated into a book, 'Notes on Hospitals', which led her to become the leading authority on hospital design. Many British cities asked her to consult on their proposed hospital-building programs, as did the Government of India, the Queen of Holland, and the King of Portugal.

Decline and death

She never enjoyed good health after the Crimea, and never again traveled far from home. In her later years, she spent her time at home with her cats, occasionally doling out nursing or public health advice. In her last few years, her mental acuity fell away, and she retreated from public life. She died in 1910, aged 90.

(Florence shortly before her death in 1910. Lizzie Caswall Smith. Source: Wikimedia Commons.)

Florence as a Victorian

Florence was very much a product of her time and her class. She wasn't a feminist icon and she wasn't an advocate for the working classes - in many ways, she was the reverse [Stanley]. I've read some quotes from her which are quite shocking to modern ears [Bostridge]. However, I'm with the historians here: we have to understand people in their context and not expect them to behave in modern ways or judge them against modern standards.

Florence’s legacy

During her life, she received numerous honors, and the honors continued after her death.

The Royal Statistical Society was founded in 1834 as the Statistical Society of London, and Florence became its first female member in 1858 and was elected a Fellow in 1859. The American Statistical Association gave her honorary membership in 1874.

The Queen’s head appears on all British banknotes, but on the other side, there’s usually someone of historical note. On the £10 note, from 1975 to 1992, it was Florence Nightingale, the first woman to be featured on a banknote [BofE].

(UK £10 note)

For a very long time, many British hospitals have had a Nightingale ward. Things went a step further in response to the coronavirus pandemic; the British Army turned large conference centers into emergency hospitals for the infected, for example, the ExCel Center in London was turned into a hospital in nine days. Other large conference venues in the UK were also converted. The name of these hospitals? Nightingale Hospitals.

Her legend and what it says about society

Florence Nightingale is a revered figure in nursing, and rightly so, but her fame in the UK extends beyond the medical world to the general population. She’s known as the founder of nursing, and the story of the “lady with the lamp” still resonates. But less well-known is her analysis work on soldiers’ deaths during the war, her work on hospital design, and her role in improving public health. She probably saved more lives with her work after Crimea than she did during the Crimean War. Outside of the data analytics world, her ground-breaking visualizations are largely unknown. In my view, there’s definitely gender stereotyping going on; it’s fine for a woman to be a caring nurse, but not fine for her to be a pioneering public health analyst. Who society chooses as its heroes is very telling, but what society chooses to celebrate about them is even more telling.

The takeaways for analysts

I've read a lot on Florence's coxcomb charts, but less on her use of tables, and even less on her use of woodcut illustrations. The discussions mostly miss the point: Florence used these devices to communicate a clear message to a wide audience, and her message was all about the need for change. The diagrams weren't the goal, they were a means to an end. She spent a lot of time thinking about how to present data meaningfully, a lesson modern analysts should take to heart.

References

[BofE] https://www.bankofengland.co.uk/museum/noteworthy-women/historical-women-on-banknotes
[Bostridge] Mark Bostridge, “Florence Nightingale The Making Of An Icon”, Farrar, Straus, and Giroux, New York, 2008
[Britannica] https://www.britannica.com/event/Crimean-War
[Cohen] I Bernard Cohen, "Florence Nightingale", Scientific American, 250(3):128-137, March 1984 
[Kopf] Edwin Kopf, "Florence Nightingale as Statistician", Publications of the American Statistical Association, Vol. 15, No. 116 (Dec., 1916), pp. 388-404
[Globalhandwashing] https://globalhandwashing.org/about-handwashing/history-of-handwashing/
[Huxley] Elspeth Huxley, “Florence Nightingale”, G.P. Putnam’s Sons, New York, 1975
[Magnello] https://plus.maths.org/content/florence-nightingale-compassionate-statistician 
[McDonald] https://rss.onlinelibrary.wiley.com/doi/10.1111/1740-9713.01374
[Nightingale 1851] Florence Nightingale, “The institution of Kaiserswerth on the Rhine, for the practical training of deaconesses”, 1851
[Stanley] David Stanley, Amanda Sherratt, "Lamp light on leadership: clinical leadership and Florence Nightingale", Journal of Nursing Management, 18, 115–121, 2010

Tuesday, March 24, 2020

John Snow, cholera, and the origins of data science

The John Snow story is so well known, it borders on the cliched, but I discovered some twists and turns I hadn't known that shed new light on what happened and on how to interpret Snow's results. Snow's story isn't just a foundational story for epidemiology, it's a foundational story for data science too.


(Image credit: Cholera bacteria, CDC; Broad Street pump, Betsy Weber; John Snow, Wikipedia)

To very briefly summarize: John Snow was a nineteenth-century doctor with an interest in epidemiology and cholera. When cholera hit London in 1854, he played a pivotal role in understanding cholera in two quite different ways, both of which are early examples of data science practices.

The first way was his use of registry data recording the number of cholera deaths by London district. Snow was able to link the prevalence of deaths to the water company that supplied water to each district. The Southwark & Vauxhall water company sourced their water from a relatively polluted part of the river Thames, while the Lambeth water company took their water from a relatively unpolluted part of the Thames. As it turned out, there was a clear relationship between drinking water source and cholera deaths, with polluted water leading to more deaths.

This wasn't a randomized controlled trial, but was instead an early form of difference-in-difference analysis. Difference-in-difference analysis was popularized by Card and Krueger in the mid-1990s and is now widely used in econometrics and other disciplines. Notably, there are many difference-in-difference tutorials that use Snow's data set to teach the method.
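For anyone who hasn't met the method, here's a minimal sketch of the basic difference-in-difference calculation. I'm assuming the textbook setup: a "treated" group whose water source changed between two periods and a "control" group whose source didn't. The rates are made-up placeholders, not Snow's figures, and the function name is my own.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus the change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)


# Hypothetical death rates per 10,000 people, NOT Snow's data.
effect = diff_in_diff(
    treated_before=130, treated_after=85,    # source changed between periods
    control_before=135, control_after=145,   # source unchanged
)
print(effect)  # prints -55
```

Subtracting the control group's change strips out whatever affected both groups (a worse outbreak year, for example), leaving an estimate of the effect of the change in water source alone.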

I've reproduced one of Snow's key tables below; the most important piece is the summary at the bottom comparing deaths from cholera by water supply company. You can see the attraction of this dataset for data scientists: it's calling out for the use of groupby.
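Here's a minimal sketch of the group-by-company summary, written with the Python standard library so it runs anywhere; a dataframe groupby would do the same job. The rows, the column layout, and the per-10,000-houses rate are my assumptions for illustration, and the numbers are placeholders, not the figures from Snow's table.

```python
from collections import defaultdict

# Hypothetical rows: (district, water company, houses, cholera deaths).
# Placeholder values, NOT Snow's data.
rows = [
    ("District A", "Southwark & Vauxhall", 1000, 30),
    ("District B", "Southwark & Vauxhall", 3000, 90),
    ("District C", "Lambeth", 2000, 8),
    ("District D", "Lambeth", 2000, 6),
]

# Group by water company and sum houses and deaths within each group.
totals = defaultdict(lambda: {"houses": 0, "deaths": 0})
for _district, company, houses, deaths in rows:
    totals[company]["houses"] += houses
    totals[company]["deaths"] += deaths

# Summarize each group as deaths per 10,000 houses.
rates = {
    company: 10_000 * t["deaths"] / t["houses"]
    for company, t in totals.items()
}
for company, rate in rates.items():
    print(f"{company}: {rate:.0f} deaths per 10,000 houses")
```

Summing deaths and houses before dividing matters: averaging the district-level rates instead would give small districts the same weight as large ones.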

The second way is a more dramatic tale and guaranteed his continuing fame. In 1854, there was an outbreak of cholera in the Golden Square part of Soho in London. Right from the start, Snow suspected the water pump at Broad Street was the source of the infection. Snow conducted door-to-door inquiries, asking what people ate and drank. He was able to establish that people who drank water from the pump died at a much higher rate than those that did not. The authorities were desperate to stop the infection, and despite the controversial nature of Snow's work, they listened and took action; famously, they removed the pump handle and the cholera outbreak stopped.

Snow continued his analysis after the pump handle was removed and wrote up his results (along with the district study I mentioned above) in a book published in 1855. In the second edition of his book, he included his famous map, which became an iconic data visualization for data science. 

Snow knew where the water pumps were and knew where deaths had occurred. He merged this data into a map-bar chart combination; he started with a street map of the Soho area and placed a bar for each death that occurred at an address. His map showed a concentration of deaths near the Broad Street pump.
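The merge behind the map can be sketched in a few lines: count deaths per address to get the bar heights, then link each death to its closest pump. This is a minimal sketch, not Snow's method. The coordinates are made-up points on an arbitrary grid, the second pump name is hypothetical, and I use straight-line distance where walking distance along the streets would be more faithful.

```python
import math
from collections import Counter

# Hypothetical pump locations on an arbitrary grid, NOT taken from Snow's map.
pumps = {"Broad Street": (0.0, 0.0), "Other pump": (10.0, 0.0)}

# One entry per death, recorded as the (x, y) position of the address.
deaths = [(1.0, 1.0), (1.0, 1.0), (2.0, -1.0), (9.0, 1.0)]

# Bar height at each address = number of deaths recorded there.
bars = Counter(deaths)


def nearest_pump(address):
    """Name of the pump with the smallest straight-line distance to the address."""
    return min(pumps, key=lambda name: math.dist(address, pumps[name]))


# Tally deaths by closest pump to see where they concentrate.
deaths_by_pump = Counter(nearest_pump(address) for address in deaths)

print(bars[(1.0, 1.0)])                # prints 2
print(deaths_by_pump["Broad Street"])  # prints 3
```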

I've reproduced a section of his map below. The Broad Street pump I've highlighted in red and you can see a high concentration of deaths nearby. There are two properties that suffered few deaths despite being near the pump, the workhouse and the brewery. I've highlighted the workhouse in green. Despite housing a large number of people, few died. The workhouse had its own water supply, entirely separate from the Broad Street pump. The brewery (highlighted in yellow) had no deaths either; they supplied their workers with free beer (made from boiled water).


(Source: adapted from Wikipedia)

I've been fascinated with this story for a while now, and recent events caused me to take a closer look. There's a tremendous amount of this story that I've left out, including:

  • The cholera bacteria and the history of cholera infections.
  • The state of medical knowledge at the time and how the prevailing theory blocked progress on preventing and treating cholera.
  • The intellectual backlash against John Snow.
  • The 21st century controversy surrounding the John Snow pub.

I've written up the full story in a longer article you can get from my website. Here's a link to my longer article.