Showing posts with label data visualization. Show all posts

Wednesday, May 6, 2020

Florence Nightingale, data analyst

Introduction - why do I care about Florence Nightingale, data analyst?

I've used statistics and data visualizations for a long time now, and in the last few years, I've become increasingly interested in where the methods I use come from. Who were the founding figures of statistics and visualization? Why was their work important? How did their work influence the world? As I've looked back in time, I've found the data science creation stories more interesting than I thought. There were real people who struggled to achieve their goals and used early data science methods to do so. One of these pioneers was Florence Nightingale, more famous for founding modern nursing, but a key figure in analytics and data visualization. What she did and why she did it have clear lessons for analysts today.

(Simon Harriyott from Uckfield, England, CC BY 2.0, via Wikimedia Commons)

Early life

Florence was born on May 12th, 1820, near Florence in Italy. Her parents were wealthy and very well-connected, two factors that were to have a big impact on her later life. As the second daughter, she was expected to have the learning of a woman of her station and to marry well; her family, especially her mother, had a very definite expectation of the role she was to fulfill. Her upbringing was almost like a character from a Jane Austen novel, which was to cause Florence mental health problems.

Initially, the family lived in a fifteen-bedroom house in Derbyshire, but this was too small for them (!) and they wanted to be nearer to London, so they moved to Embley in the New Forest. They also had an apartment in London and spent a lot of time in the city. Given the family connections and their time spent in London, it’s not surprising that Florence met many influential men and women growing up, including future prime ministers and a young Queen Victoria. This was to be crucially important to her later.

Up until she was 12, Florence was educated by a governess, then her father took over her education. Unusually for the time, her father believed in equality of education for women and put considerable effort into educating his daughters [Bostridge]. Notably, she received no formal schooling and never took anything like university lectures or courses; however, she had a precocious intellect and an appetite for statistics and data. When she was 17, the family took a six-month vacation to Italy, and along the way, Florence recorded their departure and arrival times and the distances they traveled, and kept notes on local conditions and laws [Bostridge, Huxley].

Throughout her life, she was deeply religious, and in her teenage years, she felt a call from God to do something useful; she wanted ‘some regular occupation, for something worth doing instead of frittering time away on useless trifles’ [Huxley]. On the 7th of February 1837, Florence recorded “...God spoke to me and called me to His service”, but what the form of that call was, Florence didn’t note [Bostridge]. This theme of a calling from God was to come up several times in her life.

Bear in mind, Florence’s life was a round of socializing to prepare her for an appropriate marriage, nothing more. For an intellectually gifted woman wanting to make a difference in the world, the tension between the life she wanted and the life she had was immense. It’s not a surprise to hear that she was often withdrawn and on the verge of a nervous breakdown; in modern times, she may well have been diagnosed with depression. By the age of 30, Florence wasn’t married, something that wasn’t respectable - however, she was to shock her family with a very disreputable request.

Introduction to nursing

Florence decided that nursing was her calling. Unfortunately, her parents violently objected, and with good reason.

At the time, nursing was considered a disreputable profession. Hospitals were filthy and nurses were both ill-trained and poorly educated. In many cases, their role was little more than cleaning up the hospital messes, and in the worst cases, they were promiscuous with doctors and surgeons [Huxley]. It was also known that nurses were present at operations, which in the 1850s were bloody, gruesome affairs. Even Charles Dickens had a poor view of nurses. In Martin Chuzzlewit, published in 1843, Dickens created a character, Sarah Gamp, who was sloppy, a drunk, and a nurse. Dickens was playing to a well-known stereotype and adding to it.

Nursing as a profession was about as far away from a suitable occupation for Florence as you can imagine. Her family knew all about nursing’s reputation and vigorously objected to Florence having anything to do with it. Her mother in particular opposed Florence learning or practicing nursing for a very long time, going as far as actively blocking Florence’s training. However, Florence could read about nursing and health, which she did copiously.

There was one bright nursing light: the Institution of Deaconesses at Kaiserswerth (Germany) was a quasi-religious institute that sought to improve nursing standards. Florence wanted to study there, but her parents stopped her. She managed to go for two weeks in 1850, but only with some shenanigans. Perhaps because of the deception, when she came back, she anonymously published a 32-page pamphlet on her experience, which is her first known published work [Nightingale 1851]. After some blazing stand-up rows with her mother, she finally went for three months of training in 1853. Bear in mind, her family still controlled her life, even at this late age.

The discipline at Kaiserswerth was harsh and the living conditions were spartan. Days consisted of prayer and patient support; in effect, it was living a religious life while learning nursing, fulfilling two of Florence’s needs. She learned the state of nursing as it stood at the time, even witnessing amputations and other operations, which would have horrified her parents had they known. However, Florence appreciated the limitations of the Kaiserswerth system.

On her return to Britain, her appetite for nursing wasn’t diminished; in fact, she read widely about nursing, disease in general, and statistics - broadening her knowledge base. What was missing was an opportunity to practice what she’d learned, which finally arrived in April 1853.

Through her extensive family connections, she was made superintendent of a new ‘Institution for the Care of Sick Gentlewomen’ based in Harley Street in London. This was a combination of hospital and recuperation unit for sick women, with the goal of providing a better standard of care than was currently offered. With Florence, the founders thought they were getting a hands-off lady of leisure; instead, they got a human dynamo who was waiting to put into practice years of learning and preparation. Not only did Florence do nursing, she also fought on committees to get the funding she needed, became a tough people manager, and put the institution’s finances in order. Under Florence’s guidance, the institution became groundbreaking in simple but effective ways: it treated its patients well, it was clean, and its nurses were professional.

Had she continued in Harley Street, she probably would have still been a founding figure of modern nursing, but events elsewhere were conspiring to thrust her into the limelight and make her a national hero.

The Crimean War

Britain has fought almost every country in Europe many times. Sometimes with the French and sometimes against the French. By the mid-1850s, Britain and France were becoming worried about the influence of Russia in the Middle East, which resulted in the Crimean War, where Britain and France fought Russia [Britannica]. This was a disastrous war for pretty much everyone.

Painting of the Siege of Sevastopol
(Siege of Sevastopol (1854–55), Franz Roubaud)

British troops were shipped to Turkey to fight the Russians. Unfortunately, cholera, diarrhea, and dysentery ripped through the men, resulting in large numbers of casualties before the war had even started; the men were too sick to fight. Of the 30,000 British troops dispatched to Turkey, 1,000 died of disease before a single shot was fired [Bostridge].

Hospitals were squalid and poorly equipped, and the main British hospital at Scutari was a national shame; men were trying to recover from their injuries in filthy conditions with poor food and limited supplies. The situation was made worse by bureaucratic blundering and blind rule-following: there were instances of supplies left to rot because committees hadn’t approved their release. By contrast, the French were well-equipped and were running effective field hospitals.

In an early example of embedded journalism, William Howard Russell provided dispatches for The Times exposing the poor treatment of the troops, incompetent management, and even worse, the superiority of the French. His reports riled up the British people, who in turn pressured politicians to do something; it became politically imperative to take action [Huxley].

Florence in Crimea

War and medicine were male preserves, but politicians needed votes, meaning change came quickly. Russell’s dispatches made it clear that troops were dying in hospital, not on the battlefield, so medical support was needed. This is where Florence’s family connections came in. Sidney Herbert, Secretary at War, wrote to Florence asking her to run nursing operations in the Crimea. The War Office needed to give Florence a title, so they called her ‘Superintendent of the Female Nursing Establishment of the English General Military Hospitals in Turkey’. Nothing like this had ever been done before - women had never been sent to support war - which would cause problems later.

Florence was asked to recruit 50 nurses, but there were no female nurses at all in the British Army, and nursing was in its infancy. She found 14 women with hospital experience and several nuns from various religious orders - 38 women in total. On October 21st, 1854, this rag-tag army set out from England to go to the war in the Crimea.

The conditions they found in the barrack hospital at Scutari were shocking. The place was filthy and vermin-infested, rats were running around in plain view, and even the kitchens weren’t clean. Bedding and clothing weren’t washed, which meant soldiers preferred to keep their existing filthy bedding and clothing rather than changing them for someone else's equally unclean items - better to have your own lice bite you than someone else’s. Basics like furniture were in short supply; there weren’t even enough tables for operations. Soldiers were left untreated for long periods of time, and there were many cases when maggots weren’t cleaned out of wounds. Unsurprisingly, cholera and dysentery were rampant. The death rate was high. As a further twist, the military wasn’t even using the whole building; the cellars had refugees living in them, and there was a prostitution ring operating there [Huxley].


(The military hospital at Scutari. Image source: The Wellcome Collection. License: Creative Commons.)

Florence wanted to make a difference, but military rules and misogyny prevented her nurses from taking up their duties. Her title was, “Superintendent of the Female Nursing Establishment of the English General Hospitals in Turkey”, but military orders didn’t say what she was to do. This was enough of an excuse for the (male) doctors and surgeons to block her nurses. Despite being blocked, the nurses did what they could to improve things, by ensuring clean bedding and better quality food for example.

Things changed, but for the worst reason. The Battle of Balaclava brought a tidal wave of wounded into the hospital, too many for the existing system to cope with, so the military gave in and let the women in. Florence’s nurses finally got to nurse.

Given her opportunity, Florence moved quickly to establish hygiene, cleanliness, and good nutrition. The rats were dispatched, the tenants in the basement were removed, and food quality was improved. Very unusually for the time, Florence insisted on hand washing, which of itself reduced the death rate [Globalhandwashing]. Back in London, The Times had established a fund to care for wounded soldiers, so Florence had a pot of money to spend as she chose, free of military rules. She set up contracts with local suppliers to improve the food supply, she set up washrooms to clean bedding and clothes, and she provided soldiers with new, clean clothing.

Her nurses tended to the men during the daytime, treating their wounds and ensuring they were clean and cared for. Florence’s administrative work tied her up in the daytime, but she was able to walk the wards at night to check on the men. She nursed them too and stayed with them as they died. Over the winter of 1855/1856, it’s estimated she saw something like 2,000 men die.

To light her way on her nocturnal rounds, she used a Turkish lamp. This is where the legend of the ‘lady with the lamp’ came from. Under desperate conditions, men would see a beacon of hope in the darkness. This is such a strong legend in UK culture that even 170 years later, it still resonates.

Drawing of Florence doing her rounds
(Illustrated London News, 24 Feb 1855, Source: Wikimedia Commons)

The difference Florence’s nurses made was eagerly reported back to the British public, who were desperate for a good news story. The story was perfect: a heroine making a difference under terrible conditions while being blocked by the intransigence of military bureaucracy. The ‘lady with the lamp’ image sold well, and the donations came rolling in.

A highly fanciful representation of Florence
(A fanciful depiction of Florence doing her rounds. Creative Commons license.)

In May 1855, Florence got closer to the Crimean War when she toured Balaclava in the Crimea itself. Unfortunately, on 13th May 1855, she collapsed through exhaustion and became gravely ill, suffering fevers and delirium. The word was, she was close to death. On hearing of her condition, it’s said the patients in the Scutari hospital turned towards the wall and wept. Florence recovered, but she continued to suffer debilitating illness for the rest of her long life.

The war finally ended on 30th March 1856, and Florence returned to England in July of the same year. She left an unknown but came back a celebrity.

Florence as a data analyst and statistician

The Crimean War was a disaster for the British military and the public was angry; the political fall-out continued after the war was over and the poor medical treatment the troops received was a hot topic. After some delay, a “Royal Commission on the Health of the Army” was formed to investigate the health of the British Army, and Florence was its powerhouse. Sadly, as a woman, she couldn't formally be appointed to the Commission, so her role was less formal. Despite the informality, she was determined to prove her points with data and to communicate clearly with the public.

In the 1850s, statistics was in its infancy, but there were some early pioneers, including William Farr at the General Register Office, who was an early epidemiologist and one of the founders of medical statistics. Of course, Florence was a friend of Farr’s. Farr had introduced the idea of comparing the mortality rates of different occupations, which Florence was to run with [Cohen]. He also had a dismal view of data visualization, which Florence disagreed with.

Florence’s stand-out piece of work is her report “Mortality of the British Army: at home and abroad, and during the Russian war, as compared with the mortality of the civil population in England.” which was appended to the Commission's main report. She knew she needed to reach the general public who wouldn’t read a huge and dull tome, she had to make an impact quickly and clearly, and she did so through the use of tables and data visualization. Bear in mind, the use of charts was in its infancy.

Here's one of the tables from her report; it's startlingly modern in its presentation. The key column is the one on the right, the excess of deaths in the army compared to the general population. The excess deaths weren't due to warfare.

Incredibly, the excess of deaths was due to disease, as we can see in the table below. The death rate for the general population for 'chest and tubercular disease' was 4.5 per 1,000, but for the army, it was 10.1. Tubercular disease isn't a disease of war; it's a disease of poor living conditions and poor sanitation.
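The comparison in that table comes down to simple arithmetic. Here's a minimal Python sketch that uses only the two rates quoted above; the variable names are mine, not Florence's.

```python
# Death rates per 1,000 for 'chest and tubercular disease', as quoted above.
civilian_rate = 4.5
army_rate = 10.1

# Excess deaths per 1,000 men: how many more soldiers died than civilians.
excess_per_1000 = army_rate - civilian_rate

# Rate ratio: how many times higher the army's death rate was.
rate_ratio = army_rate / civilian_rate

print(f"Excess deaths per 1,000: {excess_per_1000:.1f}")
print(f"Army rate is {rate_ratio:.1f} times the civilian rate")
```

The subtraction gives the 'excess' column of the kind Florence put in her tables; the ratio is just another way of expressing the same gap.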

The report is full of these kinds of tables, presented in a clear and compelling way that helped tell the terrible story: the British Army was killing its own soldiers through neglect.

Of course, tables are dry; charts make a more immediate impression and Florence used bar charts to great effect. Here's a bar chart of death by age group for the British Army (red) and the general population (black). Bear in mind, the period leading up to the Crimean War was peaceful - there were no major engagements, so the excess deaths aren't battle casualties. In fact, as Florence showed in the tables and in the charts, these excess deaths were avoidable.

In private, Florence was more forceful about the effect of poor medical treatment on the strength of the army. Salisbury Plain was (and is) a big British Army practice area, and she said: "it is as criminal to have a mortality of 17, 19, and 20 per thousand in the Line, Artillery and Guards, when in civilian life it is only 11 per thousand as it would be to take 1,100 men every year out upon Salisbury Plain and shoot them" [Kopf].

The death toll is shocking in human terms, but it also has a profound impact in terms of the army's efficiency, fighting ability, and recruitment needs. Men dying early means a loss of experience and a continued high need for recruitment. Florence illustrated the impact of early deaths with a pair of charts I've shown below.

The chart on the left showed the effect of disease at home on the army. The chart on the right showed what would happen if death rates came down to those of the general population. If people didn't care about lives, they might care about the strength of the army and do something about medical care.

The Royal Commission wasn't the end of it. A little later, Florence produced yet another report, "Notes on matters affecting the health, efficiency, and hospital administration of the British Army: founded chiefly on the experience of the late war". This report is notable because it contains the famous coxcomb plot. If you read anything about Florence and visualization online, this is what you'll find. I'm going to take some time to explain it because it's so fundamental in the history of data visualization.

(I should note that Florence never called these plots coxcomb plots; the use of the term came far later and not from her. However, the internet calls these charts coxcomb plots and I'm going to follow the herd for now.)

The visualization takes its name from the comb on a rooster's head.

(Image credit: Lander. Source. License Creative Commons.)

There are two coxcomb plots in the report, appearing on the same pull-out page. To make it easier to understand them, I'm going to show you the two plots separately.

The plot is divided into twelve segments, one for each month from April 1854 to March 1855. The area of each segment represents the number of deaths. The red wedges are deaths from wounds, the blue (gray in the image) represents deaths from preventable diseases, and the black wedges are deaths from other causes. You can plainly see the battle deaths. But what's really shocking is the number of deaths from preventable diseases. Soldiers are dying in battle, but many more of them are dying from preventable diseases. In other words, the soldiers didn't have to die.

Here's the other part of the diagram, from April 1855 to March 1856 (the end of the war) - not to scale with the previous plot.

Interestingly, Florence preferred the coxcomb plots to bar charts because she felt they were more mathematically accurate.
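The key design choice in these plots is that area, not radius, encodes the count. Here's a minimal sketch of that geometry in Python, assuming twelve equal-angle wedges measured from the center; the death counts are hypothetical placeholders, not Florence's data, and the function name is mine.

```python
import math

WEDGES = 12                      # one wedge per month
ANGLE = 2 * math.pi / WEDGES     # every wedge spans the same angle


def wedge_radius(deaths: float, angle: float = ANGLE) -> float:
    """Radius of a circular sector whose area equals `deaths`.

    Sector area is (angle / 2) * r**2, so r = sqrt(2 * area / angle).
    """
    return math.sqrt(2 * deaths / angle)


# Hypothetical monthly counts, for illustration only.
example_deaths = [100, 400, 900]
radii = [wedge_radius(d) for d in example_deaths]

# Quadrupling the deaths only doubles the radius.
for deaths, radius in zip(example_deaths, radii):
    print(f"{deaths:>4} deaths -> radius {radius:.1f}")
```

Because the radius grows with the square root of the count, a month with four times the deaths gets a wedge that reaches only twice as far from the center, but covers four times the area.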

Although William Farr was an advisor to Florence and involved in building the coxcomb plots, he wasn't a fan of data visualization. He advised her that 'statistics should be as dry as possible' [Bostridge]. But Florence's aim was influencing the public, not a stone-cold presentation of data. In the introduction, I said there were lessons that modern analysts could learn from Florence, and this is the key one: you have to communicate your results clearly to a general audience to influence opinion and effect change.

The lessons from Florence's analysis are very clear: the men in the British Army were dying through poor treatment. They were dying at home, and dying after battle. The disaster in the Crimea was avoidable.

The Commission had far-reaching effects, specifically, the radical restructuring of the British Army's healthcare system, including the construction of a new army hospital. Florence had firm views on hospital design, which the new hospital didn't meet. Unfortunately, by the time she was involved in the project, it was too late to change some of the design basics, but she did manage to make it less bad. Radical reform doesn't happen overnight, and that was the case here. 

Florence's friend, Lord Herbert, carried out a series of reforms over many years. Unfortunately, he died in 1861. Two years later, Florence published a monograph in his honor, "Army Sanitary Administration, and Its Reform under the Late Lord Herbert", which included more charts and data [McDonald]. As before, Florence's goal was communication, but this time communicating the impact her friend and collaborator had on saving lives.

Florence was famous by the 1860s, famous enough to have an early photograph taken.


Florence and nursing

Quite rightly, Florence is considered one of the founding figures of modern nursing. She wrote a short book (75 pages), called "Notes on nursing: what it is and what it is not", which was by far her most widely read publication and stayed in print for a long time. In 1860, St Thomas' Hospital in London opened a nursing school with Florence as an advisor; this was the "Nightingale Training School for Nurses", which was to set the standard for nursing education.

Florence and public health

The illness she picked up in the Crimea prevented her from traveling but didn't prevent her from absorbing data and influencing public health. In 1859, she took part in a Royal Commission, the "Royal Commission on the Sanitary State of the Army in India", which aimed to do for the British Army in India what the previous Royal Commission did for the Army in Britain. Sadly, the story was the same as in the Crimea: poor health leading to premature death. Once again, Florence illustrated her work with visualizations and statistics.

This report is notable for another type of visualization: woodcut drawings. Royal Commission reports are known to be dull, worthy affairs, but Florence wanted her work to be read and she knew she had to reach a wider audience (the same lesson about communicating effectively to create change). Her relative, Hilary Bonham Carter, drew the woodcuts she included in her report. The Treasury balked at the printing costs and wanted the report without the woodcuts, but Florence knew that some people would only read the report for the woodcuts, so she insisted they be included. Her decision was the right one; by communicating clearly, she was more effective in winning reforms.

(Image source: Wikimedia Commons)

Sadly, as a woman, Florence couldn't formally be part of the Commission, despite her huge input.

To use statistics to understand what's going on requires agreement and consistency in data collection. If different authorities record illnesses differently, then there can be no comparison and no change. Florence realized the need for consistent definitions of disease and proposed a classification scheme that was endorsed by the International Statistical Congress, held in London in 1860 [Magnello]. Sadly, only a few hospitals adopted her scheme and an opportunity to improve healthcare through data was lost.

Hospital design 

In 1859, Florence's writings on hospital design were consolidated into a book 'Notes on Hospitals' which led her to become the leading authority on hospital design.  Many British cities asked her to consult on their proposed hospital-building programs, as did the Government of India, the Queen of Holland, and the King of Portugal.

Decline and death

She never enjoyed good health after the Crimea, and never again traveled far from home. In her later years, she spent her time at home with her cats, occasionally doling out nursing or public health advice. In her last few years, her mental acuity fell away, and she retreated from public life. She died in 1910, aged 90.

(Florence shortly before her death in 1910. Lizzie Caswall Smith. Source: Wikimedia Commons.)

Florence as a Victorian

Florence was very much a product of her time and her class; she wasn't a feminist icon and she wasn't an advocate for the working classes - in many ways, she was the reverse [Stanley]. I've read some quotes from her which are quite shocking to modern ears [Bostridge]. However, I'm with the historians here: we have to understand people in their context and not expect them to behave in modern ways or judge them against modern standards.

Florence’s legacy

During her life, she received numerous honors, and the honors continued after her death.

The Royal Statistical Society was founded in 1834 as the Statistical Society of London, and Florence became its first female member in 1858 and was elected a Fellow in 1859. The American Statistical Association gave her honorary membership in 1874.

The Queen’s head appears on all British banknotes, but on the other side, there’s usually someone of historical note. On the £10 note, from 1975 to 1992, it was Florence Nightingale, the first woman to be featured on a banknote [BofE].

(UK £10 note)

For a very long time, many British hospitals have had a Nightingale ward. Things went a step further in response to the coronavirus pandemic; the British Army turned large conference centers into emergency hospitals for the infected, for example, the ExCel Center in London was turned into a hospital in nine days. Other large conference venues in the UK were also converted. The name of these hospitals? Nightingale Hospitals.

Her legend and what it says about society

Florence Nightingale is a revered figure in nursing, and rightly so, but her fame in the UK extends beyond the medical world to the general population. She’s known as the founder of nursing, and the story of the “lady with the lamp” still resonates. But less well-known is her analysis work on soldiers’ deaths during the war, her work on hospital design, and her role in improving public health. She probably saved more lives with her work after Crimea than she did during the Crimean War. Outside of the data analytics world, her ground-breaking visualizations are largely unknown. In my view, there’s definitely gender stereotyping going on; it’s fine for a woman to be a caring nurse, but not fine for her to be a pioneering public health analyst. Who society chooses as its heroes is very telling, but what society chooses to celebrate about them is even more telling.

The takeaways for analysts

I've read a lot on Florence's coxcomb charts, but less on her use of tables, and even less on her use of woodcut illustrations. The discussions mostly miss the point: Florence used these devices as a way of communicating a clear message to a wide audience, and her message was all about the need for change. The diagrams weren't the goal; they were a means to an end. She spent a lot of time thinking about how to present data meaningfully, a lesson modern analysts should take to heart.

References

[BofE] https://www.bankofengland.co.uk/museum/noteworthy-women/historical-women-on-banknotes
[Bostridge] Mark Bostridge, “Florence Nightingale The Making Of An Icon”, Farrar, Straus, and Giroux, New York, 2008
[Britannica] https://www.britannica.com/event/Crimean-War
[Cohen] I Bernard Cohen, "Florence Nightingale", Scientific American, 250(3):128-137, March 1984 
[Kopf] Edwin Kopf, "Florence Nightingale as Statistician", Publications of the American Statistical Association, Vol. 15, No. 116 (Dec., 1916), pp. 388-404
[Globalhandwashing] https://globalhandwashing.org/about-handwashing/history-of-handwashing/
[Huxley] Elspeth Huxley, “Florence Nightingale”, G.P. Putnam’s Sons, New York, 1975
[Magnello] https://plus.maths.org/content/florence-nightingale-compassionate-statistician 
[McDonald] https://rss.onlinelibrary.wiley.com/doi/10.1111/1740-9713.01374
[Nightingale 1851] Florence Nightingale, “The institution of Kaiserswerth on the Rhine, for the practical training of deaconesses”, 1851
[Stanley] David Stanley, Amanda Sherratt, "Lamp light on leadership: clinical leadership and Florence Nightingale", Journal of Nursing Management, 18, 115–121, 2010

Tuesday, March 24, 2020

John Snow, cholera, and the origins of data science

The John Snow story is so well known, it borders on the cliched, but I discovered some twists and turns I hadn't known that shed new light on what happened and on how to interpret Snow's results. Snow's story isn't just a foundational story for epidemiology, it's a foundational story for data science too.


(Image credit: Cholera bacteria, CDC; Broad Street pump, Betsy Weber; John Snow, Wikipedia)

To very briefly summarize: John Snow was a nineteenth-century doctor with an interest in epidemiology and cholera. When cholera hit London in 1854, he played a pivotal role in understanding cholera in two quite different ways, both of which are early examples of data science practices.

The first way was his use of registry data recording the number of cholera deaths by London district. Snow was able to link the prevalence of deaths to the water company that supplied water to each district. The Southwark & Vauxhall water company sourced their water from a relatively polluted part of the river Thames, while the Lambeth water company took their water from a relatively unpolluted part of the Thames. As it turned out, there was a clear relationship between drinking water source and cholera deaths, with polluted water leading to more deaths.

This wasn't a randomized controlled trial, but was instead an early form of difference-in-difference analysis. Difference-in-difference analysis was popularized by Card and Krueger in the mid-1990s and is now widely used in econometrics and other disciplines. Notably, there are many difference-in-difference tutorials that use Snow's data set to teach the method.
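For readers who haven't met the method, the core calculation is short: compare the change over time in a group whose conditions changed with the change in a group whose conditions stayed the same. Here's a minimal sketch in Python; the death rates are hypothetical placeholders (not Snow's figures), and the function name is mine.

```python
def difference_in_difference(treated_before, treated_after,
                             control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical cholera deaths per 10,000 people, for illustration only.
# "Treated" = a group whose water source changed between two periods;
# "control" = a group whose water source stayed the same.
effect = difference_in_difference(
    treated_before=130, treated_after=85,
    control_before=135, control_after=147,
)
print(f"Estimated effect: {effect} deaths per 10,000")
```

Subtracting the control group's change is what removes the background trend: if deaths rose everywhere between the two periods, a simple before-and-after comparison of the treated group alone would understate the benefit.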

I've reproduced one of Snow's key tables below; the most important piece is the summary at the bottom comparing deaths from cholera by water supply company. You can see the attraction of this dataset for data scientists: it's calling out for the use of groupby.
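As a rough sketch of the grouping step, here's the idea in plain Python with made-up rows; the district names and death counts are placeholders, not Snow's data. In pandas, the equivalent would be a groupby on the supplier column followed by a sum.

```python
from collections import defaultdict

# Hypothetical rows: (sub-district, water supplier, cholera deaths).
rows = [
    ("District A", "Southwark & Vauxhall", 120),
    ("District B", "Southwark & Vauxhall", 95),
    ("District C", "Lambeth", 12),
    ("District D", "Lambeth", 9),
]

# Group by supplier and total the deaths in each group.
deaths_by_supplier = defaultdict(int)
for _district, supplier, deaths in rows:
    deaths_by_supplier[supplier] += deaths

for supplier, total in deaths_by_supplier.items():
    print(f"{supplier}: {total}")
```

A fair comparison would also need the number of houses or people each company supplied, so that totals can be turned into rates.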

The second way is a more dramatic tale and guaranteed his continuing fame. In 1854, there was an outbreak of cholera in the Golden Square part of Soho in London. Right from the start, Snow suspected the water pump at Broad Street was the source of the infection. Snow conducted door-to-door inquiries, asking what people ate and drank. He was able to establish that people who drank water from the pump died at a much higher rate than those that did not. The authorities were desperate to stop the infection, and despite the controversial nature of Snow's work, they listened and took action; famously, they removed the pump handle and the cholera outbreak stopped.

Snow continued his analysis after the pump handle was removed and wrote up his results (along with the district study I mentioned above) in a book published in 1855. In the second edition of his book, he included his famous map, which became an iconic data visualization for data science. 

Snow knew where the water pumps were and knew where deaths had occurred. He merged this data into a map-bar chart combination; he started with a street map of the Soho area and placed a bar for each death that occurred at an address. His map showed a concentration of deaths near the Broad Street pump.

I've reproduced a section of his map below. I've highlighted the Broad Street pump in red and you can see a high concentration of deaths nearby. Two properties suffered few deaths despite being near the pump: the workhouse and the brewery. I've highlighted the workhouse in green. Despite housing a large number of people, few died. The workhouse had its own water supply, entirely separate from the Broad Street pump. The brewery (highlighted in yellow) had no deaths either; they supplied their workers with free beer (made from boiled water).


(Source: adapted from Wikipedia)

I've been fascinated with this story for a while now, and recent events caused me to take a closer look. There's a tremendous amount of this story that I've left out, including:

  • The cholera bacteria and the history of cholera infections.
  • The state of medical knowledge at the time and how the prevailing theory blocked progress on preventing and treating cholera.
  • The intellectual backlash against John Snow.
  • The 21st century controversy surrounding the John Snow pub.

I've written up the full story in a longer article you can get from my website. Here's a link to my longer article.


Tuesday, March 3, 2020

Cheating charts: the axes of evil

As you might have guessed from the title, this post is all about how you can play around with chart axes to lie like truth. It's about being evil with axes.

In the Harry Potter books, the children are taught 'Defence Against the Dark Arts' not to teach them how to be evil, but rather to teach them how to defend against evil. I'm using the same approach here; this blog post is about defending yourself against being misleading or being misled. I'm going to show you ways that people have used chart axes to obscure the truth. But we need to be careful with blame; sometimes charts are unintentionally deceitful because the author miscommunicated rather than set out to misinform, and sometimes it's a matter of opinion. Read what I have to say and decide for yourself.


(2x2 matrix - an example of evil axes)

Zero axis

In most cases, charts should include zero so as not to mislead about the size of an effect. Let's take house prices in London as our example. UK inflation (CPI) was 1.8% for the twelve months from January 2019 to January 2020; over the same period, London house prices increased 2.8% - not a bad increase, but we can make it look much larger.

Let's start with an honest chart.

It clearly shows a small increase, but it would be hard to get a newspaper headline from it. Imagine you were a newspaper editor and you needed to squeeze a sensationalist story from the data. You need to make the difference appear much bigger, but still have a fig leaf of decency. How can you do it? The simplest way is excluding zero and zooming in.

Imagine that we coupled it with a headline like, 'London Property Market Booms' and had an article with examples of extremely expensive houses and some anecdotes of house buying. If you just glanced at the chart and read the story, you might think the market was growing explosively. This trick works even better if you make the axis text small, reduce its contrast with the background color, or even remove it altogether.
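To put a rough number on the zoom trick, here's a minimal sketch that compares how tall the end point is drawn relative to the start point for a given axis minimum. The index values (start = 100, end = 102.8) simply restate the 2.8% rise; the axis minimum of 99 and the function name are my own hypothetical choices, not taken from any real chart.

```python
def apparent_growth(start, end, axis_min=0.0):
    """Ratio of the drawn heights of two values when the y-axis starts at axis_min."""
    if axis_min >= min(start, end):
        raise ValueError("axis_min must be below both values")
    return (end - axis_min) / (start - axis_min)


# A 2.8% rise expressed as an index. The axis minimum of 99 is hypothetical.
start, end = 100.0, 102.8
honest = apparent_growth(start, end)               # axis starts at zero
zoomed = apparent_growth(start, end, axis_min=99)  # axis starts at 99

print(f"zero-based axis: end point drawn {honest:.3f}x as high as the start")
print(f"axis from 99:    end point drawn {zoomed:.1f}x as high as the start")
```

The data hasn't changed, only the baseline, yet the drawn height ratio goes from just over 1 to nearly 4.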

If you're trying to be honest, most of the time, you should include zeros to truly scale the effect and not mislead. But there are exceptions. Sometimes you do want to exclude zero as in the example below.

I have some data on human body temperature over the course of a day, taken from Wikipedia. Here's a chart including zero (as in 0 centigrade).

There really doesn't seem to be much variation, does there? It looks like the human body temperature stays more or less constant during the day. In fact, the data looks just like noise. I could flatten the chart further by using kelvin or even showing a Fahrenheit scale starting from zero.

When we zoom in and exclude zero, a clearer picture emerges.

Plainly, human body temperature does change during the day. Given that a few degrees difference in body temperature can make the difference between someone who's fine and someone who's in medical danger, the second chart is a better, more honest, and more useful representation.
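The effect of unit choice on a zero-based axis can be sketched in a few lines. The two readings below are hypothetical low and high temperatures for one day (not the Wikipedia data), and the helper name is mine; the point is only to show how much of the axis height the daily variation occupies in each unit.

```python
def share_of_axis(values):
    """Fraction of a zero-based axis taken up by the spread of the values.

    Assumes all values are positive.
    """
    return (max(values) - min(values)) / max(values)


# Hypothetical low and high readings for one day, in degrees Celsius.
celsius = [36.4, 37.4]
kelvin = [c + 273.15 for c in celsius]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

for name, values in [("Celsius", celsius), ("Fahrenheit", fahrenheit), ("Kelvin", kelvin)]:
    print(f"{name.ljust(11)}{share_of_axis(values):.2%} of the axis height")
```

The same one-degree swing takes up the least room on a kelvin axis, which is why it flattens the chart the most.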

If you want to cheat and misrepresent, here's what you should do:

  • If you want to exaggerate a small difference, don't include zero and zoom into your chart to expand the difference.
  • If you want to suppress a difference, include zero and choose units that minimize the difference.

If you want to be honest:

  • Include zero by default.
  • Exclude zero when you're looking at small changes and those changes matter, so the chart can focus on the change.

Extending the axis

This is a really fun way to mislead people and it's something I've only seen recently. You can extend the perception of the axis to reduce the effect. Let's use the same election example I used in my blog post on pie (lie) charts. Imagine there are four parties standing in an election and you have a record of what percentage of the vote each candidate and party received. Here's an honest bar chart showing the results.

Plainly, the Bird party did very badly (15% of the vote). Now let's see if we can minimize the scale of their defeat by redrawing the chart in a deceitful way. Let's remove the x-axis, extend the y-axis labels, color and box the labels, and introduce some bar coloring.

It's still obviously a defeat, but we've made it look much smaller. If you take the time to look, it's obvious that something funky is going on here, but most people don't have time and don't look closely.

If you want to be honest, don't play around with axis labels and colors.

Unequal steps

If you want to imply things are getting worse, or better, when they're not, then a good option is to use unequal axis scaling. Most viewers expect that an axis will scale consistently; for example, an axis might be labeled 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. More sophisticated viewers might be very comfortable with log scales, for example, axis labels 1, 10, 100, 1000. Almost no one can interpret what unequal scaling means, which makes it great for evil. To make your deception even better, use a line chart (which implies continuity) rather than a bar chart (which implies category).
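One way to defend yourself is to check whether the tick values on an axis follow any consistent rule. Here's a small sketch, under my own assumptions about what counts as consistent: equal differences mean a linear axis, equal ratios mean a log axis, and anything else is flagged as unequal. The function name and the example day counts are hypothetical.

```python
import math


def tick_spacing(ticks):
    """Classify axis ticks as 'linear', 'log', or 'unequal'."""
    if len(ticks) < 3:
        raise ValueError("need at least three ticks to judge spacing")
    steps = [b - a for a, b in zip(ticks, ticks[1:])]
    if all(math.isclose(s, steps[0]) for s in steps):
        return "linear"
    if all(t > 0 for t in ticks):
        ratios = [b / a for a, b in zip(ticks, ticks[1:])]
        if all(math.isclose(r, ratios[0]) for r in ratios):
            return "log"
    return "unequal"


print(tick_spacing([1, 2, 3, 4, 5]))
print(tick_spacing([1, 10, 100, 1000]))
# Hypothetical time axis in days ago: yesterday, last week, last month, last year.
print(tick_spacing([1, 7, 30, 365]))
```

If an axis comes back as unequal but the ticks are drawn evenly spaced, the slope of any line on the chart tells you very little.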

Let's take an example that appeared in the media, US gas prices in 2012. The AAA produces a daily set of gas price data. This has today's price, yesterday's price, last week's price, last month's price, and last year's price. It's not the greatest presentation of data and it's hard to pick out trends, but at least the data exists - and more importantly, they don't chart it. In 2012, a US media outlet (who shall remain nameless) took the data and ran a story on gas price increases under Obama. Here's my version of their chart.


At a quick glance, it looks like there was a massive increase. But was there? The periods on the x-axis aren't equal and they've used a line to indicate a continuous variable. The AAA data quotes last month's number, but that isn't shown here. Why? The y-axis starts at $2.80, which is an odd choice; more rational choices might have been $3.00 or $0. If you take the time to look at the chart, it's really hard to draw any conclusions, but most people don't have the time and will just conclude 'gas prices up under Obama'.

If you really want to mislead, use unequal scaling and a line chart.

Scale inversion

If you really, really want to mislead, choose a scale inversion. 

I'm going to show you one of the most controversial charts of the last ten years. The author has vigorously defended their work, and after reading their comments, I understand that they had no intention to deceive. Because I don't wish to make the author's life more difficult, I'm not going to name them or give you their employer's name.

The chart below shows homicides in Florida and what happened when the 'Stand Your Ground' Law was enacted. Before reading on, how would you interpret the chart?


Almost everyone I've spoken to interprets the chart as implying that homicides went down. But look at the y axis. It's inverted. Here's how the plot would look if the author had chosen normal scaling.


This conveys a hugely different message.

The author wasn't trying to mislead here; rather, they were trying to use art to make a more emotionally informative representation of the data. You can judge for yourself whether they succeeded or not. This raises the more general topic of who is visualizing data and how it's done.

In the last few years, there's been a tremendous rise in the use of infographics for all kinds of topics. These tend to be more poster art than information sharing, which leads us to a problem. In the information world, a large number of informal practices have grown up around how to display data in a truthful way. Infographics are sometimes created by people familiar with these practices, but sometimes not. When designers start using artistic interpretation to make data more impactful, we can get distortions that unintentionally mislead people. Personally, I think infographics are little more than visual fluff.

Getting back to where I started in this section, scale inversion is a wonderful way of reversing the evidence.

Log plots

This isn't so much deceit as obfuscation or confusion.  

A logarithmic scale is one that varies logarithmically, so instead of an axis increasing like 1, 2, 3, 4, 5, it increases like 1, 10, 100, 1000, 10000. Logarithmic scales are used when data varies by orders of magnitude.
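To see why log axes trip people up, it helps to compute where a value actually sits along one. This is a minimal sketch using base-10 logs on an axis running from 1 to 10,000; the function name and the example values are my own choices.

```python
import math


def log_axis_position(value, axis_min, axis_max):
    """Position of value along a log axis, from 0.0 (axis_min) to 1.0 (axis_max)."""
    if min(value, axis_min, axis_max) <= 0:
        raise ValueError("log axes only work for positive values")
    span = math.log10(axis_max) - math.log10(axis_min)
    return (math.log10(value) - math.log10(axis_min)) / span


for value in (10, 100, 5000):
    print(value, round(log_axis_position(value, 1, 10_000), 3))
```

On this axis, 100 sits at the halfway point, while 5,000 (roughly the midpoint on a linear axis) is drawn more than 90% of the way to the top. Equal distances mean equal multiples, not equal amounts.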

Unfortunately, many viewers aren't familiar with the idea and it can be hard to interpret, a good example being the recent coronavirus chart in a New York Times article. Here's the chart:



(Image credit: New York Times, copyright New York Times)

The logarithmic axis is the y axis. What conclusions would you draw about the coronavirus from this chart? I've used log plots for years and I struggled to understand what this chart means. 

2x2 charts

2x2 charts are a special case of confusion with axes. Unfortunately, they're beloved of MBA courses and books on management and marketing. Let's take the classic BCG product matrix as an example. In the 1960s, the consulting company BCG came up with a way for companies to view their product portfolio and make more rational product investment decisions. They recommended plotting market share on the x-axis and growth on the y-axis, and dividing the plot into four quadrants, each with a name; you can read more about it here. Here's a representation of their matrix.

Note that although the axes are marked, there's no scale and it's not clear where the quadrant lines are drawn. In practice, companies using this methodology may well draw scales, but in almost all cases you find on the internet, there are no scales.

The BCG matrix is just one of a large number of 2x2 matrices you can find out there. Very few of them have any kind of scale, so it's very hard to understand and interpret what they mean in practice. Bear in mind that they often imply quite different management choices for different chart quadrants, but who's in what quadrant may depend on exactly where the quadrant boundaries are drawn, and that's almost never made clear. It's really tempting to say that you need to employ consultants to tell you what they mean and to interpret the charts for you.
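Here's a small sketch of how sensitive the quadrant label is to where the lines are drawn. The product's share and growth figures and both sets of cut-off values are hypothetical; only the four quadrant names come from the usual presentation of the BCG matrix.

```python
def bcg_quadrant(share, growth, share_cut, growth_cut):
    """Name the BCG quadrant for a product, given where the boundaries are drawn."""
    high_share = share >= share_cut
    high_growth = growth >= growth_cut
    if high_share and high_growth:
        return "Star"
    if high_share:
        return "Cash cow"
    if high_growth:
        return "Question mark"
    return "Dog"


# One hypothetical product: 45% market share, 8% market growth.
share, growth = 0.45, 0.08

# Two equally plausible-looking (and equally hypothetical) sets of boundaries.
print(bcg_quadrant(share, growth, share_cut=0.50, growth_cut=0.10))
print(bcg_quadrant(share, growth, share_cut=0.40, growth_cut=0.05))
```

The same product moves from one corner of the matrix to the opposite corner without any change in the underlying numbers, which is why an unlabeled boundary makes the chart so hard to act on.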

I'm not a fan of 2x2 matrices because I find that they confuse rather than enlighten, but if you want to produce a chart that looks pretty and requires you to interpret it for your management, a 2x2 matrix might well be the place to go.

You can fool all the people some of the time and some of the people all the time

If you know what you're looking for, you can see through deceit or malpractice with some effort. But if you're in a hurry, not paying attention, or a chart is flashed on the screen for a short period of time, a chart with evil axes will probably slip by your defenses against the dark arts.

In many ways, playing around with chart axes is one of the easiest ways to mislead people. I've shown you how people have been evil with axes in the hope that you'll be truthful and honest in your own visualizations.

I'd love to hear what you think about the 'axes of evil'. Have you come across other axis manipulations that I haven't included here?

Thursday, February 27, 2020

Pie charts are lie charts

There are lots of chart types, but if you want to lie or mislead people, the best chart to use is the pie chart. I’m going to show you how to distort reality with pie charts, not so you can be a liar, but so you know never to use pie charts and to choose more honest visualizations.

Let's start with the one positive thing I know about pie charts: they're called camembert charts in France and cake charts in Germany. On balance, I prefer the French term, but we're probably stuck with the English term. Unlike camembert, pie charts often leave a bad taste in my mouth and I'll show you why.


(Camembert cheese - image credit: Coyau, Wikipedia - license : Creative Commons)

Take a look at the pie chart below. Can you put the six slices in order from largest to smallest? What percentages do you think the slices represent?



Here’s how I’ve misled you:

  • Offset the slices from the 12 o’clock position to make size comparison harder. I've robbed you of the convenient 'clock face' frame of reference.
  • Not put the slices in order (largest to smallest). Humans are bad at judging the relative sizes of areas and by playing with the order, I'm making it even harder.
  • Not labeled the slices. This ought to be standard practice, but shockingly often isn't.

The actual percentages are:

  • Gray: 20.9%
  • Green: 17.5%
  • Light blue: 16.8%
  • Dark blue: 16.1%
  • Yellow: 15.4%
  • Orange: 13.3%

How close were you? How good was my attempt to deceive you?

Let’s use a bar chart to represent the same data.



Simple, clear, unambiguous.
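If you want to try this yourself without a charting library, a rough text version of the sorted bar chart takes only a few lines. The percentages are the ones listed above; the helper function and the bar width are my own assumptions.

```python
# Slice percentages from the pie chart example.
slices = {
    "Gray": 20.9,
    "Green": 17.5,
    "Light blue": 16.8,
    "Dark blue": 16.1,
    "Yellow": 15.4,
    "Orange": 13.3,
}


def text_bars(data, width=40):
    """Return one text bar per item, largest first, scaled to the largest value."""
    largest = max(data.values())
    lines = []
    for name, value in sorted(data.items(), key=lambda item: item[1], reverse=True):
        bar = "#" * round(width * value / largest)
        lines.append(f"{name.ljust(11)}{bar} {value}")
    return lines


print("\n".join(text_bars(slices)))
```

Sorting, a common baseline, and a label on every bar do the work that the pie chart made you do in your head.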

I've read guidance that suggests you should only use a pie chart if you're showing two quantities that are obviously unequal. This gives the so-called Pac-Man pie charts. Even here, I think there are better representations, and our old friend the bar chart would work better (albeit less interestingly).


Now let’s look at the king of deceptive practices, the 3D pie chart. This one is great because you can thoroughly mislead while still claiming to be honest. I’m going to work through a short deceptive example.

Let’s imagine there are four political parties standing in an election. The percentage results are below.

  • Dog: 36%
  • Cat: 28%
  • Mouse: 21%
  • Bird: 15%

You work for Bird, which unfortunately got the lowest share of the vote. Your job is to deceive the electorate into thinking Bird did much better than they did.

You can obscure the result by showing it as a pie chart without number labels. You can even mute the opposition colors to fool the eye. But you can go one better. You can create a 3D pie chart with shifted perspective and 'point explosion' using the data I gave above like so.

Here's what I did to create the chart:

  • Took the data above as my starting point and created a pie chart.
  • Rotated the chart so my slice was at the bottom.
  • Made the pie chart 3D.
  • Changed the perspective to emphasize my party.
  • Used 'point explosion' to pull my slice out of the main body of the chart to emphasize it.
  • Used shading.

This now makes it look like Bird was a serious contender in the election. The fraction of the chart area taken up with the Bird party’s color is completely disproportionate to their voter share. But you can claim honesty because the slice would still be the correct proportion if the chart were viewed from above. If challenged, you can turn it into a technical/academic debate about data visualization that will turn off most people and make your opponents sound like they’re nit-picking.

You don’t have to go this far to mislead with a pie chart. All you have to do is increase the cognitive burden to interpret a chart. Some, maybe even all, of your audience might not spot what you’re trying to hide because they’re in a hurry. You can mislead some of your audience all of the time.

I want to be clear, I'm telling you about these deceptive practices so you can avoid them. There are good reasons why honest analysts don’t use pie charts. In fact, I would go one stage further; if you see a pie chart, be on your guard against dishonesty. As one of my colleagues used to say, ‘friends don’t let friends use pie charts’.

Tuesday, January 28, 2020

Future directions for Python visualization software

The Python charting ecosystem is highly fragmented and still lags behind R; it also lacks some of the features of paid-for BI tools like Tableau or Qlik. However, things are slowly changing and the situation may be much better in a few years' time.



Theoretically, the ‘grammar of graphics’ approach has been a substantial influence on visualization software. The concept was introduced in 1999 by Leland Wilkinson in a landmark book and gained widespread attention through Hadley Wickham’s development of ggplot2. The core idea is that a visualization can be represented as different layers within a framework, with rules governing the relationship between layers.

Bokeh was influenced by the 'grammar of graphics' concept as were other Python charting libraries. The Vega project seeks to take the idea of the grammar of graphics further and creates a grammar to specify visualizations independent of the visualization backend module. Building on Vega, the Altair project is a visualization library that offers a different approach from Bokeh to build charts. It’s clear that the grammar of graphics approach has become central to Python charting software.

If the legion of charting libraries is a negative, the fact that they are (mostly) built on the same ideas offers some hope for the future. There’s a movement toward convergence by providing an abstraction layer above the individual libraries like Bokeh or Matplotlib. In the Python world, there’s a precedent for this; the database API provides an abstraction layer above the various Python database libraries. Currently, the Panel project and HoloViews are offering abstraction layers for visualization, though there are discussions of a more unified approach.
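To make the database precedent concrete, here's a minimal sketch using the standard library's sqlite3 module, which follows the Python database API (PEP 249). The 'charts' table and the chart_names helper are hypothetical; the idea is that the helper only relies on connection and cursor calls the specification defines, so in principle it could be handed a connection from a different compliant driver.

```python
import sqlite3


def chart_names(connection):
    """Read chart names using only calls defined by the DB-API (PEP 249)."""
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT name FROM charts ORDER BY name")
        return [row[0] for row in cursor.fetchall()]
    finally:
        cursor.close()


# Set up a throwaway in-memory database. The 'charts' table is hypothetical.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE charts (name TEXT)")
# The '?' placeholder style is what sqlite3 uses; other drivers may differ.
cursor.executemany("INSERT INTO charts (name) VALUES (?)", [("line",), ("bar",)])
connection.commit()
cursor.close()

names = chart_names(connection)
print(names)
connection.close()
```

The abstraction isn't perfect (placeholder styles and SQL dialects vary between drivers), which is a fair preview of what a charting abstraction layer would have to contend with.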

My take is that the Python world is suffering from a confusing array of charting library choices, which splits the available open-source development effort across too many projects and, of course, confuses users. The effort to provide higher-level abstractions is a good idea and will probably result in fewer underlying charting libraries; however, stable and reliable abstraction libraries are probably a few years off. If you have to produce results today, you’re left with choosing a library now.

The big gap between Python and BI tools like Tableau and Qlik is the ease of deployment and speed of development. BI tools reduce the skill level needed to build apps, deploy them to servers, and manage tasks like access control. Projects like HoloViews may evolve to make chart building easier, but there are still no good, easy, and automated deployment solutions. However, some of the component parts for easier deployment exist, for example, Docker, and it’s not hard to imagine the open-source community moving its attention to deployment and management once the various widget and charting issues of visualization have been solved.

Will the Python ecosystem evolve to be as good as R’s and be good enough to take on BI tools? Probably, but not for a few years. In my view, this evolution will happen slowly and in public (e.g. talks at PyCon, SciPy etc.). The good news for developers is, there will be plenty of time to adapt to these changes.

Saturday, January 25, 2020

How to lie with statistics

I recently re-read Darrell Huff's classic text from 1954, 'How to lie with statistics'. In case you haven't read it, the book takes a number of deceitful statistical tricks of the trade and explains how they work and how to defend yourself from being hoodwinked. My overwhelming thought was 'plus ça change'; the more things change, the more they remain the same. The statistical tricks people used to mislead more than 60 years ago are still being used today.



(Image credit: Wikipedia)

Huff discusses surveys and how very common methodology flaws can produce completely misleading results. His discussion of sampling methodologies and the problems with them is clear and, unfortunately, still relevant. Making your sample representative is a perennial problem, as the polling for the 2016 Presidential election showed. Years ago, I was a market researcher conducting interviews on the street and Huff's bias comments rang very true with me - I faced these problems on a daily basis. In my experience, even people with a very good statistical education aren't aware of survey flaws and sources of bias.

The chapter on averages still holds up. Huff shows how the mean can be distorted and why the median might be a better choice. I've interviewed people with Master's degrees in statistics who couldn't explain why the median might be a better choice of average than the mean, so I guess there's still a need for the lesson.
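A quick way to see the point is to compute both averages on skewed data. The salary figures below are hypothetical (in thousands) and chosen so that one very high earner distorts the mean; the functions come from Python's standard statistics module.

```python
import statistics

# Hypothetical salaries in thousands; one very high earner skews the mean.
salaries = [30, 32, 35, 38, 40, 400]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"mean: {mean_salary:.1f}, median: {median_salary:.1f}")
```

Five of the six people earn less than half the mean, so quoting the mean as the 'average salary' would mislead; the median sits in the middle of what most people actually earn.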

One area where I think things have moved in the right direction is the decreasing use of some types of misleading charts. Huff discusses the use of images to convey quantitative information. He shows a chart where steel production was represented by images of a blast furnace (see below). The increase in production was 50%, but because the height and width were both increased, the area consumed by the images increases by 150%, giving the overall impression of a 150% increase in production [1]. I used to see a lot of these types of image-based charts, but their use has declined over the years. It would be nice to think Huff had some effect.



(Image credit: How to lie with statistics)
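The general arithmetic behind this kind of distortion is simple: when you scale both the height and the width of an image, the area grows by the product of the two factors. The sketch below is a generic illustration with my own function names, not a reconstruction of Huff's drawing; it shows how much each side should grow if you want the area to track a 50% increase honestly.

```python
import math


def area_factor(height_factor, width_factor):
    """How much an image's area grows when its height and width are scaled."""
    return height_factor * width_factor


def honest_side_factor(quantity_factor):
    """Scale for each side so the image's area grows by quantity_factor."""
    return math.sqrt(quantity_factor)


growth = 1.5  # a 50% increase in the quantity being shown
side = honest_side_factor(growth)

print(f"scale each side by {side:.3f} and the area grows {area_factor(side, side):.2f}x")
print(f"scale each side by {growth} and the area grows {area_factor(growth, growth):.2f}x")
```

Even the honest version relies on readers judging areas well, which they generally don't, so a plain bar is usually the safer choice.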

Staying with charts, his discussion about selecting axis ranges to mislead still holds true and there are numerous examples of people using this technique to mislead every day. I might write a blog post about this at some point.

He has chapters on the post hoc fallacy (confusing correlation and causation) and has a nice explanation of how percentages are regularly mishandled. His discussion of general statistical deceitfulness is clear and still relevant.

Unfortunately, the book hasn't aged very well in other aspects. 2020 readers will find his language sexist, the jokey drawings of a smoking baby are jarring, and his roundabout discussion of the Kinsey Reports feels odd. Even the writing style is out of date.

Huff himself is tainted; he was funded by the tobacco industry to speak out against smoking as a cause of cancer. He even wrote a follow-up book, 'How to lie with smoking statistics', to debunk anti-smoking data. Unfortunately, his source of authority was the widespread success of 'How to lie with statistics'. 'How to lie with smoking statistics' isn't available commercially anymore, but you can read about it on Alex Reinhart's page.

Despite all its flaws, I recommend you read this book. It's a quick read and it'll give you a grounding in many of the problems of statistical analysis. If you're a business person, I strongly recommend it - its lessons about cautiously interpreting analysis still hold.

This is a flawed book by a flawed author but it still has a lot of value. I couldn't help thinking that the time is probably right for a new popular book on how people are lying and misleading you using charts and statistics.

Correction

[1] Colin Warwick pointed out an error in my original text. My original text stated the height and width of the second chart increased by 50%. That's not quite what Huff said. I've corrected my post.