To very briefly summarize: John Snow was a nineteenth-century doctor with an interest in epidemiology and cholera. When cholera hit London in 1854, he played a pivotal role in understanding cholera in two quite different ways, both of which are early examples of data science practices.
The first way was his use of registry data recording the number of cholera deaths by London district. Snow was able to link the prevalence of deaths to the water company that supplied water to each district. The Southwark & Vauxhall water company sourced their water from a relatively polluted part of the river Thames, while the Lambeth water company took their water from a relatively unpolluted part of the Thames. As it turned out, there was a clear relationship between drinking water source and cholera deaths, with polluted water leading to more deaths. This wasn't a randomized control trial, but was instead an early form of difference-in-difference analysis. Difference-in-difference analysis was popularized by Card and Krueger in the mid-1990's and is now widely used in econometrics and other disciplines. Notably, there are many difference-in-difference tutorials that use Snow's data set to teach the method. I've reproduced one of Snow's key tables below, the most important piece is the summary at the bottom comparing deaths from cholera by water supply company. You can see the attraction of this dataset for data scientists, it's calling out for the use of group by.
The second way is a more dramatic tale and guaranteed his continuing fame. In 1854, there was an outbreak of cholera in the Golden Square part of Soho in London. Right from the start, Snow suspected the water pump at Broad Street was the source of the infection. Snow conducted door-to-door inquiries, asking what people ate and drank. He was able to establish that people who drank water from the pump died at a much higher rate than those that did not. The authorities were desperate to stop the infection, and despite the controversial nature of Snow's work, they listened and took action; famously, they removed the pump handle and the cholera outbreak stopped.
Snow continued his analysis after the pump handle was removed and wrote up his results (along with the district study I mentioned above) in a book published in 1855. In the second edition of his book, he included his famous map, which became an iconic data visualization for data science. Snow knew where the water pumps were and knew where deaths had occurred. He merged this data into a map-bar chart combination; he started with a street map of the Soho area and placed a bar for each death that occurred at an address. His map showed a concentration of deaths near the Broad Street pump.
I've reproduced a section of his map below. The Broad Street pump I've highlighted in red and you can see a high concentration of deaths nearby. There are two properties that suffered few deaths despite being near the pump, the workhouse and the brewery. I've highlighted the workhouse in green. Despite housing a large number of people, few died. The workhouse had its own water supply, entirely separate from the Broad Street pump. The brewery (highlighted in yellow) had no deaths either; they supplied their workers with free beer (made from boiled water).
(Source: adapted from Wikipedia)
I've been fascinated with this story for a while now, and recent events caused me to take a closer look. There's a tremendous amount of this story that I've left out, including:
- The cholera bacteria and the history of cholera infections.
- The state of medical knowledge at the time and how the prevailing theory blocked progress on preventing and treating cholera.
- The intellectual backlash against John Snow.
- The 21st century controversy surrounding the John Snow pub.
I've written up the full story in a longer article you can get from my website. Here's a link to my longer article.