Tuesday, March 24, 2020

John Snow, cholera, and the origins of data science

The John Snow story is so well known, it borders on the cliched, but I discovered some twists and turns I hadn't known that shed new light on what happened and on how to interpret Snow's results. Snow's story isn't just a foundational story for epidemiology, it's a foundational story for data science too.

(Image credit: Cholera bacteria, CDC; Broad Street pump, Betsy Weber; John Snow, Wikipedia)

To very briefly summarize: John Snow was a nineteenth-century doctor with an interest in epidemiology and cholera. When cholera hit London in 1854, he played a pivotal role in understanding cholera in two quite different ways, both of which are early examples of data science practices.

The first way was his use of registry data recording the number of cholera deaths by London district. Snow was able to link the prevalence of deaths to the water company that supplied water to each district. The Southwark & Vauxhall water company sourced their water from a relatively polluted part of the river Thames, while the Lambeth water company took their water from a relatively unpolluted part of the Thames. As it turned out, there was a clear relationship between drinking water source and cholera deaths, with polluted water leading to more deaths.

This wasn't a randomized controlled trial, but was instead an early form of difference-in-differences analysis. Difference-in-differences analysis was popularized by Card and Krueger in the mid-1990s and is now widely used in econometrics and other disciplines. Notably, there are many difference-in-differences tutorials that use Snow's data set to teach the method.
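As a minimal sketch of the difference-in-differences idea: take the change in an outcome for a group exposed to some change, and subtract the change over the same period for a group that wasn't exposed. The function name and the rates below are placeholders I've made up for illustration; they are not Snow's figures.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Return the difference-in-differences estimate.

    Each argument is an outcome (e.g. deaths per 10,000 people) for one
    group in one period.
    """
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Placeholder rates (deaths per 10,000), NOT Snow's actual figures.
estimate = diff_in_diff(
    treated_before=130, treated_after=85,
    control_before=135, control_after=147,
)
print(estimate)  # -57: the treated group's rate fell by 57 more than the control's
```

Subtracting the control group's change is what removes the background trend that affected both groups.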

I've reproduced one of Snow's key tables below; the most important piece is the summary at the bottom comparing deaths from cholera by water supply company. You can see the attraction of this dataset for data scientists: it's calling out for the use of groupby.
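Here's a rough sketch of the kind of group-by summary this table invites, written with the standard library only. The district names and death counts are made-up placeholders, not values from Snow's table. If the same rows were in a Pandas DataFrame with columns I've called company and deaths, the equivalent would be df.groupby("company")["deaths"].sum().

```python
from collections import defaultdict

# Placeholder rows: (district, water company, cholera deaths).
# These values are invented for illustration, NOT taken from Snow's table.
rows = [
    ("District A", "Southwark & Vauxhall", 120),
    ("District B", "Southwark & Vauxhall", 95),
    ("District C", "Lambeth", 10),
    ("District D", "Lambeth", 7),
]

# Group by water company and sum the deaths in each group
deaths_by_company = defaultdict(int)
for _district, company, deaths in rows:
    deaths_by_company[company] += deaths

for company, total in deaths_by_company.items():
    print(f"{company}: {total}")
```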

The second way he understood cholera is a more dramatic tale and guaranteed his continuing fame. In 1854, there was an outbreak of cholera in the Golden Square part of Soho in London. Right from the start, Snow suspected the water pump at Broad Street was the source of the infection. Snow conducted door-to-door inquiries, asking what people ate and drank. He was able to establish that people who drank water from the pump died at a much higher rate than those who did not. The authorities were desperate to stop the infection, and despite the controversial nature of Snow's work, they listened and took action; famously, they removed the pump handle and the cholera outbreak stopped.

Snow continued his analysis after the pump handle was removed and wrote up his results (along with the district study I mentioned above) in a book published in 1855. In the second edition of his book, he included his famous map, which became an iconic data visualization for data science.

Snow knew where the water pumps were and knew where deaths had occurred. He merged this data into a map-bar chart combination; he started with a street map of the Soho area and placed a bar for each death that occurred at an address. His map showed a concentration of deaths near the Broad Street pump.

I've reproduced a section of his map below. The Broad Street pump I've highlighted in red and you can see a high concentration of deaths nearby. There are two properties that suffered few deaths despite being near the pump, the workhouse and the brewery. I've highlighted the workhouse in green. Despite housing a large number of people, few workhouse residents died. The workhouse had its own water supply, entirely separate from the Broad Street pump. The brewery (highlighted in yellow) had no deaths either; they supplied their workers with free beer (made from boiled water).

(Source: adapted from Wikipedia)

I've been fascinated with this story for a while now, and recent events caused me to take a closer look. There's a tremendous amount of this story that I've left out, including:

The cholera bacteria and the history of cholera infections.
The state of medical knowledge at the time and how the prevailing theory blocked progress on preventing and treating cholera.
The intellectual backlash against John Snow.
The 21st century controversy surrounding the John Snow pub.

I've written up the full story in a longer article you can get from my website. Here's a link to my longer article.

Wednesday, March 18, 2020

Contributing to open-source software

I’ve been using open-source software packages for several years and always felt like a bit of a freeloader; I took, but I never gave back. My excuse was, I didn’t have time to dig into the codebase and familiarize myself with the project's ways of working. But recently, I found easier ways to contribute, and I have been.

(Image credit: Old Book Illustrations)

The first way I found is raising bugs. I’ve pushed open-source software quite hard and found bugs in Pandas and Bokeh. Both of these projects have GitHub pages and both of them have pages to report bugs. If you’re going to report a bug, here are some rules to follow:

Make sure you’re using the most up-to-date version of the software.
Make sure your bug hasn’t been raised before.
Provide a simple example to duplicate the bug.
Follow the rules for reporting bugs - especially with regard to formatting your report, the heading you use, and any tags.

The open-source community has quite rightly been criticized for occasional toxic behavior, some of which has come from software users. I’ve seen people raise bugs and be quite forceful in their criticisms of the software they’re freely using. Ultimately, open-source software is a volunteer effort and people don’t volunteer to face some of the nastiness I’ve seen. The onus is on you to remain courteous and professional, and part of that is taking the short amount of time to follow the rules. A little kindness and consideration goes a long way.

For reference, here are some bug reports I’ve raised:

Pandas: 20027
Bokeh: 8009, 8042

The second way to contribute is by suggesting new functionality. This is a little harder because it takes more consideration to make sure what you’re suggesting is relevant and hasn’t been suggested before. Once again, I strongly advocate that you find out what the rules are for requesting new functionality. If possible, I suggest you include a mock-up of what you’re suggesting.

For reference, here are some suggestions I’ve raised:

Pandas: 21551
Bokeh: 9144

The final way of contributing is to build a project that uses open-source technology, share it via GitHub (or the alternatives), and notify the community of your project. Bokeh has a nice showcase section on its Discourse server where you can see what people have built. Seeing what others have built is a great way to get inspiration for your own projects.

For reference, here’s a showcase project I made available for Bokeh.

On the whole, I’ve been very pleased with the response of the developer communities to my meager contributions. Most of the bugs I reported have been fixed, and most of my suggestions implemented, within a few months. This contrasts with my experience with paid-for software, where there often isn’t a forum to view bugs or make suggestions.

If you’re a user of open-source software, I urge you to contribute in any way you can. We’re all in this together.

Saturday, March 14, 2020

Niche knowledge and power - knowledge hoarding

A couple of times in my career, I’ve come across people using a strategy to gain short-term power: keeping knowledge and skills to themselves, otherwise known as knowledge hoarding. Unfortunately for them, it doesn’t work in the long term anymore. I'm going to start with some examples, then suggest how you might differentiate between an area that's genuinely hard and when someone's knowledge hoarding, before finally giving you some suggestions on what to do if you find it on your team.

(Keeping knowledge to yourself is like caging kittens. Image credit: Chameleon, source: Wikimedia Commons, License: GNU Free Documentation)

I worked with someone who had developed some in-depth knowledge of a particular technology. I needed his help with a project and I needed his in-depth knowledge. He wouldn’t share what he knew, claiming that the technology was highly complex and difficult to understand. He insisted that he had to do the work if it was done at all, and that I needed to tell his manager how valuable he was. I later heard from others in the organization that he’d taken the same approach and that they’d caved in to him. Some managers started to believe that the technology really was as complex as he said. Fortunately, I knew enough to get started without him. After some diligent Internet searching, I found what I needed and completed my project without his assistance. Unfortunately for my colleague, not too long after this, there were a couple of books published on the technology, which turned out to be much more straightforward than he claimed. His unique knowledge disappeared and his boast of enhanced value to the company evaporated within a few months. His career subsequently stalled; he was relying on his ring-fenced knowledge to give him an advantage and his prior behavior came back to haunt him.

Much later in my career, when I was older and wiser, I came across something similar. Someone two years out of college was working on a commercial tool we’ll call X. X had a sort-of programming language that enabled customization. The recent graduate claimed that only he could understand the language and only he could make the changes the company needed. This time, I didn’t even bother pursuing it. I had my team completely bypass him using a different technology. Had the recent graduate been more open, I would have gladly included him on my project and he would have been cross-trained in other technologies. Instead, he ended up leaving to go to a company that used X. The problem is, X has a very small market share (< 5%) and it’s shrinking.

Over the years, I’ve seen the same story play out a number of times. Someone has knowledge of a system (e.g. Salesforce, CRMs, cloud systems, BI, sales analytics, network cards) and claims the area is too complex or difficult for others to understand and that they need to be the point person. But it always turns out not to be true. The situation resolves itself by the person leaving, or a reorganization, or a technology change or something else - it always turns out that the person was never as vital as they claimed. Ring-fencing knowledge to protect your position seems to work in the short term, but it fails spectacularly in the long run.

Of course, there are skills that are difficult to acquire and do provide a barrier to entry into an area. Good examples are statistical analysis, machine learning, and real-time system design. What’s noticeable about all of these areas is the large amount of training content freely and easily available. If you want to learn statistics, there are hundreds of online courses and books you can use. The only impediment is your ability to understand and apply theory and practice.

As a manager, how do you know if someone is ring-fencing knowledge to protect their position versus the area actually being hard? Here are the signs they might be ring-fencing:

Claims that only they can understand the technology.
Knowledge hoarding and refusal/reluctance to share.
Refusal/reluctance to brief or train others - or doing it very badly.
No formal qualification in the area (not all areas have formal qualifications though) and no formal background (e.g. degree in software engineering).
Other groups in the company or outside the company not reporting the area is hard.

Here are signs the area is actually hard:

Other, similar groups in the company or elsewhere reporting the area is hard.
Online commentary that the technology is hard.
Lots of online content to help people learn.
Definite technical requirements, like the ability to understand number theory.
Obvious qualifications, e.g. network engineering certification.

From a management perspective, the best thing to do is stop knowledge-hoarding behavior before it starts. Ideally, there shouldn’t be a single point of failure on your team (in other words, you want a bus factor of more than 1). This means consciously focusing on cross-training (something the military does very well). If you inherit someone showing this behavior, you need to make cross-training a priority and personally intervene to make sure it’s done properly. Cross-training will involve some loss of status for the person, which you need to be sensitive to and manage well.

For some people, keeping skills and knowledge to themselves makes perfect sense; it’s a great way to enhance their value to their employer. For their colleagues, it’s not good behavior, and for their manager, it can be disastrous. For (almost) everyone’s sake, you should deal with it if it’s happening in your organization.

Wednesday, March 11, 2020

Benford's Law: finding fraud and data oddities

What links fraud detection, old-fashioned log tables, and error detection in data feeds? Benford’s Law provides the link and I'll show you what it is and how you might use it.

Imagine I gave you thousands of invoices and asked you to record the first digit of the amount. Out of, say, 10,000 invoices, how many would you expect to start with the number 1, how many with the number 2, and so on? Naively, you might expect about 1,111 to start with a 1, about 1,111 to start with a 2, and so on. But that’s not what happens in the real world. 1 occurs more often than 2, which occurs more often than 3, and so on.

The Benford’s Law story starts in 1881, when Simon Newcomb, an astronomer, was using some mathematical log tables. For those of you too young to know, these are tables of the logarithms of numbers, very useful in pre-calculator days. Newcomb noticed that the pages for logarithms beginning 1 were more well-thumbed than the other pages, indicating that people were looking for the logarithms of some numbers more than others. Being an academic, he published a paper on it.

In 1938, a physicist called Frank Benford looked at a number of datasets and found the same relationship between the first digits. For example, he looked at the first digit of addresses and found that 1 occurred more frequently than 2, which occurred more frequently than 3 and so on. He didn't just look at addresses; he also looked at the first digit of physical constants, the surface area of rivers, numbers in the Reader's Digest, and more. Despite Benford being the second person to discover this relationship, the law is named after him and not Newcomb.

It turns out, we can mathematically describe Benford’s Law as:

P(d) = log10(1 + 1/d)

where d is the first digit (1 to 9), log10 is the base-10 logarithm, and P(d) is the probability of d occurring as the first digit. If we plot it out we get:

This means that for some datasets we expect the first digit to be one 30.1% of the time, two 17.6% of the time, three 12.5% of the time, etc.
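To make the formula concrete, here's a short sketch that evaluates it for each first digit and scales it to the 10,000-invoice example from earlier. The function and variable names are my own choices for illustration.

```python
import math


def benford_first_digit(d):
    """Probability that the leading digit is d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / d)


probabilities = {d: benford_first_digit(d) for d in range(1, 10)}

for d, p in probabilities.items():
    # Expected number of invoices, out of 10,000, starting with this digit
    print(f"{d}: {p:.1%}  (about {round(10_000 * p):,} of 10,000)")
```

The nine probabilities sum to exactly one, because the sum telescopes to log10(10).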

The why of Benford’s Law is much too complex for this blog post. It was only recently (1998) proved by Hill [Hill] and involves digging into the central limit theorem and some very fundamental statistical and probability concepts.

Going back to my accounting example, it would seem all we have to do is plot the distribution for our invoice data and compare it to Benford’s Law. If there’s a difference, then there’s fraud. But the reality is, things are more complex than that.

Benford’s Law doesn’t apply everywhere; there are some conditions:

The data set must vary over several orders of magnitude (e.g. from 1 to 1,000)
The data set must have dimensions, or units. For example, Euros, or mm.
The mean is greater than the median and the skew is positive.
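Two of these conditions can be roughly checked in code before you trust a Benford comparison; the one about units can't, because it depends on what the numbers mean. This is only a sketch: the function name, the min_orders threshold, and the sample amounts are all my assumptions, and it expects positive values that aren't all identical.

```python
import math
import statistics


def benford_prechecks(values, min_orders=2):
    """Rough checks of the numeric conditions listed above.

    min_orders is an assumed threshold for "several orders of magnitude".
    """
    mean = statistics.fmean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    # Population skewness: the third standardized moment
    skew = sum((v - mean) ** 3 for v in values) / (len(values) * spread ** 3)
    orders = math.log10(max(values) / min(values))
    return {
        "orders_of_magnitude": orders,
        "spans_enough_orders": orders >= min_orders,
        "mean_above_median": mean > median,
        "positive_skew": skew > 0,
    }


# Placeholder invoice amounts, for illustration only
amounts = [12.5, 18.0, 40.0, 95.0, 130.0, 410.0, 980.0, 2300.0, 15000.0]
checks = benford_prechecks(amounts)
print(checks)
```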

Collins provides a nice overview of how it can be used to detect accounting fraud [Collins]. But Linville [Linville] has poked some practical holes in its use. He conducted an experiment using graduate students to create fake test invoices (this was a research exercise, not an attempt at fraud!) that were mixed in with simulated invoice data. He found that if the fake invoices were less than 10% or so of the total dataset, the deviations from Benford’s Law were too small to be reliably detected.

Benford’s Law actually applies to all digits, not just the first. We can plot out an expected distribution for two digits as I’ve shown below. This has also been used for fraud detection as you might expect.
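As a sketch of the two-digit version: the same formula works if you treat the first two digits as a single number from 10 to 99. The names below are my own, chosen for illustration.

```python
import math


def benford_first_two_digits(n):
    """Probability that the first two digits are n (10-99) under Benford's Law."""
    return math.log10(1 + 1 / n)


two_digit = {n: benford_first_two_digits(n) for n in range(10, 100)}

print(f"10: {two_digit[10]:.2%}")  # the most likely leading pair
print(f"99: {two_digit[99]:.2%}")  # the least likely leading pair

# Summing the pairs 10-19 recovers the first-digit probability for 1
p_first_digit_is_1 = sum(two_digit[n] for n in range(10, 20))
print(f"first digit 1: {p_first_digit_is_1:.1%}")
```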

You can use Benford's Law to detect errors in incoming data. Let's say you have a data feed of user addresses. You know the house numbers should obey Benford's Law, so you can work out the distribution the data actually has and compare it to the theoretical Benford's Law distribution. If the difference is above some threshold, you can set an alert. Bear in mind, it's not just addresses that follow the law; other properties of a data feed may too. A deviation from Benford's Law doesn't tell you which particular items are wrong, but you do get a clue about which category; for example, you might discover items starting with a 2 are too frequent. This is a special case of using the deviation of real data from an expected distribution as an error detection mechanism - a very useful data quality assurance method everyone should be using.
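Here's a minimal sketch of that kind of check. It measures the mean absolute deviation between the observed first-digit frequencies and Benford's Law, and flags the feed if the deviation is too big. The threshold value, the function name, and both synthetic feeds are assumptions for illustration; a real feed would need its own threshold and handling for non-numeric house numbers.

```python
import math
from collections import Counter

ALERT_THRESHOLD = 0.015  # assumed cut-off; tune it for your own feed


def first_digit_deviation(house_numbers):
    """Compare observed first-digit frequencies with Benford's Law.

    Returns the mean absolute deviation and the per-digit differences
    (observed minus expected).
    """
    digits = [int(str(n)[0]) for n in house_numbers if n > 0]
    counts = Counter(digits)
    differences = {}
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts[d] / len(digits)
        differences[d] = observed - expected
    mad = sum(abs(diff) for diff in differences.values()) / 9
    return mad, differences


# Synthetic feeds for illustration only (assumes positive integers).
healthy_feed = [int(10 ** (k / 1000)) for k in range(3000)]  # log-uniform, 1 to 999
broken_feed = list(range(200, 300))                          # every number starts with 2

for name, feed in [("healthy", healthy_feed), ("broken", broken_feed)]:
    mad, differences = first_digit_deviation(feed)
    worst_digit = max(differences, key=lambda d: abs(differences[d]))
    status = "ALERT" if mad > ALERT_THRESHOLD else "ok"
    print(f"{name}: MAD={mad:.3f} {status}, largest deviation at digit {worst_digit}")
```

The per-digit differences are what give you the clue about which category is off; here the broken feed points straight at the digit 2.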

To truly understand Benford’s Law, you’ll need to dig deeply into statistics and possibly number theory, but using it is relatively straightforward. You should be aware it exists and know its limitations - especially if you’re looking for fraud.

References

[Collins] J. Carlton Collins, “Using Excel and Benford’s Law to detect fraud”, https://www.journalofaccountancy.com/issues/2017/apr/excel-and-benfords-law-to-detect-fraud.html
[Hill] T. P. Hill, “The First Digit Phenomenon”, Amer. Sci. 86, 358-363, 1998.
[Linville] Mark Linville, “The Problem Of False Negative Results In The Use Of Digit Analysis”, The Journal of Applied Business Research, Volume 24, Number 1

Tuesday, March 3, 2020

Cheating charts: the axes of evil

As you might have guessed from the title, this post is all about how you can play around with chart axes to lie like truth. It's about being evil with axes.

In the Harry Potter books, the children are taught 'Defence Against the Dark Arts' not to teach them how to be evil, but rather to teach them how to defend against evil. I'm using the same approach here; this blog post is about defending yourself against being misleading or being misled. I'm going to show you ways that people have used chart axes to obscure the truth. But we need to be careful with blame; sometimes, charts are unintentionally deceitful, the author miscommunicated rather than set out to misinform, and sometimes it's a matter of opinion. Read what I have to say and decide for yourself.

(2x2 matrix - an example of evil axes)

Zero axis

In most cases, charts should include zero so as not to mislead about the size of an effect. Let's take house prices in London as our example. UK inflation (CPI) was 1.8% for the twelve months from January 2019 to January 2020. Over the same period, London house prices increased 2.8% - not a bad increase, but we can make it look much larger.

Let's start with an honest chart.

It clearly shows a small increase, but it would be hard to get a newspaper headline from it. Imagine you were a newspaper editor and you needed to squeeze a sensationalist story from the data. You need to make the difference appear much bigger, but still have a fig leaf of decency. How can you do it? The simplest way is excluding zero and zooming in.

Imagine that we coupled it with a headline like, 'London Property Market Booms' and had an article with examples of extremely expensive houses and some anecdotes of house buying. If you just glanced at the chart and read the story, you might think the market was growing explosively. This trick works even better if you make the axes text small, reduce their contrast with the background color, or even remove them altogether.

If you're trying to be honest, most of the time, you should include zero to truly scale the effect and not mislead. But there are exceptions. Sometimes you do want to exclude zero, as in the example below.

I have some data on human body temperature over the course of a day, taken from Wikipedia. Here's a chart including zero (as in 0 degrees Celsius).

There really doesn't seem to be much variation, does there? It looks like the human body temperature stays more or less constant during the day. In fact, the data looks just like noise. I could flatten the chart further by using kelvins or even showing a Fahrenheit scale starting from zero.

When we zoom in and exclude zero, a clearer picture emerges.

Plainly, human body temperature does change during the day. Given the fact that a few degrees difference in body temperature can make the difference between someone who's fine and someone who's in medical danger, the second chart is a better, more honest, and more useful representation.

If you want to cheat and misrepresent, here's what you should do:

If you want to exaggerate a small difference, don't include zero and zoom into your chart to expand the difference.
If you want to suppress a difference, include zero and choose units that minimize the difference.

If you want to be honest:

Include zero by default.
Exclude zero when you're looking at small changes and the changes matter, so the chart focuses on the change.
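To put a number on the zero-axis trick from this section, here's a small sketch that works out how much taller the second bar looks than the first for a given axis starting point. It uses the 2.8% London increase expressed as index values; the function name and the truncated axis minimum of 99 are my assumptions.

```python
def apparent_ratio(start_value, end_value, axis_min=0.0):
    """Ratio of the drawn bar heights when the axis starts at axis_min."""
    return (end_value - axis_min) / (start_value - axis_min)


start, end = 100.0, 102.8  # index values: a 2.8% rise, as in the London example

honest = apparent_ratio(start, end, axis_min=0.0)
zoomed = apparent_ratio(start, end, axis_min=99.0)  # assumed truncated axis

print(f"axis from 0:  second bar is {honest:.3f}x the first")
print(f"axis from 99: second bar is {zoomed:.1f}x the first")
```

The data hasn't changed, but on the truncated axis a 2.8% rise is drawn as a bar nearly four times as tall.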

Extending the axis

This is a really fun way to mislead people and it's something I've only seen recently. You can extend the perception of the axis to reduce the effect. Let's use the same election example I used in my blog post on pie (lie) charts. Imagine there are four parties standing in an election and you have a record of what percentage of the vote each candidate and party received. Here's an honest bar chart showing the results.

Plainly, the Bird party did very badly (15% of the vote). Now let's see if we can minimize the scale of their defeat by redrawing the chart in a deceitful way. Let's remove the x-axis, extend the y-axis labels, color and box the labels, and introduce some bar coloring.

It's still obviously a defeat, but we've made it look much smaller. If you take the time to look, it's obvious that something funky is going on here, but most people don't have time and don't look closely.

If you want to be honest, don't play around with axis labels and colors.

Unequal steps

If you want to imply things are getting worse, or better, when they're not, then a good option is to use unequal axis scaling. Most viewers expect that an axis will scale consistently; for example, an axis might be labeled 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. More sophisticated viewers might be very comfortable with log scales, for example, axis labels 1, 10, 100, 1000. Almost no one can interpret what unequal scaling means, which makes it great for evil. To make your deception even better, use a line chart (which implies continuity) rather than a bar chart (which implies category).

Let's take an example that appeared in the media, US gas prices in 2012. The AAA produces a daily set of gas price data. This has today's price, yesterday's price, last week's price, last month's price, and last year's price. It's not the greatest presentation of data and it's hard to pick out trends, but at least the data exists - and more importantly, they don't chart it. In 2012, a US media outlet (who shall remain nameless) took the data and ran a story on gas price increases under Obama. Here's my version of their chart.

At a quick glance, it looks like there was a massive increase. But was there? The periods on the x-axis aren't equal and they've used a line to indicate a continuous variable. The AAA data quotes last month's number, but that isn't shown here. Why? The y-axis starts at $2.80, which is an odd choice; more rational choices might have been $3.00 or $0. If you take the time to look at the chart, it's really hard to draw any conclusions, but most people don't have the time and will just conclude 'gas prices up under Obama'.

If you really want to mislead, use unequal scaling and a line chart.
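If you want to defend against this, one rough sketch is to check whether the values behind the axis labels are evenly spaced on a linear or log scale. The function names are mine, and the last example uses assumed day offsets for labels like "last year", "last week", and "current" rather than the values from the actual chart.

```python
import math


def is_evenly_spaced(ticks, rel_tol=1e-9):
    """True if consecutive tick values are the same distance apart."""
    if len(ticks) < 3:
        return True
    steps = [b - a for a, b in zip(ticks, ticks[1:])]
    return all(math.isclose(s, steps[0], rel_tol=rel_tol) for s in steps)


def is_log_spaced(ticks, rel_tol=1e-9):
    """True if tick values are evenly spaced on a base-10 log scale."""
    if any(t <= 0 for t in ticks):
        return False
    return is_evenly_spaced([math.log10(t) for t in ticks], rel_tol)


print(is_evenly_spaced([1, 2, 3, 4, 5]))   # True: a consistent linear axis
print(is_log_spaced([1, 10, 100, 1000]))   # True: a consistent log axis

# Assumed offsets in days for "last year", "last week", "current"
suspicious = [-365, -7, 0]
print(is_evenly_spaced(suspicious), is_log_spaced(suspicious))  # False False
```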

Scale inversion

If you really, really want to mislead, choose a scale inversion.

I'm going to show you one of the most controversial charts of the last ten years. The author has vigorously defended their work, and after reading their comments, I understand that they had no intention to deceive. Because I don't wish to make the author's life more difficult, I'm not going to name them or give you their employer's name.

The chart below shows homicides in Florida and what happened when the 'Stand Your Ground' Law was enacted. Before reading on, how would you interpret the chart?

Almost everyone I've spoken to interprets the chart as implying that homicides went down. But look at the y axis. It's inverted. Here's how the plot would look if the author had chosen normal scaling.

This conveys a hugely different message.

The author wasn't trying to mislead here, rather they were trying to use art to make a more emotionally informative representation of the data. You can judge for yourself whether they succeeded or not. This raises the more general topic of who is visualizing data and how it's done.

In the last few years, there's been a tremendous rise in the use of infographics for all kinds of topics. These tend to be more poster art than information sharing, which leads us to a problem. In the information world, a large number of informal practices have grown up around how to display data in a truthful way. Infographics are sometimes created by people familiar with these practices, but sometimes not. When designers start using artistic interpretation to make data more impactful, we can get distortions that unintentionally mislead people. Personally, I think infographics are little more than visual fluff.

Getting back to where I started in this section, scale inversion is a wonderful way of reversing the evidence.

Log plots

This isn't so much deceit as obfuscation or confusion.

A logarithmic scale is one that varies logarithmically, so instead of an axis increasing like 1, 2, 3, 4, 5, it increases like 1, 10, 100, 1000, 10000. Logarithmic scales are used when data varies by orders of magnitude.
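A short sketch of what a log axis does to the numbers: each value is plotted at its base-10 logarithm, so equal ratios become equal distances. The case counts below are a made-up series that doubles each day, purely for illustration.

```python
import math

# Where the labels 1, 10, 100, 1000, 10000 sit on a log axis
axis_labels = [1, 10, 100, 1000, 10000]
axis_positions = [math.log10(v) for v in axis_labels]
print(axis_positions)  # evenly spaced: 0, 1, 2, 3, 4

# Placeholder series that doubles every day (NOT real case data)
cases = [100 * 2 ** day for day in range(6)]
log_positions = [math.log10(c) for c in cases]
steps = [b - a for a, b in zip(log_positions, log_positions[1:])]

# Every step is log10(2), so the doubling series plots as a straight line
print([round(s, 4) for s in steps])
```

This is the property that makes log plots useful for exponential growth, and also the property that many viewers aren't expecting.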

Unfortunately, many viewers aren't familiar with the idea and it can be hard to interpret, a good example being the recent coronavirus chart in a New York Times article. Here's the chart:

(Image credit: New York Times, copyright New York Times)

The logarithmic axis is the y axis. What conclusions would you draw about the coronavirus from this chart? I've used log plots for years and I struggled to understand what this chart means.

2x2 charts

2x2 charts are a special case of confusing axes. Unfortunately, they're beloved of MBA courses and books on management and marketing. Let's take the classic BCG product matrix as an example. In the 1960s, the consulting company BCG came up with a way for companies to view their product portfolio and make more rational product investment decisions. They recommended plotting market share on the x-axis and growth on the y-axis, and dividing the plot into four quadrants, each with a name (you can read more about it here). Here's a representation of their matrix.

Note that although the axes are marked, there's no scale and it's not clear where the quadrant lines are drawn. In practice, companies using this methodology may well draw scales, but in almost all cases you find on the internet, there are no scales.

The BCG matrix is just one of a large number of 2x2 matrices you can find out there. Very few of them have any kind of scale, so it's very hard to understand and interpret what they mean in practice. Bear in mind that they often imply quite different management choices for different chart quadrants, but who's in what quadrant may depend on exactly where the quadrant boundaries are drawn, and that's almost never made clear. It's really tempting to say that you need to employ consultants to tell you what they mean and to interpret the charts for you.
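Here's a sketch of why the missing boundaries matter. The same product lands in different quadrants depending on where the lines are drawn. The quadrant names are the usual BCG ones; the product's numbers and both sets of cut-offs are hypothetical values I've picked for illustration.

```python
def bcg_quadrant(share, growth, share_cut, growth_cut):
    """Classify a product given explicit quadrant boundaries."""
    high_share = share >= share_cut
    high_growth = growth >= growth_cut
    if high_share and high_growth:
        return "Star"
    if high_share:
        return "Cash cow"
    if high_growth:
        return "Question mark"
    return "Dog"


# Hypothetical product: relative market share 0.9, market growth 8%
share, growth = 0.9, 0.08

# Two equally plausible-looking sets of boundaries (both assumed)
strict = bcg_quadrant(share, growth, share_cut=1.0, growth_cut=0.10)
lenient = bcg_quadrant(share, growth, share_cut=0.5, growth_cut=0.05)

print(strict)   # Dog
print(lenient)  # Star
```

Same product, same data, opposite management advice, which is why a 2x2 chart without scales is so hard to act on.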

I'm not a fan of 2x2 matrices because I find that they confuse rather than enlighten, but if you want to produce a chart that looks pretty and requires you to interpret it for your management, a 2x2 matrix might well be the place to go.

You can fool all the people some of the time and some of the people all the time

If you know what you're looking for, you can see through deceit or malpractice with some effort. But if you're in a hurry, not paying attention, or a chart is flashed on the screen for a short period of time, a chart with evil axes will probably slip by your defenses against the dark arts.

In many ways, playing around with chart axes is one of the easiest ways to mislead people. I've shown you how people have been evil with axes in the hope that you'll be truthful and honest in your own visualizations.

I'd love to hear what you think about the 'axes of evil'. Have you come across other axis manipulations that I haven't included here?

Engora Data Blog
